Reading From Text Files

Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:

a_file = open('examples/chinese.txt', encoding='utf-8')

Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. There are five interesting things about this filename:

  • It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
  • The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backslashes to denote subdirectories, while macOS and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
  • The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to what, you might ask? Patience, grasshopper.
  • It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-ASCII pathnames.
  • It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it.

But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.

Character encoding rears its ugly head

Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
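The bytes-versus-characters distinction can be seen without any file at all. The sketch below is a minimal illustration: the byte sequence is the UTF-8 encoding of the two characters '中文', chosen here as an example rather than taken from chinese.txt.

```python
# Six bytes: the UTF-8 encoding of the two characters '中文'.
raw_bytes = b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding turns a sequence of bytes into a sequence of characters.
a_string = raw_bytes.decode('utf-8')
print(len(raw_bytes))   # 6
print(len(a_string))    # 2

# The same bytes are not valid ASCII, so decoding with that codec fails.
try:
    raw_bytes.decode('ascii')
except UnicodeDecodeError as error:
    print(error)
```

The bytes never change; only the encoding used to interpret them decides whether you get two characters or an error.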

# This traceback came from a Unix-like system whose default encoding
# is ASCII. Other platforms may behave differently, as outlined below.
file = open('examples/chinese.txt')
a_string = file.read()
# Traceback (most recent call last):
#   File "/usercode/__ed_file.py", line 4, in <module>
#     a_string = file.read()
#   File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
#     return codecs.ascii_decode(input, self.errors)[0]
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 17: ordinal not in range(128)

What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look ...
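One way to check the default on your own machine is to ask the locale module. This is a hedged sketch, not necessarily the approach the text goes on to describe: the variable name is hypothetical, and the result depends on your platform, Python version, and configuration, so no expected output is shown.

```python
import locale

# The locale-dependent encoding Python reports as preferred for text data.
# Passing False asks for the value without changing the process locale.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Whatever this prints, passing encoding explicitly to open() removes the dependence on it.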
