Reading From Text Files
Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:
a_file = open('chinese.txt', encoding='utf-8')
Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'chinese.txt'. There are five interesting things about this filename:
- It’s not just the name of a file; it can be a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments (a directory path and a filename), but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
- A directory path, when there is one, uses forward slashes, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
- The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to what, you might ask? Patience, grasshopper.
- It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-ASCII pathnames.
- It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it.
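To make a few of those points concrete, here is a minimal sketch. It assumes a throwaway temporary directory, and the examples subdirectory and the non-ASCII filename are made up for illustration; neither comes from the example above.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch never touches real files.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, 'examples'))

    # One string carries both the directory path and the filename.
    # Forward slashes are used here even if this runs on Windows.
    # The non-ASCII filename is hypothetical.
    path = base + '/examples/深入.txt'

    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('Dive Into Python\n')

    with open(path, encoding='utf-8') as a_file:
        contents = a_file.read()

print(contents)

# A path that starts at the root (or a drive letter) is absolute;
# one that does not is relative.
print(os.path.isabs(path))
print(os.path.isabs('examples/深入.txt'))
```

The with statements close each file automatically, and the temporary directory is removed once the outer block ends.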
But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.
Character encoding rears its ugly head
Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
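The difference between bytes and characters is easy to see without touching a file at all. The following is a small sketch; the sample text is an assumption chosen for illustration, not the contents of chinese.txt.

```python
text = '深入 Python'          # a string: a sequence of Unicode characters
raw = text.encode('utf-8')   # the same text as a sequence of bytes

print(len(text))   # number of characters
print(len(raw))    # number of bytes; each Chinese character takes 3 in UTF-8

# Decoding with the encoding used to produce the bytes recovers the string.
assert raw.decode('utf-8') == text

# Decoding with the wrong encoding can fail outright...
try:
    raw.decode('ascii')
except UnicodeDecodeError as error:
    print('ascii failed:', error.reason)

# ...or appear to succeed while silently producing the wrong characters.
garbled = raw.decode('latin-1')
print(garbled == text)
```

The bytes alone do not say which encoding produced them, which is why the reader has to know (or be told) the right one.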
# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
file = open('chinese.txt')
a_string = file.read()
#Traceback (most recent call last):
#  File "/usercode/__ed_file.py", line 4, in <module>
#    a_string = file.read()
#  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
#    return codecs.ascii_decode(input, self.errors)[0]
#UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 17: ordinal not in range(128)
What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look ...