Reading From Text Files

We'll cover the following...

Character encoding rears its ugly head
Stream objects
Reading data from a text file
Closing files
Closing files automatically
Reading data one line at a time

Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'chinese.txt'. There are five interesting things about this filename:

It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backslashes to separate directories, while macOS and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to what, you might ask? Patience, grasshopper.
It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-ASCII pathnames.
It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it.

But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.

Character encoding rears its ugly head

Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
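The round trip between characters and bytes can be tried without touching a file at all. The following is a small sketch; the sample string is an assumption chosen for illustration. It encodes a string to bytes, decodes it back, and shows what happens when the bytes are decoded with the wrong encoding.

```python
# A string is a sequence of Unicode characters.
a_string = '深入 Python'

# Encoding turns the characters into bytes, which is what a file stores.
raw = a_string.encode('utf-8')
print(len(a_string), len(raw))   # 9 characters, 13 bytes

# Decoding with the same encoding recovers the original string.
round_trip = raw.decode('utf-8')
print(round_trip == a_string)    # True

# Decoding with the wrong encoding either fails outright...
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError as err:
    ascii_failed = True
    print('ascii failed:', err)

# ...or silently produces the wrong characters.
garbled = raw.decode('latin-1')
print(garbled == a_string)       # False
```

The same bytes yield three different outcomes depending only on which encoding is assumed, which is why the encoding has to be known before a text file can be read correctly.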

# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
file = open('chinese.txt')
a_string = file.read()
# Traceback (most recent call last):
#   File "/usercode/__ed_file.py", line 4, in <module>
#     a_string = file.read()
#   File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
#     return codecs.ascii_decode(input, self.errors)[0]
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 17: ordinal not in range(128)
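One way to avoid this kind of failure is to pass the encoding argument explicitly. The sketch below assumes the file was saved as UTF-8 and uses made-up contents in a temporary file standing in for chinese.txt. It prints the locale-dependent encoding that open() typically falls back to when none is given, reproduces the error by forcing the ASCII codec, and then reads the file successfully with encoding='utf-8'.

```python
import locale
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # A stand-in for chinese.txt; the contents are invented for this sketch.
    path = os.path.join(tmp_dir, 'chinese.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write('Dive Into Python: 深入 Python\n')

    # The platform-dependent default; it varies between machines.
    print(locale.getpreferredencoding(False))

    # Forcing the ASCII codec triggers the same kind of error as the traceback.
    try:
        with open(path, encoding='ascii') as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Naming the encoding the file was written in makes the read reliable.
    with open(path, encoding='utf-8') as f:
        a_string = f.read()

print(ascii_failed)   # True
print(a_string)
```

Because the default differs from one platform to another, code that relies on it may work on one machine and fail on the next; naming the encoding removes that dependency.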
