Introduction to Strings

Get an overview of strings, their encoding to bytes, and decoding back to strings in Python.

Overview

Before we get involved with higher-level design patterns, let’s take a deep dive into one of Python’s most common objects: the string. We’ll see that there is a lot more to the string than meets the eye, and we’ll also cover searching strings for patterns and serializing data for storage or transmission.

All of these topics are elements of making objects persistent. Our application can create objects in files for use at a later time. We often take persistence (the ability to write data to a file and retrieve it at an arbitrary later date) for granted. Because persistence happens via files, at the byte level, via OS writes and reads, it leads to two transformations: data we have stored must be decoded into a nice, useful object or collection of objects in memory; objects from memory need to be encoded to some kind of clunky text or bytes format for storage, transfer over the network, or remote invocation on a distant server.
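As a rough illustration of these two transformations, the sketch below encodes an in-memory object to bytes in a file and then decodes it back. It assumes JSON as the text format and UTF-8 as the byte encoding; the `record` dictionary and the `record.json` file name are hypothetical and chosen only for this example.

```python
import json
import tempfile
from pathlib import Path

# A hypothetical in-memory object we want to persist.
record = {"name": "Zo\u00eb", "scores": [1, 2, 3]}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "record.json"  # illustrative file name

    # Encode: object -> JSON text -> UTF-8 bytes on disk.
    path.write_bytes(json.dumps(record).encode("utf-8"))

    # Decode: bytes from disk -> JSON text -> object in memory.
    restored = json.loads(path.read_bytes().decode("utf-8"))

print(restored == record)
```

The restored dictionary is a new object that compares equal to the original. Other formats would follow the same encode-then-decode shape.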

In this chapter, we’ll look at the following topics:

  • The complexities of strings, bytes, and byte arrays
  • The ins and outs of string formatting
  • The mysterious regular expression
  • How to use the pathlib module to manage the filesystem
  • A few ways to serialize data, including Pickle and JSON

This chapter will extend the case study to examine how best to work with collections of data files. We’ll look at another serialization format, CSV, in the case study. This will help us explore alternative representations for the training and testing data. We’ll start by looking at Python strings. They do so much, and it’s easy to overlook the wealth of available features.

Strings

Strings are a basic primitive in Python; we’ve used them in nearly every example we’ve discussed so far. All they do is represent an immutable sequence of characters. However, though we may not have considered it before, character is a bit of an ambiguous word; can Python strings represent sequences of accented characters? Chinese characters? What about Greek, Cyrillic, or Farsi?

In Python 3, the answer is yes. Python strings are all represented in Unicode, a character definition standard that can represent virtually any character in any language on the planet (and some made-up languages and random characters as well). This is done seamlessly. So, let’s think of Python 3 strings as an immutable sequence of Unicode characters. We’ve touched on many of the ways strings can be manipulated in previous examples, but let’s quickly cover it all in one place: a crash course in string theory!
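A small sketch can make both properties concrete: strings hold characters from many scripts, and they can't be changed in place. The sample words and variable names here are illustrative choices, not taken from the chapter.

```python
# Strings can hold characters from many scripts, not just ASCII.
samples = ["caf\u00e9", "中文", "Ελληνικά", "Кириллица", "فارسی"]

for text in samples:
    # len() counts Unicode characters (code points), not bytes.
    print(text, len(text))

greeting = "caf\u00e9"
try:
    greeting[0] = "C"  # strings are immutable, so item assignment fails
except TypeError as error:
    immutable_error = str(error)
    print(immutable_error)

# "Changing" a string really means building a new one.
capitalized = greeting.capitalize()
print(greeting, capitalized)
```

The original `greeting` is untouched after `capitalize()`. String methods return new strings rather than modifying the existing one.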

It’s very important to step away from the older encodings we used to know and love. The ASCII encoding, for example, was limited to one byte per character. Unicode has several ways to encode a character into bytes. The most popular, called UTF-8, matches the old ASCII encoding for the ASCII characters (the unaccented Latin letters, digits, and common punctuation), so each of those takes exactly one byte. But, if we need one of the thousands of other Unicode characters, UTF-8 uses two to four bytes per character.
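To see how the byte count varies, the following sketch encodes a few characters with UTF-8 and records how many bytes each one needs. The four sample characters are arbitrary picks for illustration.

```python
# Sample characters: an ASCII letter, an accented letter,
# the euro sign, and an emoji.
chars = ["a", "\u00e9", "\u20ac", "\U0001F600"]

byte_counts = {}
for char in chars:
    encoded = char.encode("utf-8")
    byte_counts[char] = len(encoded)
    print(repr(char), len(encoded), encoded)

# For ASCII characters, UTF-8 and ASCII produce identical bytes.
same_as_ascii = "a".encode("utf-8") == "a".encode("ascii")
print(same_as_ascii)
```

The length of a string in characters and its length in bytes can differ, which is why the two types are best kept distinct.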

The important rule is this: we encode our characters to create bytes; we decode bytes to recover the characters. The two are separated by a high fence with a gate labeled encode on one side and decode on the other. We can visualize it like this:
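In code, the round trip through that gate might look like the minimal sketch below. It assumes UTF-8 as the encoding; the sample text and variable names are illustrative only.

```python
text = "caf\u00e9"

# Encode: characters -> bytes.
data = text.encode("utf-8")
print(type(data).__name__, data)

# Decode: bytes -> characters, using the same encoding.
recovered = data.decode("utf-8")
print(recovered == text)

# Decoding with the wrong encoding either fails outright...
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ...or silently produces the wrong characters.
garbled = data.decode("latin-1")
print(ascii_failed, garbled)
```

Bytes carry no record of how they were encoded, so the decoding side has to know, or be told, which encoding was used.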
