Building Robust Object-Oriented Python Applications and Libraries/

...

Strings Are Unicode

Learn about encoding, decoding, and the mutability of byte strings.

We'll cover the following...

Decoding bytes to text
Encoding text to bytes
Mutable byte strings

At the beginning of this section, we defined strings as immutable collections of Unicode characters. This actually makes things very complicated at times, because Unicode isn’t a storage format. If we get a string of bytes from a file or a socket, for example, they won’t be in Unicode. They will, in fact, be the built-in type bytes. Bytes are the basic storage format in computing. They represent 8 bits, usually described as an integer between 0 and 255, or a hexadecimal equivalent between 0x00 and 0xFF. Bytes don’t represent anything specific; a sequence of bytes may store characters of an encoded string, or pixels in an image, or represent an integer, or part of a floating-point value.

If we print a bytes object, Python uses a canonical display that’s reasonably compact. Any of the individual byte values that map to ASCII characters are displayed as characters, while non-character ASCII bytes are printed as escapes, either a one- character escape like \n or a hex code like \x1b. We may find it odd that a byte, represented as an integer, can map to an ASCII character. But the old ASCII code defined Latin letters for many different byte values. In ASCII the character a is represented by the same byte as the integer 97, which is the hexadecimal number 0x61. All of these are an interpretation of the binary pattern 0b1100001.

Press + to interact

The first byte used a hexadecimal escape, \x89. The next three bytes had ASCII characters, P, N, and G. The next two characters had one-character escapes, \r and \n. The seventh byte also had a hexadecimal escape, \x1a, because there was no other encoding. The final byte is another one-character escape, \n. The eight bytes were expanded into 17 printable characters, not counting the prefix b' and the final '.

Many I/O operations only know how to deal with bytes, even if the bytes object is the encoding of textual data. It is therefore vital to know how to convert between bytes values and Unicode str values.

The problem is that there are many encodings that map bytes to Unicode text. Several are true international standards, but many others are parts of commercial offerings, making them really popular, ...

Object-Oriented Design

Objects in Python

When Objects Are Alike

Expecting the Unexpected

When to Use Object-Oriented Programming

Abstract Base Classes and Operator Overloading

Python Data Structures

Object-Oriented and Functional Programming Intersection

Strings, Serialization, and File Paths

The Iterator Pattern

Common Design Patterns

Advanced Design Patterns

Testing Object-Oriented Programs

Concurrency

Conclusion

Build a Python Airline Reservation System

Strings Are Unicode