Strings Are Unicode

Learn about encoding, decoding, and the mutability of byte strings.

At the beginning of this section, we defined strings as immutable collections of Unicode characters. This can complicate things at times, because Unicode isn’t a storage format. If we get a string of bytes from a file or a socket, for example, it won’t be in Unicode. It will, in fact, be an instance of the built-in type bytes. Bytes are the basic storage format in computing. Each byte holds 8 bits, usually described as an integer between 0 and 255, or as the hexadecimal equivalent between 0x00 and 0xFF. Bytes don’t represent anything specific; a sequence of bytes may store the characters of an encoded string, the pixels in an image, an integer, or part of a floating-point value.
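As a minimal sketch of this idea, the snippet below builds a bytes object by hand and reads the same data in several ways. The payload and the variable names are made up for illustration; real data would come from a file or a socket.

```python
# A hypothetical payload, standing in for bytes read from a file or socket.
raw = bytes([72, 101, 108, 108, 111])

# Indexing a bytes object yields an integer between 0 and 255.
first = raw[0]

# The same bytes viewed as a list of integers.
as_ints = list(raw)

# The same bytes interpreted as ASCII-encoded text.
text = raw.decode("ascii")

# Two other bytes interpreted as a big-endian integer instead of text.
as_int = int.from_bytes(b"\x01\x00", "big")

print(type(raw).__name__, first, as_ints, text, as_int)
```

The bytes themselves never change; only the interpretation we choose (integers, text, a single number) does.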

If we print a bytes object, Python uses a canonical display that’s reasonably compact. Any individual byte values that map to printable ASCII characters are displayed as those characters, while all other byte values are printed as escapes, either a one-character escape like \n or a hex code like \x1b. We may find it odd that a byte, represented as an integer, can map to an ASCII character. But the old ASCII code assigned Latin letters to many different byte values. In ASCII, the character a is represented by the same byte as the integer 97, which is the hexadecimal number 0x61. All of these are interpretations of the binary pattern 0b1100001.
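The display rules above can be checked with a short sketch. The byte values here are chosen only to exercise each case: a printable character, a one-character escape, and two hex escapes.

```python
# 97 is the letter a, 10 is a newline, 27 is the escape control code,
# and 255 lies outside the ASCII range entirely.
data = bytes([97, 10, 27, 255])

# repr() gives the same canonical display that print() uses for bytes.
shown = repr(data)
print(shown)

# The integer 97, hex 0x61, binary 0b1100001, and the character a
# are all readings of the same bit pattern.
same = 97 == 0x61 == 0b1100001 == ord("a")
print(same)
```

Here the first byte is shown as the character a, the newline as \n, and the remaining two bytes as the hex escapes \x1b and \xff.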
