Strings Are Unicode
Learn about encoding, decoding, and the mutability of byte strings.
At the beginning of this section, we defined strings as immutable collections of Unicode characters. This makes things complicated at times, because Unicode isn’t a storage format. If we get a string of bytes from a file or a socket, for example, they won’t be in Unicode. They will, in fact, be the built-in type bytes. Bytes are the basic storage format in computing. Each byte holds 8 bits, usually described as an integer between 0 and 255, or a hexadecimal equivalent between 0x00 and 0xFF. Bytes don’t represent anything specific; a sequence of bytes may store characters of an encoded string, or pixels in an image, or represent an integer, or part of a floating-point value.
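To make the point that bytes carry no built-in meaning, here is a small sketch that reads the same four bytes in three different ways. The byte values and the variable names are chosen purely for illustration; the big-endian byte order is likewise an assumption of this example.

```python
import struct

# Four arbitrary bytes, chosen for this illustration only.
raw = bytes([0x42, 0x28, 0x00, 0x00])

# Interpretation 1: a sequence of small integers (0 to 255).
as_numbers = list(raw)

# Interpretation 2: one big-endian unsigned integer.
as_int = int.from_bytes(raw, "big")

# Interpretation 3: one big-endian 32-bit floating-point value.
as_float = struct.unpack(">f", raw)[0]

print(as_numbers)  # [66, 40, 0, 0]
print(as_int)      # 1109917696
print(as_float)    # 42.0
```

Nothing in the bytes object says which reading is the right one; that knowledge has to come from the file format or protocol the bytes belong to.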
If we print a bytes object, Python uses a canonical display that’s reasonably compact. Any individual byte values that map to printable ASCII characters are displayed as characters, while all other bytes are printed as escapes, either a one-character escape like \n or a hex code like \x1b. We may find it odd that a byte, represented as an integer, can map to an ASCII character. But the old ASCII code assigned Latin letters to many different byte values. In ASCII, the character a is represented by the same byte as the integer 97, which is the hexadecimal number 0x61. All of these are interpretations of the binary pattern 0b1100001.
print(list(map(hex, b'abc')))
print(list(map(bin, b'abc')))
Here’s how the canonical display of bytes might look when they have a mixture of values that have ASCII character representations and values that don’t have a simple character:
byte_data = bytes([137, 80, 78, 71, 13, 10, 26, 10])
print(byte_data)
The first byte used a hexadecimal escape, \x89. The next three bytes had ASCII characters, P, N, and G. The next two bytes had one-character escapes, \r and \n. The seventh byte also had a hexadecimal escape, \x1a, because it has no one-character escape. The final byte is another one-character escape, \n. The eight bytes were expanded into 17 printable characters, not counting the prefix b' and the final '.
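The character count above can be checked with a short sketch. It uses the built-in repr() to get the same canonical display that print() shows; the names display and inner are introduced here only for illustration.

```python
byte_data = bytes([137, 80, 78, 71, 13, 10, 26, 10])

# repr() returns the canonical display as an ordinary str.
display = repr(byte_data)

# Strip the leading b' and the trailing ' to count only the body.
inner = display[2:-1]

print(display)         # b'\x89PNG\r\n\x1a\n'
print(len(byte_data))  # 8
print(len(inner))      # 17
```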
Many I/O operations only know how to deal with bytes, even if the bytes object is the encoding of textual data. It is therefore vital to know how to convert between bytes values and Unicode str values.
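As a minimal sketch of that conversion, the example below round-trips a short string through str.encode() and bytes.decode(). UTF-8 is assumed here only because it is a common choice; the text has not yet settled on an encoding, and the sample string is illustrative.

```python
text = "café"

# str -> bytes: choose an encoding explicitly (UTF-8 assumed here).
encoded = text.encode("utf-8")

# bytes -> str: decode with the same encoding that produced the bytes.
decoded = encoded.decode("utf-8")

print(encoded)          # b'caf\xc3\xa9'
print(len(text))        # 4 characters
print(len(encoded))     # 5 bytes, because é takes two bytes in UTF-8
print(decoded == text)  # True
```

The character count and the byte count differ, which is one reason the two types can't be treated as interchangeable.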
The problem is that there are many encodings that map bytes to Unicode text. Several are true international standards, but many others are parts of commercial offerings, making them really popular, ...