Unicode and Strings
Learn about the support for Unicode in Perl.
Unicode is a system used to represent the characters of the world’s written languages. Most English text fits in ASCII, a character set of only 128 characters (which requires 7 bits of storage and fits nicely into 8-bit bytes), but it’s naïve to believe that we won’t someday need an umlaut.
Perl strings
Perl strings can represent either of two separate but related data types:
Sequences of Unicode characters
Each character has a codepoint, a unique number that identifies it in the Unicode character set.
Sequences of octets
Binary data is a sequence of octets: 8-bit numbers, each of which holds a value between 0 and 255.
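As a minimal sketch of the two kinds of string, the snippet below builds one of each using only Perl builtins (`chr`, `ord`, `pack`, `unpack`). The variable names and the sample values are illustrative assumptions, not taken from the text.

```perl
use strict;
use warnings;

# A character string: built from codepoints, one per character.
# The three codepoints are arbitrary examples (N, a-umlaut, snowman).
my $chars      = join '', map { chr($_) } 0x4E, 0xE4, 0x2603;
my @codepoints = map { ord($_) } split //, $chars;

printf "U+%04X\n", $_ for @codepoints;

# An octet string: every element is a number from 0 to 255.
# These four octets are arbitrary sample data.
my $octets = pack 'C*', 0x89, 0x50, 0x4E, 0x47;
my @values = unpack 'C*', $octets;

print "@values\n";
```

In the first string, each element is identified by its codepoint, which can be far larger than 255. In the second, each element is one octet and nothing more.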
Note: Why octet and not byte? An octet is unambiguously 8 bits. A byte can be fewer or more bits, depending on esoteric hardware. Assuming that one character fits in 1 byte will cause us no end of Unicode grief. Separate the idea of memory storage from character representation. Forget that we ever heard of bytes.
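To see why one character cannot be assumed to fit in one octet, here is a small sketch that counts both. It assumes the core `Encode` module and UTF-8 as the encoding, neither of which this passage has introduced; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);   # assumption: core Encode module, not named in the text

my $text   = "na\x{EF}ve";             # five characters; U+00EF is i with diaeresis
my $octets = encode('UTF-8', $text);   # the same text encoded as UTF-8 octets

printf "characters: %d\n", length $text;     # characters: 5
printf "octets:     %d\n", length $octets;   # octets:     6
```

The word is five characters long, but its UTF-8 form takes six octets, because U+00EF needs two of them.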
Unicode and binary strings
Unicode strings and binary strings look ...