Unicode
Enter Unicode.
Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world’s languages. (Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn’t be sufficient.) Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn’t have an 'A' in it.
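You can poke at this one-to-one mapping from a Python interactive shell (a quick sketch, not part of the original text; it uses nothing beyond the built-in ord() and chr() functions):

```python
>>> ord('A')          # the Unicode number assigned to 'A'
65
>>> hex(ord('A'))     # written in hex, that's U+0041
'0x41'
>>> chr(0x41)         # and U+0041 always maps back to 'A'
'A'
>>> ord('日'), hex(ord('日'))   # ideographs get numbers of their own
(26085, '0x65e5')
```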
On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character‽ That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it’s wasteful even for ideograph-based languages ...
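To make that cost concrete, here is a minimal Python sketch (not from the original text) comparing a four-bytes-per-character encoding with a one-byte-per-character encoding, using codecs from the standard library:

```python
>>> english = 'hello'
>>> len(english.encode('utf-32-be'))   # 4 bytes for every character: 20 bytes total
20
>>> len(english.encode('latin-1'))     # a one-byte (256-number) encoding: 5 bytes total
5
```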