Unicode and UTF-8 strings

Unicode and UTF-8

Unicode and UTF-8 are hairy subjects.

Let’s have a quick recap of Unicode and UTF-8:

  1. Unicode is an international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs. Essentially it’s a big table of “code points”. It contains most (but not all) of the characters of all languages. Each code point is an index into that table, usually written in hexadecimal with the U+ notation, such as U+0041 for the letter A.
  2. Usually a code point represents a character, for instance, the Chinese character ⻯ (U+2EEF), but it can also be a geometric shape or a character modifier (such as the combining umlaut, U+0308, used to build letters like German ä, ö, and ü). For some reason, it can even be a poo icon (U+1F4A9).
  3. UTF-8 is one of the ways (and the most common one) to encode elements of that big Unicode table into actual bytes that computers can work with.
  4. A single Unicode code point can take between 1 and 4 bytes when encoded in UTF-8.
  5. ASCII characters, including digits and unaccented Latin letters (a-z, A-Z, 0-9), are encoded in 1 byte. Accented Latin letters and the letters of many other languages take more than 1 byte in UTF-8 encoding.
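
To make the recap concrete, here is a minimal Python sketch that prints the U+ notation and the UTF-8 byte length for a few of the characters mentioned above. The helper name describe is made up for this illustration, and Python is used only as a convenient way to inspect code points and bytes; other languages expose the same information through their own APIs.

```python
def describe(ch):
    """Return (U+ notation, UTF-8 byte length) for a single code point.

    `describe` is a hypothetical helper written for this example.
    """
    code_point = ord(ch)            # index in the Unicode table
    encoded = ch.encode("utf-8")    # the actual bytes
    return f"U+{code_point:04X}", len(encoded)


# Characters from the recap, written as escapes to avoid editor/encoding surprises.
samples = [
    "A",            # ASCII letter
    "7",            # ASCII digit
    "\u00e4",       # ä as a single precomposed code point
    "\u2eef",       # ⻯
    "\U0001F4A9",   # the poo icon
]

for ch in samples:
    notation, size = describe(ch)
    hex_bytes = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    print(f"{notation:>8}  {size} byte(s)  [{hex_bytes}]")

# A character modifier is a code point of its own: "a" followed by the
# combining umlaut (U+0308) is two code points and three UTF-8 bytes.
decomposed = "a\u0308"
print(len(decomposed), len(decomposed.encode("utf-8")))
```

Note that the letter ä can be stored either as one precomposed code point (U+00E4) or as the letter a followed by a combining umlaut, so the same visible character may have different code point counts and byte lengths.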
