Unicode and UTF-8

Learn how to work with the Unicode character set and the UTF-8 encoding.


Earlier, we saw that the underlying bytes of a string can lead to surprising results when using the strlen function. We also covered some of the history of character encodings, but we left one question unanswered: which character encoding was used when we converted our strings to byte arrays? Those byte arrays showed us how the characters were represented, but without knowing the character encoding, the values are meaningless because we would not know how to interpret them. The short answer is, “well, it’s Unicode,” but that only gets us so far.

Unicode

Unicode, like ASCII, is a character set that maps each character to an integer (Unicode calls these integers code points). However, unlike ASCII, Unicode does not dictate how those values are stored or transferred (remember, in ASCII, each character is mapped to and stored as a single byte).
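To make the difference between a code point and its stored bytes concrete, here is a minimal sketch in Python. The language and the sample string are assumptions chosen for illustration; this is not the text or the code used elsewhere in this course. It prints the code point of each character, then encodes the same string in two different ways.

```python
# Illustrative sample text (an assumption, not the string used earlier in the course).
text = "A中文"

# Code points: the integers Unicode assigns to each character.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+0041', 'U+4E2D', 'U+6587']

# The same code points stored under two different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

print(len(text))            # 3 code points
print(len(utf8_bytes))      # 7 bytes: 1 for "A", 3 for each Chinese character
print(len(utf16_bytes))     # 6 bytes: 2 per character
print(utf8_bytes.hex(" "))  # 41 e4 b8 ad e6 96 87
```

The code points stay the same in both cases. Only the byte sequences differ, which is why a byte-counting function can report a length larger than the number of characters.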

Let’s take a look at our Traditional Chinese text again, but with the Unicode code points:
