Unicode and UTF-8

Learn how to work with the Unicode character set and the UTF-8 encoding.

We have seen that a string's underlying bytes can lead to surprising results when using the strlen function. We also discussed some of the history of character encodings, but we left one question unanswered: which character encoding was used when we converted our strings to byte arrays? When we looked at those byte arrays, we could see the values that represent each character. Without knowing the character encoding, however, those values are meaningless, because we would not know how to interpret them. The short answer is “well, it’s Unicode,” but that only gets us so far.

Unicode

Unicode, like ASCII, is a character set that maps each character to an integer (Unicode calls these integers code points). Unlike ASCII, however, Unicode does not dictate how those values are stored or transferred (remember, in ASCII, each character is mapped to and persisted as a single byte).
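The separation between a code point and its stored bytes can be illustrated with a short sketch. Python is used here purely for illustration and is an assumption, not necessarily the language used elsewhere in this course. The character is U+4F60, the first character of our example text, and the three encodings are simply common choices for comparison.

```python
# A code point is just an integer; the bytes depend on the encoding we pick.
ch = "\u4f60"  # first character of the lesson's example text

code_point = ord(ch)          # the integer Unicode assigns to this character
print(f"U+{code_point:04X}")  # U+4F60

# The same code point serialized with three different encodings.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    data = ch.encode(encoding)
    print(f"{encoding}: {len(data)} bytes -> {data.hex(' ')}")
# utf-8: 3 bytes -> e4 bd a0
# utf-16-be: 2 bytes -> 4f 60
# utf-32-be: 4 bytes -> 00 00 4f 60
```

The code point never changes, but the byte count and byte values do, which is why a byte array cannot be interpreted without knowing its encoding.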

Let’s take a look at our Traditional Chinese text again, but with the Unicode code points:

Unicode Code Points Example

| Character | Code Point | Name |
|---|---|---|
| 你 | U+4F60 | CJK Unified Ideograph-4F60 |
| 好 | U+597D | CJK Unified Ideograph-597D |
| 荒 | U+8352 | CJK Unified Ideograph-8352 |
| 野 | U+91CE | CJK Unified Ideograph-91CE |
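As a rough check, the four code points listed above can be looked up programmatically. The sketch below assumes Python and its standard `unicodedata` module, which reports the name recorded in the Unicode Character Database (in upper case); the variable names are illustrative only.

```python
import unicodedata

# The four code points from the example, written as escapes.
text = "\u4f60\u597d\u8352\u91ce"

# Pair each character with its code point (U+XXXX form) and its Unicode name.
rows = [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for ch, code_point, name in rows:
    print(ch, code_point, name)
```

Each character in the string corresponds to exactly one code point here, regardless of how many bytes it would occupy once encoded.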

We can identify our four characters within the Unicode character set by referring to ...