What is UTF-8?


Background

In the world of computers, all data is ultimately represented with bits, each of which takes one of two values: 1 (high) or 0 (low). Therefore, high-level data such as text needs to be encoded as sequences of bits so that our machines can store, manipulate, and communicate it.

Traditionally, the most common encoding system has been ASCII. ASCII encodes the English alphabet, digits, punctuation and other symbols, and some control characters. Over time, the need to support more languages, scripts, symbols, and even emoticons arose, and a new encoding system became necessary.

The Unicode Consortium was incorporated in January 1991 in the state of California, four years after the concept of a new character encoding, to be called Unicode, was broached in discussions started by engineers from Xerox (Joe Becker) and Apple (Lee Collins and Mark Davis). Fast-forward to the 21st century: the two most popular Unicode encodings are UTF-8 and UTF-16. Below is a graph showing how UTF-8 has grown in popularity since 2006.

[Graph: growth in UTF-8 popularity since 2006]

Structure

A character encoded in UTF-8 occupies one to four bytes, depending on its code point:

  • 1 byte - ASCII.
  • 2 bytes - The remainder of almost all Latin-script alphabets, plus Greek, Cyrillic, Coptic, and more.
  • 3 bytes - The rest of the Basic Multilingual Plane, which contains characters for almost all modern languages and a large number of symbols.
  • 4 bytes - Historic scripts, mathematical symbols, and emoticons.
Number of Bytes   Byte 1      Byte 2      Byte 3      Byte 4
1                 0xxx xxxx
2                 110x xxxx   10xx xxxx
3                 1110 xxxx   10xx xxxx   10xx xxxx
4                 1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx

The table above shows the basic structure of how data is encoded in UTF-8. Each x is replaced by a 1 or a 0 taken from the bits of the code point. If the code point has no more than 7 significant bits, the first row applies; if it has no more than 11 bits, the second row applies; up to 16 bits, the third row; and up to 21 bits, the fourth row.
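The bit-filling rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular library implements UTF-8: the function name `encode_code_point` is hypothetical, and the result is checked against Python's built-in `str.encode("utf-8")`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by filling in the x bits of the table."""
    # Assumption for this sketch: reject values outside the Unicode range and
    # the surrogate range, which Python's own encoder also refuses to encode.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # Up to 7 bits: 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:
        # Up to 11 bits: 110x xxxx, 10xx xxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Up to 16 bits: 1110 xxxx, 10xx xxxx, 10xx xxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Up to 21 bits: 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    for ch in ["A", "é", "€", "😀"]:
        encoded = encode_code_point(ord(ch))
        # The hand-rolled result should agree with the built-in encoder.
        assert encoded == ch.encode("utf-8")
        print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

The four sample characters were chosen so that each row of the table is exercised once: they need one, two, three, and four bytes respectively.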

Features

  • Backward Compatibility: The first 128 code points, ranging from 0x0000 to 0x007F, are encoded exactly as they are in ASCII. This means that wherever ASCII was used, UTF-8 can replace it without any hassle, since valid ASCII text is already valid UTF-8.
  • Fallback and auto-detection: Some software still uses extended ASCII encodings, which are eight-bit or larger encodings that include the standard seven-bit ASCII characters plus additional characters. Their extra bytes do not map directly onto UTF-8, and they rarely form valid UTF-8 sequences by accident. A UTF-8 decoder can therefore detect such input and either fall back to the legacy encoding or replace the invalid bytes with an appropriate code point.
  • Libraries: When your code needs to read or write non-ASCII data, you will need UTF-8 support. Fortunately, there are libraries that provide it, such as ICU for C, C++, and Java.
  • Self-Synchronization: Go back to the table above. Notice that the leading byte of a multi-byte sequence starts with 11, the continuation bytes start with 10, and a single-byte character starts with 0. This makes character boundaries easy to find and avoids mistaking one character for another: a decoder that starts mid-sequence can skip the continuation bytes instead of decoding an incorrect character.
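
Three of the features above can be tried out with a short Python sketch. The helper names `is_continuation`, `resync`, and `decode_with_fallback` are hypothetical, and the choice of Latin-1 as the legacy encoding is an assumption made only for this example; real software may use a different fallback or a more careful detection strategy.

```python
def is_continuation(byte: int) -> bool:
    """Return True for continuation bytes, which have the form 10xx xxxx."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, start: int) -> int:
    """Return the first index at or after start that begins a character."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Try UTF-8 first; on invalid bytes, assume the fallback encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


if __name__ == "__main__":
    # Backward compatibility: ASCII text has the same bytes in both encodings.
    assert "hello".encode("ascii") == "hello".encode("utf-8")

    # Self-synchronization: index 2 is the continuation byte of "é",
    # so the decoder skips ahead to the next character boundary.
    data = "héllo €".encode("utf-8")
    start = resync(data, 2)
    print(data[start:].decode("utf-8"))

    # Fallback: these Latin-1 bytes are not valid UTF-8, so decoding falls back.
    legacy = "café".encode("latin-1")
    print(decode_with_fallback(legacy))

    # Alternative: keep decoding as UTF-8 and substitute U+FFFD for bad bytes.
    print(legacy.decode("utf-8", errors="replace"))
```

Falling back to a fixed legacy encoding is a design choice that only works when the likely source of non-UTF-8 input is known; the `errors="replace"` variant is the safer default when it is not.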
Copyright ©2024 Educative, Inc. All rights reserved