In the world of computers, data is represented using only two binary states: 1 (high) and 0 (low). Therefore, our high-level data needs to be encoded into these low-level bits so that our machines can store, manipulate, and communicate it.
Traditionally, the most common encoding system has been ASCII. ASCII encodes the English alphabet, numbers, symbols, and some special characters. With time, the need to incorporate more languages, scripts, symbols, characters, and even emoticons arose, and a new encoding system became necessary.
The Unicode Consortium was incorporated in January 1991 in the state of California, four years after the concept of a new character encoding, to be called Unicode, was broached in discussions started by engineers from Xerox (Joe Becker) and Apple (Lee Collins and Mark Davis). Fast-forward to the 21st century, and the two most popular ways to encode data are UTF-8 and UTF-16. Below is a graph showing how UTF-8 has grown in popularity since 2006.
The memory structure looks something like this:
Number of Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|
1 | 0xxx xxxx | | | |
2 | 110x xxxx | 10xx xxxx | | |
3 | 1110 xxxx | 10xx xxxx | 10xx xxxx | |
4 | 1111 0xxx | 10xx xxxx | 10xx xxxx | 10xx xxxx |
The table above shows the basic structure of how data is encoded in UTF-8 format. All of the x's are replaced by the bits of the code point. If the code point fits in no more than seven significant bits, the first row applies; if it fits in no more than 11 bits, the second row applies, and so on.
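To make the table concrete, here is a minimal Python sketch that applies these bit patterns by hand. The function name `encode_utf8` is purely illustrative, and the sketch skips details such as rejecting surrogate code points; in real code you would simply call `str.encode("utf-8")`.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single Unicode code point into UTF-8 bytes by hand (simplified sketch)."""
    if code_point <= 0x7F:                      # fits in 7 bits -> 1 byte: 0xxx xxxx
        return bytes([code_point])
    elif code_point <= 0x7FF:                   # fits in 11 bits -> 2 bytes
        return bytes([
            0b11000000 | (code_point >> 6),                  # 110x xxxx
            0b10000000 | (code_point & 0b00111111),          # 10xx xxxx
        ])
    elif code_point <= 0xFFFF:                  # fits in 16 bits -> 3 bytes
        return bytes([
            0b11100000 | (code_point >> 12),                 # 1110 xxxx
            0b10000000 | ((code_point >> 6) & 0b00111111),   # 10xx xxxx
            0b10000000 | (code_point & 0b00111111),          # 10xx xxxx
        ])
    elif code_point <= 0x10FFFF:                # up to 21 bits -> 4 bytes
        return bytes([
            0b11110000 | (code_point >> 18),                 # 1111 0xxx
            0b10000000 | ((code_point >> 12) & 0b00111111),  # 10xx xxxx
            0b10000000 | ((code_point >> 6) & 0b00111111),   # 10xx xxxx
            0b10000000 | (code_point & 0b00111111),          # 10xx xxxx
        ])
    raise ValueError("code point out of Unicode range")

# 'é' is U+00E9 -> two bytes, matching "é".encode("utf-8")
print(encode_utf8(0x00E9))   # b'\xc3\xa9'
# '€' is U+20AC -> three bytes
print(encode_utf8(0x20AC))   # b'\xe2\x82\xac'
```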
Code points 0x0000 to 0x007F map directly onto the ASCII code point range. This means that wherever ASCII was used, UTF-8 can replace it without any hassle. Notice that the leading byte of a multi-byte sequence always starts with 11, while the rest of the continuation bytes start with 10? This helps to separate character code points and avoid mistaking one character for another: if the stream of bits starts mid-sequence, an incorrect character will not be decoded.
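This self-synchronization property can be shown with a short sketch: a decoder that lands in the middle of a multi-byte sequence can skip any bytes whose top two bits are 10 and resume at the next leading byte. The helper name below is hypothetical, chosen only for clarity.

```python
def find_next_character_start(data: bytes, pos: int) -> int:
    """Skip continuation bytes (10xx xxxx) until a byte that can start a character."""
    while pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos += 1
    return pos

text = "héllo €".encode("utf-8")
# Pretend we landed inside the two-byte 'é' (its continuation byte sits at index 2):
start = find_next_character_start(text, 2)
print(text[start:].decode("utf-8"))  # "llo €" -- decoding resumes cleanly
```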