A Brief History of Character Encodings

Learn about the brief history of character encoding, including string conversion.

Introduction to strings

For a course that aims to explore PHP strings through the lens of the Laravel string helper functions, it seems obvious to start with the most basic question: what even are strings? The first answer that comes to mind is that a string is simply a list of characters. On the surface, this seems to answer the question, but we are only three sentences into this course and have already run into our first problem, which leads to even more questions.

Difference between a character and a string

The first such question might be: what is a character? When we look at text on a screen or in printed media, we naturally assume that each visually distinct unit of a piece of writing is a character. For example, if you grew up speaking and writing English, it is second nature to look at the following text and quickly see that it contains sixteen characters, counting the space:

Hello wilderness

But what about the following?

你好荒野

Since this course is digital, it is possible to highlight each character in that text, and you would find that it contains four characters. Let’s make doubly sure of this and use PHP’s strlen function to count the characters in this string:

PHP
<?php
// Returns 12
echo strlen('你好荒野');

The strlen function returns 12. That’s interesting! Is it broken? Let’s use it to count the number of characters in our English text:

PHP
<?php
// Returns 16
echo strlen('Hello wilderness');

This time the function returns 16, which is the correct number of characters in our text. What’s going on here? To answer this question, we should take a step back and think about how we store and represent text in computer memory.

History of character encodings

Suppose we must develop a compact way of sending messages between two groups, one that reduces the total amount of data sent but does not lose any information. Let’s say we only need to concern ourselves with words, not numbers or special characters. We could analyze previous communication between the two parties, find the frequent words and phrases, and represent them as numbers or symbols. When we want to send a new message, we replace those words and phrases with their corresponding number or symbol. When the recipient reads the message, they consult the table of numbers and symbols and reverse the process to get back the original message. We would have developed an encoding scheme: a way to represent and interpret data.
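The thought experiment above can be sketched in a few lines of PHP. This is only an illustration: the codebook contents and the encodeMessage and decodeMessage helpers are hypothetical names invented for this example, and the sketch assumes messages contain only space-separated words.

```php
<?php
// Hypothetical codebook: frequent words mapped to short numeric codes.
$codebook = [
    'hello'      => 1,
    'wilderness' => 2,
    'goodbye'    => 3,
];

// Encode: replace each known word with its code; unknown words pass through unchanged.
function encodeMessage(string $message, array $codebook): array
{
    return array_map(
        fn (string $word) => $codebook[$word] ?? $word,
        explode(' ', $message)
    );
}

// Decode: flip the table and swap each code back to its word.
function decodeMessage(array $encoded, array $codebook): string
{
    $reverse = array_flip($codebook);

    return implode(' ', array_map(
        fn ($token) => is_int($token) ? ($reverse[$token] ?? (string) $token) : $token,
        $encoded
    ));
}

$encoded = encodeMessage('hello wilderness', $codebook);
$decoded = decodeMessage($encoded, $codebook);

print_r($encoded);      // the two words become the codes 1 and 2
echo $decoded, PHP_EOL; // hello wilderness
```

Both sides must share the same table for the round trip to work, which is the same requirement a character encoding places on the program that writes text and the program that reads it.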

When we look at the history of text data and character encodings within computer systems, it does not take long for the name ASCII to come up. ASCII, an abbreviation of American Standard Code for Information Interchange, has its roots in telegraph codes developed at Bell.

In the early days of computing, it was common for each company or organization producing a computer system to go through the same exercise of creating its own way of storing text, that is, its own character encoding system. These systems were rarely compatible, and many parties, with the support and direction of the then American Standards Association, undertook work throughout the 1960s to produce a standard way of encoding text.

Much of this work was not widely adopted until March 11, 1968, when U.S. President Lyndon B. Johnson mandated that ASCII become a U.S. federal standard. In addition to promoting ASCII to a federal standard, the President also made it a requirement that all computers and related equipment purchased by the U.S. government starting July 1, 1969, be ASCII-compatible. Fast forward from these early days of the development of the standard (although there were many revisions to it in the following years) to 1981, when IBM used it to encode text with their first personal computer.

We will skip a lot more history and nuance, but we have enough context to proceed. Let’s now work to understand what all of this has to do with the unexpected result of the strlen function when attempting to count the number of characters in our piece of Traditional Chinese text. The first fact to know is that ASCII was initially devised as a 7-bit standard and provided an encoding for 128 characters. The nice thing about this is that each character is easily represented by a single byte within a computer. Or, put another way, each character is mapped to an integer between 0 and 127, which fits comfortably within the 0 to 255 range that a single byte can represent.
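As a small illustration of this character-to-integer mapping, the sketch below uses PHP's built-in ord and chr functions, which convert between a single byte and its integer value. The variable names are chosen for this example only.

```php
<?php
$text = 'Hello';

// ord() returns the integer value of a single byte; chr() goes the other way.
foreach (str_split($text) as $char) {
    printf('%s => %d%s', $char, ord($char), PHP_EOL);
}

// Collect the integer codes, then rebuild the string from them.
$codes   = array_map('ord', str_split($text));
$rebuilt = implode('', array_map('chr', $codes));

// Every code in a pure ASCII string stays within the 7-bit range.
$allAscii = max($codes) <= 127;

echo $rebuilt, PHP_EOL; // Hello
```

For this string the codes are 72, 101, 108, 108, and 111, all below 128, so each character occupies exactly one byte.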

We can convert our English string into a byte array using the following code:

PHP
<?php
$bytes = unpack('C*', 'Hello wilderness');
print_r($bytes);

The result is an array of sixteen integers, one per byte. If we were to look up those values in an ASCII table, we would find that each number is the code for the corresponding character in our original string. Let’s now look at the results of the following:

PHP
<?php
$bytes = unpack('C*', '你好荒野');
print_r($bytes);

Notice that our Traditional Chinese string produces a byte array containing twelve values, which leads us to the next piece of the puzzle: the strlen function counts the number of bytes, not the number of characters.

For English characters in ASCII, counting the number of bytes is equivalent to counting the number of characters in the string, because every English character is represented by a single byte. So how does this work with the Traditional Chinese text? Let’s start by breaking down each character and checking the byte array it produces:

// unpack('C*', '你');
array:3 [
1 => 228
2 => 189
3 => 160
]
// unpack('C*', '好');
array:3 [
1 => 229
2 => 165
3 => 189
]
// unpack('C*', '荒');
array:3 [
1 => 232
2 => 141
3 => 146
]
// unpack('C*', '野');
array:3 [
1 => 233
2 => 135
3 => 142
]
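One possible way to produce a breakdown like the one above in a single pass is sketched here. It assumes the source file is saved as UTF-8 (which the byte values above suggest) and that the mbstring extension is available; mb_str_split requires PHP 7.4 or later, and the variable names are illustrative.

```php
<?php
$text = '你好荒野';

// mb_str_split() splits on characters rather than on bytes.
$bytesPerCharacter = [];
foreach (mb_str_split($text, 1, 'UTF-8') as $character) {
    // unpack() returns a 1-indexed array; array_values() re-indexes it from 0.
    $bytesPerCharacter[$character] = array_values(unpack('C*', $character));
}

// Print each character followed by the bytes that encode it.
foreach ($bytesPerCharacter as $character => $bytes) {
    echo $character, ': ', implode(' ', $bytes), PHP_EOL;
}
```

Each of the four characters maps to three bytes, which accounts for the twelve bytes counted earlier.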

As we can see, each of those characters is composed of multiple bytes, which is where the name multibyte comes from in the context of PHP’s multibyte string functions. One of the multibyte string functions is the mb_strlen function, which is the multibyte version of the strlen function. Using this function on our Traditional Chinese text returns the value four, which matches what we visually see:

PHP
<?php
// Returns 4
echo mb_strlen('你好荒野');
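To see the two ways of counting side by side, the following sketch passes the encoding to mb_strlen explicitly instead of relying on the configured default. It assumes the string is UTF-8 encoded; '8bit' is an mbstring encoding name that treats every byte as one character.

```php
<?php
$text = '你好荒野';

$byteCount      = strlen($text);             // counts bytes
$characterCount = mb_strlen($text, 'UTF-8'); // counts characters, encoding stated explicitly
$asRawBytes     = mb_strlen($text, '8bit');  // treats each byte as its own character

// Prints: 12 4 12
echo $byteCount, ' ', $characterCount, ' ', $asRawBytes, PHP_EOL;
```

The same function gives different answers depending on the encoding it is told to assume, which is why a character count only makes sense once the encoding of the string is known.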