Using Laravel for Advanced String Manipulation in PHP

A Brief History of Character Encodings

Learn about the brief history of character encoding, including string conversion.

We'll cover the following...

Introduction to strings
History of character encodings

The first such question might be: what is a character? When we look at the text on a screen or in printed media, we naturally assume that each visually distinct unit of a piece of writing is a character. For example, if you grew up speaking and writing English, it is second nature to look at the following text and quickly see that it contains sixteen characters, counting the space:

Hello wilderness

But what about the following?

你好荒野

Since this course is digital, it is possible to highlight each character in that text, and you would ascertain that the previous text contains four characters. Let’s make doubly sure of this by using PHP’s strlen function to count the characters in this string.
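A minimal sketch of that check, assuming the source file is saved as UTF-8. The variable names are illustrative and not taken from the course:

```php
<?php

// Assumes this file is saved as UTF-8, so each Chinese character
// below is stored as three bytes.
$chinese = '你好荒野';
$english = 'Hello wilderness';

// strlen() reports the length of a string in bytes.
$chineseLength = strlen($chinese);
$englishLength = strlen($english);

echo $chineseLength, PHP_EOL; // 12
echo $englishLength, PHP_EOL; // 16
```

The Chinese string reports 12 even though we counted four characters, while the English string reports the 16 we expected.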

Run against the Chinese text, strlen returns 12 rather than the four characters we counted. Run against our English text, however, the function returns 16, which is the correct number of characters. What’s going on here? To answer this question, we should take a step back and think about how we store and represent text in computer memory.

History of character encodings

Suppose we must develop a compact way of sending messages between two groups, one that reduces the total amount of data sent but does not lose any information. Let’s say we only need to concern ourselves with words, not numbers or special characters. We could analyze previous communication between the two parties, find the frequent words and phrases, and represent them as numbers or symbols. When we want to send a new message, we replace those words and phrases with their corresponding number or symbol. When the recipient reads the message, they consult the table of numbers and symbols and reverse the process to get back the original message. We would have developed an encoding scheme: a way to represent and interpret data.
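The scheme described above can be sketched in a few lines of PHP. The codebook, the function names, and the sample message are all hypothetical and exist only to illustrate the idea of a shared lookup table:

```php
<?php

// Hypothetical codebook agreed on in advance by both parties.
$codebook = [
    'meet'     => 1,
    'at'       => 2,
    'the'      => 3,
    'harbor'   => 4,
    'midnight' => 5,
];

// Encode: swap each known word for its number.
function encodeMessage(string $message, array $codebook): array
{
    $encoded = [];
    foreach (explode(' ', $message) as $word) {
        // Words missing from the codebook are sent unchanged.
        $encoded[] = $codebook[$word] ?? $word;
    }
    return $encoded;
}

// Decode: reverse the table and swap numbers back to words.
function decodeMessage(array $encoded, array $codebook): string
{
    $reverse = array_flip($codebook);
    $words = [];
    foreach ($encoded as $symbol) {
        $words[] = $reverse[$symbol] ?? $symbol;
    }
    return implode(' ', $words);
}

$encoded = encodeMessage('meet at the harbor at midnight', $codebook);
$decoded = decodeMessage($encoded, $codebook);

echo implode(' ', $encoded), PHP_EOL; // 1 2 3 4 2 5
echo $decoded, PHP_EOL;               // meet at the harbor at midnight
```

Both sides need the same table for the round trip to work, which is exactly the compatibility problem that character encodings have to solve.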

When we look at the history of text data and character encodings within computer systems, it does not take long for the name ASCII to come up. ASCII, an abbreviation of American Standard Code for Information Interchange, has its roots in telegraph codes developed at Bell.

In the early days of computing, it was common for each company or organization producing a computer system to go through the same exercise of creating its own way of storing text, that is, its own character encoding system. These systems were rarely compatible, and many parties, with the support and direction of the then American Standards Association, undertook work throughout the 1960s to produce a standard way of encoding text.

Much of this work was not widely adopted until after March 11, 1968, when U.S. President Lyndon B. Johnson mandated that ASCII become a U.S. federal standard. In addition to promoting ASCII to a federal standard, the President also made it a requirement that all computers and related equipment purchased by the U.S. government starting July 1, 1969, be ASCII-compatible. The standard was revised several times in the following years, but we can fast forward to 1981, when IBM used it to encode text on their first personal computer.

We will skip a lot more history and nuance, but we have enough context to proceed. Let’s now work out what all of this has to do with the strlen function returning the wrong result when counting the characters in our piece of Traditional Chinese text. The first fact to know is that ASCII was initially devised as a 7-bit standard and provided an encoding for 128 characters. The nice thing about this is that each character is easily represented by a single byte within a computer. Put another way, each ASCII character is mapped to an integer between 0 and 127, which fits comfortably within the 0 to 255 range of a single byte.
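To make the mapping concrete, here is a small sketch using PHP's built-in ord and chr functions. The sample characters are arbitrary choices for illustration:

```php
<?php

// A few arbitrary ASCII characters to inspect.
$samples = ['A', 'a', ' ', '~'];

// ord() returns the integer value of the first byte of a string.
$codes = array_map('ord', $samples);

foreach ($samples as $index => $char) {
    // %07b prints the value as seven binary digits.
    printf("'%s' => %d => %07b" . PHP_EOL, $char, $codes[$index], $codes[$index]);
}
// 'A' => 65 => 1000001
// 'a' => 97 => 1100001
// ' ' => 32 => 0100000
// '~' => 126 => 1111110

// chr() goes the other way, from an integer back to a one-byte string.
$roundTrip = chr(65);
echo $roundTrip, PHP_EOL; // A
```

Every value fits in seven binary digits, which is why a single byte is more than enough for any ASCII character.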

Suppose we convert both our English string and our Traditional Chinese string into byte arrays, so that we can see each byte as an integer.
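One way to do this in PHP is with unpack, sketched here under the assumption that the source file is saved as UTF-8. The course may use a different approach; the variable names are illustrative:

```php
<?php

// Assumes this file is saved as UTF-8.
$english = 'Hello wilderness';
$chinese = '你好荒野';

// unpack('C*', ...) reads every byte as an unsigned integer.
// array_values() re-indexes the result from 0 (unpack starts at 1).
$englishBytes = array_values(unpack('C*', $english));
$chineseBytes = array_values(unpack('C*', $chinese));

echo implode(' ', $englishBytes), PHP_EOL;
// 72 101 108 108 111 32 119 105 108 100 101 114 110 101 115 115

echo implode(' ', $chineseBytes), PHP_EOL;
// 228 189 160 229 165 189 232 141 146 233 135 142

echo count($englishBytes), ' ', count($chineseBytes), PHP_EOL; // 16 12
```

The English string yields one value per character, all below 128. The Chinese string yields values above 127, and more of them than it has characters.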

Notice that our Traditional Chinese string becomes a byte array containing twelve values, even though it has only four characters. This leads us to the next piece: the strlen function counts the number of bytes, not the number of characters.

For English characters in ASCII, counting the number of bytes is equivalent to counting the number of characters in the string: all English characters can be represented by a single byte. So how does this work with the Traditional Chinese text? Let’s start by breaking the text down character by character and checking which bytes each one produces.
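The breakdown can be sketched as follows, again assuming UTF-8 source. Splitting with preg_split and the u modifier is one option chosen for this illustration and is not necessarily the approach the course takes:

```php
<?php

// Assumes this file is saved as UTF-8.
$chinese = '你好荒野';

// The "u" modifier makes PCRE treat the string as UTF-8, so an empty
// pattern splits between characters rather than between bytes.
$characters = preg_split('//u', $chinese, -1, PREG_SPLIT_NO_EMPTY);

// Map each character to the list of bytes that encode it.
$breakdown = [];
foreach ($characters as $character) {
    $breakdown[$character] = array_values(unpack('C*', $character));
}

foreach ($breakdown as $character => $bytes) {
    echo $character, ' => ', implode(' ', $bytes), PHP_EOL;
}
// 你 => 228 189 160
// 好 => 229 165 189
// 荒 => 232 141 146
// 野 => 233 135 142
```

Each of the four characters occupies three bytes, which accounts for the twelve bytes that strlen reports.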
