The Relation to String Slices

This next bit isn’t strictly necessary to understand most programming in Rust, but I think it’s helpful.

There’s a data type we haven’t directly talked about yet, called a char. It represents a single character. This could be the letter A, or the @ sign, or the Hebrew letter Alef (א), or many other things. We’ll get back to that part. In Rust (and many other languages), a character literal is a character surrounded by single quotes, e.g., 'A'.

A string, logically, is a sequence of characters. You can think of "Hello" as ['H', 'e', 'l', 'l', 'o']. However, that’s not the way Rust actually represents a string. Instead, it does something totally different. Let’s see why.

There are a lot of characters in the world. I mentioned the Latin alphabet, like the letter A. I mentioned symbols, like the @ sign. I mentioned other languages, like Hebrew. There are many, many thousands of characters we would like to deal with. There’s a standard called Unicode that assigns a number (called a code point) to each of these characters. For example, A is 65, @ is 64, and Alef (א) is 1488.
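
To see these numbers for yourself, here is a minimal sketch that casts a char to a u32. The helper name code_point is made up for this illustration and is not part of the standard library.

```rust
/// Hypothetical helper: the Unicode code point of a character as a number.
fn code_point(c: char) -> u32 {
    // Casting a char to u32 yields its Unicode scalar value.
    c as u32
}

fn main() {
    // '\u{05D0}' is the Hebrew letter Alef.
    for &c in ['A', '@', '\u{05D0}'].iter() {
        println!("{} -> {} (U+{:04X})", c, code_point(c), code_point(c));
    }
}
```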

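As of Unicode 12.1, there are 137,994 characters in Unicode. The question then is, how big must a char be to hold all of those different potential values? Two bytes can only distinguish 65,536 values, which isn’t enough. Three bytes would technically do, but that’s an awkward size for a computer to work with, so in practice (and in Rust) a char is 4 bytes.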

If we used that array-of-chars representation for a string, then the word “Hello” would take up 20 bytes. Considering that a large amount of the work that computers do involves mostly-Latin data, it would be nice to make that smaller.
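
The 20-byte figure can be checked with the standard size_of and size_of_val functions. This is only a sketch, and char_slice_bytes is a name invented for the example.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical helper: number of bytes occupied by a slice of chars.
fn char_slice_bytes(chars: &[char]) -> usize {
    size_of_val(chars)
}

fn main() {
    let hello = ['H', 'e', 'l', 'l', 'o'];
    println!("one char: {} bytes", size_of::<char>());
    println!(
        "array of {} chars: {} bytes",
        hello.len(),
        char_slice_bytes(&hello)
    );
}
```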

Thankfully, there’s an encoding called UTF-8 that helps with that. An encoding says how to represent a sequence of characters as binary data. UTF-8 is cool for many reasons, but for our purpose specifically, it takes only 1 byte to store each ASCII character, which covers the basic Latin letters, digits, and common punctuation. For many other common scripts, it uses only 2 or 3 bytes. And the most it ever uses is 4 bytes. So UTF-8 never takes up more memory than an array of chars, and usually takes up much less.
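
Here is a rough sketch of those sizes in code. It relies on char::len_utf8 and on the fact that str::len counts bytes rather than characters. The helper names bytes_as_chars and bytes_as_utf8 are invented for this comparison.

```rust
use std::mem::size_of;

/// Hypothetical helper: bytes needed if every character were a 4-byte char.
fn bytes_as_chars(s: &str) -> usize {
    s.chars().count() * size_of::<char>()
}

/// Hypothetical helper: bytes used by the UTF-8 encoding.
fn bytes_as_utf8(s: &str) -> usize {
    // `len` on a str counts bytes, not characters.
    s.len()
}

fn main() {
    // One sample character for each UTF-8 length, from 1 to 4 bytes:
    // Latin A, Hebrew Alef, the euro sign, and an emoji.
    for &c in ['A', '\u{05D0}', '\u{20AC}', '\u{1F603}'].iter() {
        println!("{} takes {} byte(s) in UTF-8", c, c.len_utf8());
    }
    println!(
        "Hello: {} bytes as chars, {} bytes as UTF-8",
        bytes_as_chars("Hello"),
        bytes_as_utf8("Hello")
    );
}
```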

Here is a demonstration of how the English versus Russian words for Hello are encoded as characters versus bytes:

String literal   "Hello"                      "Алло"
As characters    ['H', 'e', 'l', 'l', 'o']    ['А', 'л', 'л', 'о']
As bytes         [72, 101, 108, 108, 111]     [208, 144, 208, 187, 208, 187, 208, 190]
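
The same comparison can be reproduced with the chars and bytes methods on a string slice. This is a small sketch, and chars_of and bytes_of are helper names made up for it.

```rust
/// Hypothetical helper: the characters of a string, collected into a Vec.
fn chars_of(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Hypothetical helper: the UTF-8 bytes of a string, collected into a Vec.
fn bytes_of(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

fn main() {
    // The second word is written entirely in Cyrillic letters.
    for word in ["Hello", "Алло"].iter() {
        println!("{:?}", word);
        println!("  as characters: {:?}", chars_of(word));
        println!("  as bytes:      {:?}", bytes_of(word));
    }
}
```

Each Cyrillic letter takes 2 bytes, which is why the 4-character Russian word is longer in bytes than the 5-character English one.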