The Relation to String Slices
This next bit isn’t strictly necessary to understand most programming in Rust, but I think it’s helpful.
There’s a data type we haven’t directly talked about yet, called a char. It represents a single character. This could be the letter A, or the @ sign, or the Hebrew letter Alef (א), or many other things. We’ll get back to that part. In Rust (and many other languages), a character literal is a character surrounded by single quotes, e.g., 'A'.
A string, logically, is a sequence of characters. You can think of "Hello" as ['H', 'e', 'l', 'l', 'o']. However, that’s not the way Rust actually represents a string. Instead, it does something totally different. Let’s see why.
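Even though a string isn’t stored this way, you can still ask Rust for the logical, character-by-character view. Here is a minimal sketch; the helper name to_chars is made up for illustration, while chars() is the standard method on string slices.

```rust
/// Collects the characters of a string slice into a vector of `char`s.
/// (`to_chars` is a hypothetical helper name, not part of the standard library.)
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let chars = to_chars("Hello");
    // The logical view of the string: one `char` per character.
    assert_eq!(chars, vec!['H', 'e', 'l', 'l', 'o']);
    println!("{:?}", chars);
}
```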
There are a lot of characters in the world. I mentioned the Latin alphabet, like the letter A. I mentioned symbols, like the @ sign. I mentioned other languages, like Hebrew. There are many, many thousands of characters we would like to deal with. There’s a standard called Unicode that assigns a number, called a code point, to each of these characters. For example, the letter A is 65, the @ sign is 64, and the Hebrew letter Alef (א) is 1488.
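If you want to see these numbers for yourself, casting a char to u32 gives you its Unicode number. The sketch below assumes a small helper called code_point, which is my own name for illustration; the cast and char::from_u32 are standard Rust.

```rust
/// Returns the Unicode number (code point) of a character.
/// (`code_point` is a hypothetical helper name used for illustration.)
fn code_point(c: char) -> u32 {
    c as u32
}

fn main() {
    for c in ['A', '@', 'א'] {
        // Print each character with its number in decimal and in U+XXXX form.
        println!("{} -> {} (U+{:04X})", c, code_point(c), code_point(c));
    }

    // Going the other way can fail, because not every number is a valid
    // character, so `char::from_u32` returns an `Option`.
    assert_eq!(char::from_u32(1488), Some('א'));
    assert_eq!(char::from_u32(0xD800), None);
}
```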
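As of Unicode 12.1, there are 137,994 characters in Unicode. The question then is, how big must a char be to hold all of those different potential values? Two bytes can distinguish only 65,536 values, which isn’t enough. Three bytes would be enough, but computers don’t work comfortably with 3-byte values, so in practice the answer is 4 bytes, and that is the size of a Rust char.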
If we used that array-of-chars representation for a string, then the word “Hello” would take up 20 bytes. Considering that a large amount of the work that computers do involves mostly-Latin data, it would be nice to make that smaller.
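You can check this arithmetic with std::mem::size_of and size_of_val. This is just a sketch; bytes_as_chars is a hypothetical helper I’m introducing to compute the size of the array-of-chars representation.

```rust
use std::mem::{size_of, size_of_val};

/// Bytes needed to store `s` as an array of `char`s: one fixed-size `char`
/// per character. (`bytes_as_chars` is a hypothetical helper name.)
fn bytes_as_chars(s: &str) -> usize {
    s.chars().count() * size_of::<char>()
}

fn main() {
    // A single `char` always occupies 4 bytes.
    assert_eq!(size_of::<char>(), 4);

    // An actual array of five `char`s therefore occupies 20 bytes.
    let hello = ['H', 'e', 'l', 'l', 'o'];
    assert_eq!(size_of_val(&hello), 20);

    assert_eq!(bytes_as_chars("Hello"), 20);
}
```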
Thankfully, there’s an encoding called UTF-8 that helps with that. An encoding says how to represent a sequence of characters as binary data. UTF-8 is cool for many reasons, but for our purposes specifically, it takes only 1 byte to store each basic Latin (ASCII) character, such as an unaccented English letter. For many other common scripts, it uses only 2 or 3 bytes per character. And the most it ever uses is 4 bytes. So UTF-8 never takes up more memory than an array of chars, and usually takes up much less.
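To make the 1-to-4-byte range concrete, the standard len_utf8 method reports how many bytes a given char needs in UTF-8. In the sketch below, utf8_widths is a hypothetical helper, and the euro sign and the emoji are my own picks for 3-byte and 4-byte examples.

```rust
/// Returns the number of UTF-8 bytes each character of `s` needs.
/// (`utf8_widths` is a hypothetical helper name used for illustration.)
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    assert_eq!('A'.len_utf8(), 1); // basic Latin
    assert_eq!('א'.len_utf8(), 2); // Hebrew
    assert_eq!('€'.len_utf8(), 3); // euro sign
    assert_eq!('😀'.len_utf8(), 4); // emoji

    // The per-character widths add up to the byte length of the string.
    let s = "A€";
    let total: usize = utf8_widths(s).iter().sum();
    assert_eq!(total, s.len());
}
```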
Here is a demonstration of how the English versus Russian words for Hello are encoded as characters versus bytes:
| String literal | "Hello" | "Алло" |
|---|---|---|
| As characters | ['H', 'e', 'l', 'l', 'o'] | ['А', 'л', 'л', 'о'] |
| As bytes | [72, 101, 108, 108, 111] | [208, 144, 208, 187, 208, 187, 208, 190] |
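The byte rows of this table can be reproduced with the standard bytes() method. This is a small sketch, with utf8_bytes as a hypothetical helper name; note that the Russian word here is typed entirely in Cyrillic letters, including the first and last ones, which only look like Latin A and o.

```rust
/// Returns the UTF-8 bytes that make up a string slice.
/// (`utf8_bytes` is a hypothetical helper name used for illustration.)
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

fn main() {
    for word in ["Hello", "Алло"] {
        println!(
            "{}: {} characters, {} bytes: {:?}",
            word,
            word.chars().count(),
            word.len(), // `len` counts bytes, not characters
            utf8_bytes(word)
        );
    }

    // Four Cyrillic characters at 2 bytes each.
    assert_eq!("Алло".chars().count(), 4);
    assert_eq!("Алло".len(), 8);
}
```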