Our users’ passwords are extremely valuable to attackers. An attacker with access to user passwords pretty much has full control of our application. Such an attacker will probably be able to gain access to many other systems as well: unfortunately, people frequently use the same passwords on multiple websites.

We’ll obviously do everything we can to prevent compromise. But if we fail in that, we can take an important precaution to lower the value of our users’ passwords—we won’t store the user passwords at all. Instead, we’ll use a cryptographic hashing algorithm so that we store a value that’s derived from the password.

We’re going to start with a brief discussion of hashing and then walk through a brief history of password-storage techniques.

On storing (and not storing) Your users’ passwords

Instead of storing a user password, we can store a value that’s derived from the password itself. If this derivation can only be performed in one direction (that is, it’s easy to calculate the derived value given a password but it’s hard to go from a derived value back to the original password), then we’ll have a great defense. When a user creates an account for themselves, they’ll enter a password and we won’t store the password itself, we’ll only store the derived value. Next time the user logs in, they’ll type in their password, we’ll perform that same derivation and compare it to the previously stored derived value. If the two derived values match, we know the user typed in the right password. If they don’t match, we know the user typed in the wrong password.

At no time do we ever store the user password itself, only the derived value. This helps us in the unfortunate event that an attacker gets access to our database. Knowing the derived value of a password doesn’t help an attacker log in, because if they type in the derived value when they attempt to log in, the system will derive a value from the derived value, and that won’t match the derived value of the real password, so the attacker won’t be able to log in. These derived values are much less valuable to an attacker.

Real life derivations

This talk of derived values is a little magical and hand-wavy, so let’s talk about real-life derivations. A hashing function has the properties that we’re looking for. Hashing functions have been thoroughly researched in academia and widely used in industry. Hashing algorithms have a long history of use in preventing and detecting accidental corruption of data from unreliable networks and file systems. More recently, they’ve been cleverly applied to the problem of password storage. A hashing algorithm takes an arbitrary-length input and maps it into a fixed-length bucket called a hash code. The mapping of arbitrary-length input to hash code is said to be “one-way” only. That is, it’s easy to calculate a hash code for any input. But it’s “hard” to calculate the reverse of this and find an input given only a hash code. In this case, by “hard” we’re using the academic modesty of computer science—there are no known ways to reverse any of the widely used hash algorithms other than to try all of the possible inputs and see which one matches the given hash code.

So if we pick a hashing algorithm that produces hash codes that are uniformly distributed over a very large number of possible outputs, we have the beginning of a secure way to store our user passwords.

Let’s take a look at a concrete example—the popular SHA-256 hashing algorithm. SHA-256 takes an input of any size and maps it into an output of 256 bits (32 bytes). It’s kind of crazy that you could take a really large input, like all the data on a 50-GB Blu-ray disc, and map it into just 32 bytes. The thing to remember is that this operation is not compression because it is not reversible. Given a 256-bit hash, you can’t tell if the input that generated it is small or large or whether it was a Blu-ray, text, gif, or something else, and you can’t work backward to find the input that generated it.

Recall that back in What’s So Great About a Deck of Playing Cards? we saw that the number of different 256-bit strings is really big. Large numbers are necessary, but not sufficient, to keep passwords safe. To see why that is, let’s take a look at how attacks and defenses have leapfrogged each other over the years.

Store passwords in the clear

Initially, passwords were just stored in the clear on the server. The thinking was that an attacker who got as far as being able to read from the database had already “won,” so why bother doing anything else? You can see evidence of this having been a trend if you think back to websites in the 1990s and early 2000s. It was common practice at the time for websites to email passwords back to users who clicked on the “I forgot my password” button.

With passwords stored in the clear in a database, an attacker who gets database access, say, through SQL injection, can exfiltrate the passwords of every user of the system. This is a big problem, especially since people tend to reuse passwords across websites. So a breach at one site could impact many other sites that have no vulnerabilities at all.

Reversibly encrypt everyone’s passwords

One weak response to this threat is to encrypt all passwords before storing them in the database. But it’s weak because the master password that allows the system to decrypt all of the user passwords has to be known to the system. If it weren’t, the system would not be able to log anyone in. So this defense only helps in a very narrow set of circumstances.

Store hash of passwords

The important insight for secure password storage is to realize that you never have to store the password itself. Instead, you can store a value that’s derived from the password. You need a derivation that’s extremely unlikely to generate the same derived value for two different inputs. Initially, people used this insight by storing the output of generic hashing algorithms like MD5 and SHA-1 instead of the password itself. Then, when a user logged in, the hash output of the user-supplied input was compared to the previously stored hash output from when the user was created. If they’re the same, then the user is logged in. This was a step forward because the passwords themselves no longer had to be stored. So if the database were compromised, the passwords weren’t given up to the attacker directly.

Rainbow tables

For a time, the best an attacker could hope to do with password hashes was to look them up via rainbow tables. Rainbow tables are generated by hashing each ASCII input up to a certain length and storing the input and the hash so that one can look up an ASCII input given a hash. Rainbow tables take up a lot of space, and they take a lot of time to build. For example, there are 95 printable ASCII characters. If you want to generate the hashes for all printable passwords up to 8 characters in length, there are

(95)+(952)+(953)+(954)+(955)+(956)+(957)+(958)==6,704,780,954,517,1206.7×1015(95)+(95^2)+(95^3)+(95^4)+(95^5)+(95^6)+(95^7)+(95^8) == 6,704,780,954,517,120 \approx 6.7 \times 10^{15}

different printable 8-character passwords. That’s over 6 quadrillion. (I had to look up what comes after trillion.) Storing all of these takes up quite a lot of space. A SHA-256 hash is 64 bytes, so storing just the hashes takes up approximately 429,105,981,089,095,700429,105,981,089,095,700 bytes, or approximately 399,636,087399,636,087 GB or approximately 381381 PB.

Now if your password is chosen from a smaller alphabet, say just the lowercase English alphabet, the storage requirements drop dramatically to

(26)+(262)+(263)+(264)+(265)+(266)+(267)+(268)==217,180,147,1582.17×1011(26)+(26^2)+(26^3)+(26^4)+(26^5)+(26^6)+(26^7)+(26^8) == 217,180,147,158 \approx 2.17 \times 10^{11}

or approximately 202 GB. Hence a lot of the advice from this period was to choose a password with at least one uppercase, one lowercase, one numeric, and one non-alphanumeric character.

So rainbow tables are cumbersome and were even more cumbersome in decades gone by. But even with modern computers, we can see that the exponential growth of adding additional characters to passwords would make rainbow tables impractical. Even if we use the lowly lowercase English alphabet for our password, by extending the password length out to 15 characters, we get a staggering

(2615)==1,677,259,342,285,725,925,3761.6×1021(26^{15}) == 1,677,259,342,285,725,925,376 \approx 1.6 \times 10^{21}

That’s 5 orders of magnitude bigger than the key space of all printable ASCII passwords up to 8 characters in length. So the storage requirements for a rainbow table for 15-character passwords of just lowercase letters would be about 5 orders of magnitude larger, or about 95,341,155 PB. Wouldn’t it be cool if we could make all the passwords longer? That would make us pretty safe against rainbow tables. This is the insight behind salts.

A salt is a nonsecret value that is stored in the clear adjacent to the password. We should assume that an adversary who can read hashed passwords can also read the salt. The salt is concatenated with the plaintext password before hashing. This renders the rainbow table ineffective. If you just pick a salt of, say 14 characters, you know that even if a user managed to choose a one-character password (you should prevent this, by the way!), the adversary would need a rainbow table that contained at least 261526^{15}, or 1.6×10211.6 \times 10^{21} different passwords in order to attack this password, and we just saw that this requires about 95,341,15595,341,155 PB of storage, so we can see that salting makes rainbow tables completely impractical.

But no one should care about merely stopping rainbow table attacks because there are much more efficient attack vectors nowadays.

                                                 Q U I Z  

Get hands-on with 1400+ tech skills courses.