What devs need to know about Encoding / Encryption / Hashing / Salting / Stretching

Written by proxyblue | Published 2020/02/21
Tech Story Tags: security | infosec | development | passwords | database | dev

TLDR There are two types of encryption: Symmetric Key Encryption and Asymmetric Key Encryption. The difference between encoding and encryption is that encryption needs a key to encrypt/decrypt, so only someone with the key can view the unencrypted data. Developers should only ever use encryption algorithms and libraries that have been thoroughly tested. It's hard to create an algorithm that no one else can break, even after years of analysis by the best cryptographers around.

This is a typical exchange about encryption with someone willing to learn.
In the world of software development, I see people get encryption terms and usage wrong a lot. One of the many pieces of feedback I’ve received from courses I run is that people have never been clear on the differences between these terms, or on what the best practice is when using them.
I have absolutely no judgement for these people, except that it’s very possible they’ve written something in production that is extremely insecure and don’t realise it. And then it gets hacked by some skid (script kiddie) with a script.
I’ve audited many external systems and some of our own systems and found insecurities in data storage and transmission because the developers don’t understand how to use certain security algorithms.
Let’s explore some concepts (with examples of best and bad practices). The examples reference encrypting a file, but many of the same principles are used for “storing passwords” and helping with authenticating into a service.

Encoding

Many people tend to confuse the terms encryption and encoding, or use them interchangeably. This is understandable, because encryption is a form of encoding.
Encoding is the process of applying a specific code, such as letters, symbols and numbers, to data for conversion into an equivalent cipher.
This sounds like encryption^^.
The biggest difference between encoding and encryption is that encryption needs a key to encrypt/decrypt.
Because we use the word ‘encoding’ for other types of data formats, such as base64, Unicode, UTF-8 and other formats that don’t need a key to encode / decode, we almost never refer to encryption as encoding.
Encryption ‘is’ a form of encoding, but we never really call it that.
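To make the "no key" point concrete, here is a minimal sketch using base64 from Python's standard library. The sample message is just an illustration. Notice that nothing secret is involved at any step, which is exactly why encoding is not a security measure.

```python
import base64

message = "The quick brown fox jumps over the lazy dog"

# Encoding: anyone can do this, no key involved.
encoded = base64.b64encode(message.encode("utf-8")).decode("ascii")

# Decoding: anyone can undo it, again with no key.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == message)  # True
```

If someone tells you their data is "protected" with base64, it is protected from nobody.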

Encryption

As we just covered, encryption is the process of turning data into a specific format, and it requires a key to lock / unlock the encrypted information. Only someone with a key can view the unencrypted data. Everyone else can only read ciphertext.
Now, it shouldn’t need to be said (but it will be): you should ONLY ever use encryption algorithms and libraries that have been thoroughly tested. If you try to create your own crypto, and you don’t have a PhD in Mathematics and decades of experience in cryptography, and haven’t had your research peer-reviewed by other cryptanalysts, I promise you, your algorithm isn’t clever or secure.
Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break. It's not even hard. What is hard is creating an algorithm that no one else can break, even after years of analysis. And the only way to prove that is to subject the algorithm to years of analysis by the best cryptographers around.
- Schneier's Law
There are two types of encryption: Symmetric Key Encryption and Asymmetric Key Encryption. For the purpose of this blog post, I'm only going to go into Symmetric Key encryption in greater detail.

Asymmetric Key Encryption

Asymmetric Key Encryption is when the key used to encrypt the data is different to the key used to decrypt the data. This is also known as Public Key Cryptography.
I’m not going to go into too much detail here, since most developers end up working with Symmetric Key encryption or password hashing at some point in their lives rather than implementing a system that requires Asymmetric Key Encryption. TLS (https) and PGP are famous examples of using Asymmetric Key Algorithms.
If you want more in-depth knowledge as to how the mathematical magic behind having two different keys and shared keys works — I recommend looking up Diffie-Hellman key exchange and RSA algorithms.

Symmetric Key Encryption

Symmetric Key encryption is an encryption algorithm where the key we use to encrypt the data is the same used to decrypt the data.
The keys are the same. Symmetric Key Encryption.
This is pretty simple to understand. Developers tend to be very good at finding a good library that implements an encryption algorithm for their chosen language. GitHub and Google have plenty of these.
Where I see developers go wrong is in generating and storing keys.
Key generation is always a problem for unfamiliar developers

Bad dev move #1: Static Key

This is very common. A developer has understood that a key should be ‘strong’, and so they create a strong key within their application code and use that everywhere for encryption.
The problem is that once the key is discovered (either within the source code, or through cryptanalytic techniques), everything is easy to decrypt because you used the same key everywhere.
This key I'm using in my images is the ACTUAL key WhatsApp used to use to encrypt every backup on every phone on the planet.
A key needs to be unique, and known only to the person who is allowed to access the information.

Bad Dev Move #2: Simple Keys

A key should only be known by the user authorised to view the information. Perfect. It’s password time.
Does this seem like a good idea?
When I see this, it’s clear the developer has understood that the key should not be static, and that only the authorised users should be able to view the relevant information.
But, as we know from experience, we can’t trust the user to encrypt information with a secure password.
Because this key is guessable, it has what’s called “low entropy”.

Entropy

In cryptography, entropy measures how random (that is, how unpredictable) a piece of data is. A user’s password tends to have low entropy, because it can be easily guessed or brute forced.
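A rough way to put numbers on this is to count bits. The sketch below is my own illustration, not a formal entropy estimator: it computes the upper bound for an 8-character password drawn from the 95 printable ASCII characters and compares it to a 32-byte random key. The helper name brute_force_bits is made up for this example.

```python
import math

def brute_force_bits(alphabet_size: int, length: int) -> float:
    """Upper bound on entropy in bits, assuming every character is chosen uniformly at random."""
    return length * math.log2(alphabet_size)

# 8 characters drawn from the 95 printable ASCII characters.
password_bits = brute_force_bits(95, 8)

# 32 bytes from a CSPRNG: all 256 bits are unpredictable.
random_key_bits = 32 * 8

print(f"8-char password: at most {password_bits:.1f} bits")
print(f"32-byte random key: {random_key_bits} bits")
```

Keep in mind this is the best case for the password. Real users don't pick characters uniformly at random, so the effective entropy of something like 'password1' is far lower than the bound suggests.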
The internet has a… medium level of entropy
How do we fix this? How do we have data that is completely random, but also something that only the user will know? This is where ‘hashing’ comes in to help.

Hashing

Hashing is the transformation of data to a fixed-length value that represents the original data.
What on earth does that mean? Let’s say we have the sentence:
“The quick brown fox jumps over the lazy dog”.
If we put that piece of data into a specific hash algorithm such as MD5, we will get a 128-bit (usually shown as 32 hexadecimal digits) representation of that sentence.
If I change the data even slightly, we get a radically different hash.
Spot the difference. Or rather, ‘dot’ the difference 😅
This string could be a single character (a) or a 20TB file; MD5 will always produce a 32-digit hexadecimal representation of that data. It actually works on a zero-length message too.
You would have seen this used as a ‘checksum’ when you download a file. If any part of that file has changed even slightly, the hash / checksum will be totally different.
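You can try this yourself with Python's hashlib. This is a small sketch using the sentence from above; md5_hex is just a helper name I've picked for the example. MD5 is used here only because it matches the article's example, not because it's a good choice for security.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()

sentence = b"The quick brown fox jumps over the lazy dog"

print(md5_hex(sentence))         # 9e107d9d372bb6826bd81d3542a419d6
print(md5_hex(sentence + b"."))  # e4d909c290d0fb1ca068ffaddf22cbd0
print(md5_hex(b""))              # d41d8cd98f00b204e9800998ecf8427e

# A megabyte of input still produces exactly 32 hex characters.
print(len(md5_hex(b"a" * 1_000_000)))  # 32
```

One extra dot at the end of the sentence, and the output has nothing visibly in common with the first hash.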
Hashing is considered a “One way function”. I promise you, in my lifetime, I will not be able to get back 20TB of data from a 32 character hash. So, it is ‘impossible’ to reverse a hash to get back the original data.
Yes yes, Hermione, I know there’s a way to reverse the hash. I’m getting there…
Why is this important?
Because instead of ‘password1’ protecting the data, a longer hash, such as 7c6a180b36896a0a8c02787eeafb0e4c, now is.
This is “better”. But it is not good enough.

Why is hashing not good enough?

Because, surprisingly, the entropy hasn’t changed.
It has just “changed outfits”.
Uh oh, the Harry Potter gifs have started…
The key is technically still ‘password1’, even if it looks significantly more secure.
Hashing is an excellent technique, but by itself, is a very poor form of security.
Below is a list of popular passwords, with their equivalent hash. Now, anyone with the smallest database knowledge will know where I’m going with this.
MD5, SHA-1, SHA-256. Doesn’t matter. They’re all insecure if used improperly.
There exist tables with a LOT of password/hash lookups. Many free. Many paid. cmd5.org has access to 24,000,000,000,000 MD5 password lookups. Chances are, the majority of passwords are in there.
If I get given a hash, how long does it take to look up what a password is? It’s an O(1) lookup. It’s instant. The attack is known as a ‘rainbow table attack’, a lookup against a database that has all the passwords, or, “all the colours of the rainbow”.
I didn’t even need to pick the hash algorithm. It’s still an O(1) lookup.
If I get access to a hash, I can reverse it into the password. If I know what hash you’re using, I can still brute-force with just the hash version of the password.
Remember kids, only hashing is bad.
Sometimes I see a developer try to fix this by using multiple hash functions, e.g. md5(sha1(sha256(password))). This is not clever, and databases for many, many combinations of nested hashes exist.
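Here is a toy version of the attack, as a hedged sketch. The five-entry password list and the helper names are made up for illustration, and strictly speaking this is a plain precomputed lookup table (real rainbow tables use a cleverer structure to save disk space). The effect on a hash-only database is the same though.

```python
import hashlib

# Tiny stand-in for the billions of entries a real table would hold.
COMMON_PASSWORDS = ["123456", "password", "password1", "qwerty", "letmein"]

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Precompute once, then reuse against every leaked hash-only database.
LOOKUP = {md5_hex(p): p for p in COMMON_PASSWORDS}

def reverse(hash_hex: str):
    """Return the password for a known hash, or None if it is not in the table."""
    return LOOKUP.get(hash_hex)

leaked_hash = md5_hex("password1")
print(reverse(leaked_hash))  # password1
```

The dictionary lookup is the whole attack. Swapping md5_hex for a nested md5(sha1(...)) only changes how the table is built, not how fast it is to use.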
Basic example
So how do we fix this? When hashing, always use a salt.

Salting

I feel like this blog post has reached ridiculous levels of terminology and anti-terminology, but I promise you — the best way to hash is to use a salt.
What is a salt.
A salt is a random string. Something super-random.
Probably 256-bit, or more — up to you.
This random string, or salt, is added to the password (to make the password more secure) and then hashed.
This prevents rainbow table attacks.
If I get given this hash, and try to put it into a rainbow table, I highly doubt that a database exists that has this password + salt combination:
Password1!+ae4f27435df0896820a8325ed9562854bf0413e9a847e48c8f3e22b5ef06568f
This salt can be stored somewhere in clear-text. For our “encrypted file”, it can be stored alongside it somewhere. For other cases, such as storing sensitive data that has been hashed with a salt in a database, the salt can be stored in clear-text.
A salt can be stored in a database in cleartext. Make sure you use a different salt each time
When the user types in their password and attempts to decrypt a file (or authenticate into a server, whatever you’re doing), the salt is added to the beginning or end of the password (the same position used when the hash was created) and then hashed.
If that hash matches our stored hash, then the password they entered must have matched the entry when it was created — which means they know the key/password.
MD5(Password1+ae4f27435df0896820a8325ed9562854bf0413e9a847e48c8f3e22b5ef06568f)
will always equal
dc5986dec531657e138ca4168a30dde3.
Same process — same result. You’re authenticated.
The slightest deviation will result in a different hash and the user will not be authenticated.
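As a sketch of that create-then-verify flow: the function names below are my own, and I've swapped MD5 for SHA-256. It deliberately uses a single fast hash with no stretching, so treat it as an illustration of salting only, not as a finished password scheme.

```python
import hashlib
import hmac
import secrets

def hash_with_salt(password: str, salt_hex: str) -> str:
    # Illustrative only: one fast hash, no stretching.
    return hashlib.sha256((password + salt_hex).encode("utf-8")).hexdigest()

def create_record(password: str):
    """Generate a fresh 256-bit salt and return (salt, salted hash) for storage."""
    salt_hex = secrets.token_hex(32)  # 32 bytes from the OS CSPRNG
    return salt_hex, hash_with_salt(password, salt_hex)

def verify(password: str, salt_hex: str, stored_hash: str) -> bool:
    candidate = hash_with_salt(password, salt_hex)
    # Constant-time comparison, so timing doesn't leak how close a guess was.
    return hmac.compare_digest(candidate, stored_hash)

salt, stored = create_record("Password1!")
print(verify("Password1!", salt, stored))  # True
print(verify("Password1", salt, stored))   # False
```

The salt and the hash are both stored; the password never is.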
Not even close…

Question from a dev: Why can a salt be in clear text? If an attacker gains access to the whole database, they can see the salt? Why bother?

Great question. If we didn’t use a salt, then we would have a ‘hash only’ database — which means I can reverse each password very quickly based on pre-computed rainbow tables. Your database will immediately have results that match my hash database.
It means that every user who had Password1! as their password would share the same
0cef1fb10f60529028a71f58e54ed07b
hash. Hash-only is not any more secure than just storing a password, it is just in different clothing.
If an attacker gains access to a salt, they would have to generate an entire rainbow table PER USER. This is very time consuming. Slowing down attackers is our goal here.
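A quick sketch of the difference, with two hypothetical users (alice and bob are placeholder names, and SHA-256 is used as the example hash) who picked the same password:

```python
import hashlib
import secrets

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

password = "Password1!"

# Hash-only: both users end up with an identical stored value.
alice_unsalted = sha256_hex(password)
bob_unsalted = sha256_hex(password)

# Per-user salt: the stored values no longer match.
alice_salt, bob_salt = secrets.token_hex(32), secrets.token_hex(32)
alice_salted = sha256_hex(password + alice_salt)
bob_salted = sha256_hex(password + bob_salt)

print(alice_unsalted == bob_unsalted)  # True
print(alice_salted == bob_salted)      # False
```

With hash-only storage, cracking one entry cracks every user who shares that password. With per-user salts, each entry has to be attacked separately.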
Here is a website that tracks leaked databases and the % of passwords that have been successfully reversed. Any database that is hash-only usually has more than 95% of its passwords reversed: hashes.org

Bad Dev Move #3: Static Salt

I see this from time to time. The salt is embedded in the source code, which means that every time a key is generated, the salt used is always the same. This means that an attacker may need to generate their own rainbow table, but they only need to create one table that is based on your ‘static salt’.
A ‘static salt’ is affectionately called a ‘pepper’ and whether they’re good practice (when used properly with a salt) is debatable. I tend to warn people against using them.
If you do want additional protection similar to a static salt, you should encrypt the hash with an application/server-stored password. This allows you to decrypt the information when required, and also allows you to change the app/server-level password if it becomes compromised or if security policy needs to change.
But a static salt by itself is definitely bad practice.

Bad Dev Move #4: Insecure Salt

Your salt needs to be cryptographically random.
I hear you. WTF Mr. blog post writer, how can someone get randomness wrong?
It’s possible.
Previously, we were talking about entropy. Which of the below do you think is the most random?
I’ll give you a clue. It’s not rand(). Or a time-seeded rand()

You must salt with a CSPRNG

CSPRNG (pronounced see-spring) — Cryptographically Secure Pseudo Random Number Generator.
Pretty much every platform / language has access to CSPRNG algorithms. These tend to be, but aren’t always, linked to a hardware-generated random number generator. The important thing is that it is considered secure from a cryptography perspective.
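In Python, for example, the CSPRNG lives in the secrets module, while the random module is the "don't use this for security" generator. The sketch below shows why a time-seeded generator is a problem: anyone who can guess the seed can replay your "random" salt. The seed value is a made-up timestamp for illustration.

```python
import random
import secrets

# Time-seeded PRNG: same seed in, same "random" salt out.
seed = 1582243200  # hypothetical timestamp an attacker could guess
random.seed(seed)
weak_salt_1 = random.getrandbits(256)
random.seed(seed)
weak_salt_2 = random.getrandbits(256)
print(weak_salt_1 == weak_salt_2)  # True

# CSPRNG: backed by the operating system's secure randomness source.
strong_salt_1 = secrets.token_bytes(32)
strong_salt_2 = secrets.token_bytes(32)
print(strong_salt_1 == strong_salt_2)  # False
```

Other platforms have their own equivalents; check your language's documentation for the generator it explicitly labels as suitable for cryptography.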
If you think having a high-entropy random number generator doesn’t matter, I present to you exhibit A

Key Generation — the story so far

We know the key should be unique (password)
We know the password isn’t enough, and that we should add random data to the password (salt)
We know the salt needs to have high entropy (CSPRNG).
This is a pretty secure key… but…
Is this enough?

Stretching

We are so close. There’s one more part of our secure key generation story and that is ‘stretching’, or ‘key stretching’.
What is stretching? Basically the exact same Hash(pass+salt) process, but the process is repeated a lot. When I say a lot, I mean 1000, 10,000 or even 100,000 times.
Round and round and round and round
Why do we do this? Because it’s slow to calculate. And slow means that it’s significantly slower for an attacker to generate a rainbow table for a given salt.
Best practice is usually to do as many rounds as it takes to make the calculation last about 1 second, but this will absolutely depend on what resources you’re utilising, so it may not always be possible.
Slowing down an attacker is the goal. It’s why we have long passwords, it’s why we also perform stretching.
How do we ‘stretch’? We use a specific stretching algorithm. Some of these (scrypt and Argon2, for example) are designed to be resistant to GPUs and other highly parallel custom-built hardware. Don’t just put it in a while (count != 1000) loop.
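As a minimal sketch of what using a real stretching algorithm looks like, here is PBKDF2 via Python's hashlib.pbkdf2_hmac. The derive_key helper and the 100,000 iteration count are assumptions for illustration; time it on your own hardware and adjust.

```python
import hashlib
import secrets
import time

def derive_key(password: str, salt: bytes, iterations: int) -> bytes:
    # PBKDF2-HMAC-SHA256, producing a 32-byte key (the size AES-256 expects).
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )

salt = secrets.token_bytes(32)
iterations = 100_000  # assumed value; tune to your hardware

start = time.perf_counter()
key = derive_key("Password1!", salt, iterations)
elapsed = time.perf_counter() - start

print(len(key))  # 32
print(f"{iterations} iterations took {elapsed:.3f}s")
```

The same password, salt and iteration count always produce the same key, so the salt and the iteration count need to be stored alongside whatever you encrypt.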
For example (numbers are approximate and not calculated by me): 
A Radeon 7970 can calculate approximately 3,375,000,000 raw SHA-1 hashes per second. The same graphics card using PBKDF2 with HMAC-SHA1 with 10,000 iterations — only 140,000 hashes / second.
8 char password with PBKDF2-HMAC-SHA1 = 95⁸ / 140,000 =~ 1502 years.
8 char password with raw SHA-1 = 95⁸ / 3,375,000,000 =~ 23 days.
Significant difference.
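If you want to check the arithmetic, here it is as a short script. It takes the two hash rates quoted above at face value and assumes the attacker has to walk the entire keyspace of 8-character passwords over 95 printable ASCII characters, which is the worst case for them.

```python
SECONDS_PER_DAY = 60 * 60 * 24
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365

keyspace = 95 ** 8  # every 8-character password over 95 printable characters

raw_sha1_rate = 3_375_000_000  # hashes/second, figure quoted above
pbkdf2_rate = 140_000          # hashes/second, figure quoted above

raw_days = keyspace / raw_sha1_rate / SECONDS_PER_DAY
pbkdf2_years = keyspace / pbkdf2_rate / SECONDS_PER_YEAR

print(f"raw SHA-1: {raw_days:.1f} days")
print(f"PBKDF2:    {pbkdf2_years:.0f} years")
```

On average an attacker finds a given password after searching half the keyspace, so halve both numbers for the expected case. The ratio between them stays the same.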

Key Concepts (pun intended) and recommendations

Not stored key: User’s password. It is recommended to enforce a good password policy to help prevent brute-force attacks (minimum length, uppercase, not a common word, etc.)
Stored salt: Random data generated with a CSPRNG. The CSPRNG will depend on the platform you’re using.
Hash: One-way function turning data into a fixed-length value representing the original object. MD5 is NOT recommended for security (it’s fast, and it’s prone to collisions). SHA-1 is also no longer recommended for hashing.
Anything from the SHA-2 family is currently recommended (SHA-224, SHA-256, SHA-384 and SHA-512), although be aware of “length extension attacks”, which affect most variants of the SHA-2 family. Don’t mess with the cryptanalytic folks: those people are crazy-good. Also, don’t just pick SHA-512 “because it’s the bigger number” 🤦🏻‍♂️.
The SHA-3 family doesn’t have widespread adoption yet, but we’ll move to it eventually, once weaknesses within SHA-2 are found. You’re welcome to use it if appropriate.
Stretching: Rinse and repeat. Current recommendations are PBKDF2 with HMAC and SHA-(1/256/512 whatever works for your case). bcrypt is also good. scrypt is good. Argon2 is great, but I have no experience with it. There are others — lots of them are good. Google the one that works for you and your platform of choice. Anything is better than a while() loop.
Symmetric Key Encryption: Uses the same key to encrypt and decrypt information. Current recommendations are AES-128 / AES-256. In 2018, be warned away from anything referencing AES 512. It doesn’t smell right; it just looks like a big number to please people. AES is currently defined for 128/192/256-bit key sizes only.
Don’t roll your own encryption: I don’t care how clever or small/big you are. Don’t do it. Here are two examples of high profile companies doing this. Including another security blunder by WhatsApp!

How does this apply to storing passwords for a web service?

When you have a web service, you should never be storing passwords.
You should only store stretched, salted hashes.
The salt ‘can’ be stored in a separate area to the hash, but it’s debatable if that’s segregation or just obfuscation.
When I type in my password (Password1!), your server-side code should add the known salt to my password, utilise the stretching algorithm (e.g. PBKDF2) x number of times (make sure it’s the same as when it was initially created), and it should output the same value as the stored hash in the database. If it doesn’t, then the initial value (the password) was incorrect.
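Putting that together, a registration and login flow might look something like the sketch below. The register and login function names, the dict standing in for a database row, and the iteration count are all assumptions for illustration. In a real service you would more likely lean on your framework's password hashing helper, which does the same job.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed; stored per record so it can be raised later

def register(password: str) -> dict:
    salt = secrets.token_bytes(32)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    # This dict stands in for a database row; the password itself is never stored.
    return {"salt": salt.hex(), "iterations": ITERATIONS, "hash": digest.hex()}

def login(password: str, record: dict) -> bool:
    salt = bytes.fromhex(record["salt"])
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, record["iterations"]
    )
    # Constant-time comparison against the stored stretched, salted hash.
    return hmac.compare_digest(digest.hex(), record["hash"])

row = register("Password1!")
print(login("Password1!", row))  # True
print(login("password1", row))   # False
```

Storing the iteration count next to the salt means you can increase it for new records as hardware gets faster, without breaking logins for existing ones.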

Why do we do this?

If an attacker steals our database, or it leaks somehow, they have no direct access to a user’s password. If two users have the same password, the hashes will be utterly different because their salts will be different. We have sufficiently protected our users’ data, unlike many of these examples.

Summary

Phew. Another long security blog post. I’m thinking of writing a Part 2. If you have any suggestions, or want me to write about more, message me on Twitter @proxyblue or hit me up on the best InfoSec Discord around.
If you liked this post, check out some of my other popular posts:
Introducing the InfoSec colour wheel — blending developers with red and blue security teams.
https://hackernoon.com/introducing-the-infosec-colour-wheel-blending-developers-with-red-and-blue-security-teams-6437c1a07700

You ‘do’ InfoSec right? What do you read? Who do you listen to?
https://louis.land/you-do-infosec-right-what-do-you-read-who-do-you-listen-to-e8d00b7d8ace
Also, I originally learned many of these concepts years ago thanks to this post.
Phew. Glad I got all that out.
Obligatory Scott Pilgrim Gif.
Peace out.

Written by proxyblue | Developer. Security Guy. Currently reading the internet. ❤️ innovation and NeuroTech. @proxyblue
Published by HackerNoon on 2020/02/21