Data integrity is the assurance that data has not been modified or corrupted during transmission. It is an important topic to grasp before explaining TLS. Let’s illustrate it now.
Suppose JayP and Joe are communicating over the public internet. Also, suppose both of them have a symmetric key and they have decided on the symmetric encryption algorithm to use.
Here JayP sends his encrypted message. Encryption provides confidentiality: no one without the key can understand the message.
However, that does not mean that a malicious Chady cannot interfere with the message and change it, even though he does not understand it.
For example, there is an attack known as a bit-flipping attack.
Let’s use a bank transaction example to show how damaging this attack can be.
Let’s say JayP is sending $100 to Joe.
Chady intercepts the message and, although he cannot read any of it, he flips some bits and forwards the modified message to the bank.
When the bank decrypts the message, it may show $1000 instead of $100.
That’s why, in addition to encryption, we need to make sure the data has not been altered.
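To make the bit-flipping idea concrete, here is a minimal sketch. It assumes a toy XOR stream cipher built from SHA-256 (purely for illustration, not a vetted cipher) and a hypothetical message layout with a fixed-width amount field. Real stream ciphers and counter-mode block ciphers are malleable in the same way when used without an integrity check.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key + counter (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypts or decrypts by XORing the data with the keystream."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))

key = b"key-shared-by-JayP-and-the-bank"  # hypothetical shared key
plaintext = b"PAY JOE $0100"               # hypothetical fixed-width layout
ciphertext = xor_cipher(key, plaintext)

# Chady does not know the key. Assumption: he knows (or guesses) where
# the amount sits and what it probably is.
offset = 9  # position of the first amount digit in this layout
tampered = bytearray(ciphertext)
tampered[offset] ^= ord("0") ^ ord("1")      # turns '0' into '1'
tampered[offset + 1] ^= ord("1") ^ ord("0")  # turns '1' into '0'

decrypted = xor_cipher(key, bytes(tampered))
print(decrypted)  # b'PAY JOE $1000'
```

Chady never learns the key or reads the plaintext, yet the bank decrypts a different amount.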
Back to our previous example.
How can Joe make sure that the message has not been modified? 🤦🏻♂️
The standard approach would be hashing.
Let’s recall how hashing is typically used for data integrity.
JayP will apply a hash function to the data he intends to send and will generate a fixed-size hash value, often referred to as the “digest” or “checksum”. Then he will send the data with its hash over the internet. If even a single bit of the message changes, the hash function will output a different hash.
When Joe receives JayP’s message, he will apply the same hash function to the message. Then, he compares the computed hash with the received hash. If both match, it indicates that the data has not been altered by a malicious Chady.
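The check described above can be sketched in a few lines. This assumes SHA-256 as the hash function; the message contents and the helper names are made up for the example.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Returns the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# JayP's side: compute the digest and send it along with the message.
message = b"Send $100 to Joe"
sent_digest = sha256_hex(message)

# Joe's side: recompute the digest and compare it with the received one.
def is_intact(received_message: bytes, received_digest: str) -> bool:
    return sha256_hex(received_message) == received_digest

print(is_intact(message, sent_digest))              # True
print(is_intact(b"Send $900 to Joe", sent_digest))  # False
```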
However, think for a second. Do you think this is enough?
What do you think a malicious Chady would do to alter the data and still pass the integrity check?
After all, hackers are smarter than that. Why not alter the message and also send a matching altered hash? 🤦🏻♂️ How, then, can Joe know that both the hash and the data have been modified?
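Here is a small sketch of that weakness, assuming SHA-256 and a hash that travels next to the message with no key involved. The message contents and helper names are invented for the example.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def joe_accepts(received_message: bytes, received_digest: str) -> bool:
    """Joe's plain-hash check: recompute and compare."""
    return sha256_hex(received_message) == received_digest

# What JayP actually sent.
original = b"Send $100 to Joe"
original_digest = sha256_hex(original)

# Chady replaces the message AND recomputes the hash. Anyone can do this,
# because the hash function is public and needs no secret.
forged = b"Send $900 to Chady"
forged_digest = sha256_hex(forged)

print(joe_accepts(forged, forged_digest))  # True: the forgery passes
```

The check only proves that the message and the hash agree with each other, not that they came from JayP.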
Here comes the role of HMAC.
HMAC stands for Hash-based Message Authentication Code. It is a cryptographic authentication technique that uses a hash function and a secret key.
In simplified form: HMAC = hashFunc(secret key + message). The real construction mixes the key in twice, as we will see further down.
For HMAC to work, both JayP and Joe should agree on a shared secret and a hash function.
In the TLS handshake article, you will see how the hash function is negotiated and how the secret key is generated through a key exchange protocol.
Let’s see the difference now.
Now the hash function is applied to both the encrypted text and the secret key.
After that, the generated MAC (message authentication code) is sent along with the encrypted text.
Let’s say Chady intercepts and changes the message along with its MAC.
Remember that Chady does not have the secret key, so the best he can do is hash the altered message without it, which will not produce a valid MAC.
Joe, after receiving the message, will apply the HMAC function to the message and the secret key.
The message authentication codes will not match, and Joe will know that the message has been tampered with.
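The same exchange can be sketched with Python's standard `hmac` module. The key, the ciphertext bytes, and the helper names below are placeholders chosen for the example, and SHA-256 is an assumed choice of hash function.

```python
import hashlib
import hmac

secret_key = b"key-shared-by-JayP-and-Joe"  # hypothetical shared secret

def make_mac(key: bytes, data: bytes) -> bytes:
    """Computes HMAC-SHA256 over the data with the given key."""
    return hmac.new(key, data, hashlib.sha256).digest()

def joe_accepts(key: bytes, data: bytes, mac: bytes) -> bool:
    """Recomputes the MAC and compares it in constant time."""
    return hmac.compare_digest(make_mac(key, data), mac)

# JayP sends the encrypted text together with its MAC.
ciphertext = b"\x8f\x12\xa7 placeholder encrypted bytes"
mac = make_mac(secret_key, ciphertext)
print(joe_accepts(secret_key, ciphertext, mac))  # True

# Chady alters the ciphertext. Without the key he can only attach a
# plain hash (or a MAC made with a guessed key), and neither matches.
forged = b"\x8f\x12\xa6 placeholder encrypted bytes"
print(joe_accepts(secret_key, forged, hashlib.sha256(forged).digest()))  # False
print(joe_accepts(secret_key, forged, make_mac(b"guessed-key", forged)))  # False
print(joe_accepts(secret_key, forged, mac))  # False: the old MAC no longer fits
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading bytes matched.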
So, let’s explain some of the details about MACs and HMAC, because I want you to be familiar with all the terms and to not be confused. 🫤
HMAC is an algorithm that generates a Message Authentication Code.
As explained, a message authentication code provides authenticity in addition to integrity because it uses a secret key, and that is the main way it differs from a regular hash.
There exist many algorithms for calculating a message authentication code, and HMAC is just one of them.
Another example is Poly1305.
Message authentication codes are also used in authenticated encryption, which we will talk about in a separate article.
HMAC is used in the TLS handshake, particularly in the “Finished” messages, where each side sends a MAC covering the entire handshake up to that point.
HMAC is well suited to TLS for several reasons:
HMAC's security has been extensively analyzed in the cryptographic community. The HMAC construction has been proven to offer security properties such as pseudorandomness, which means the output of HMAC resembles a random sequence to anyone who does not know the key, and unpredictability, which ensures that the output is difficult to predict even if the input is known, as long as the key stays secret.
These properties come from the way HMAC is built. It uses two fixed strings defined in the HMAC algorithm, called ipad (no, not Apple’s iPad; the RFC is from 1997) and opad, which are XORed with the key, followed by a series of steps that append data and apply the hash function twice to produce the final MAC.
The RFC particularly highlights resistance against birthday attacks, which aim to find two messages that lead to the same hash.
Okay, so why exactly does HMAC internally modify the key?
Well, in cryptography, a key needs to be random so that attackers can’t detect any patterns, because patterns make it easier for hackers to figure out the key.
That’s why the ipad and opad values are critical components in the HMAC computation process: they turn the single shared secret key into two different keys, one for the inner hash and one for the outer hash, ensuring that the key is mixed into the computation twice in a specific manner, which strengthens the security of the message authentication process.
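For the curious, the ipad and opad steps can be written out by hand. This is a minimal sketch that follows the construction in RFC 2104, assuming SHA-256 (whose block size is 64 bytes). The function name is made up, and in real code you would simply call the standard `hmac` module, which the sketch is checked against at the end.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks

def hmac_sha256_by_hand(key: bytes, message: bytes) -> bytes:
    # Keys longer than one block are hashed first, then every key is
    # zero-padded up to the block size.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    # ipad is the byte 0x36 repeated, opad is the byte 0x5C repeated.
    inner_key = bytes(b ^ 0x36 for b in key)
    outer_key = bytes(b ^ 0x5C for b in key)

    # Inner hash over (key XOR ipad) + message, then outer hash over
    # (key XOR opad) + inner digest.
    inner_digest = hashlib.sha256(inner_key + message).digest()
    return hashlib.sha256(outer_key + inner_digest).digest()

key = b"key-shared-by-JayP-and-Joe"  # hypothetical shared secret
message = b"Send $100 to Joe"

by_hand = hmac_sha256_by_hand(key, message)
reference = hmac.new(key, message, hashlib.sha256).digest()
print(by_hand == reference)  # True
```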
So that’s it! I hope you are enjoying this mini-series about TLS.