I studied math at Cornell, and even though I was in the bottom half of my cohort, I still felt I learned some decent maths. Specifically, we spent a lot of time doing “proof-based” mathematics, where you start with simple properties and with clear, logical steps, we started proving more complex properties. It took a while for me to get used to it, but eventually, I learned to enjoy it.
Why is this valuable? You could prove a property about a set of numbers, say even numbers or prime numbers, and then you could claim the property held for every single item in that set. You could keep coming up with examples of prime numbers, and those properties would hold. That’s really cool! This felt concrete and it was amazing how elegant the proofs were.
During my last semester, I decided to take “Applicable Algebra”, where we learned about information encoding and how RSA encryption worked. Things got a little harder to prove as we started to use terms like “infeasible” or “close to impossible.” Things started to feel less concrete, but we were still able to prove powerful properties.
Recently, I started reading about cryptography again and was reminded about these terms. Specifically, I came up with a definition of collision resistance:
If you’re not familiar with the above notation, I’ll provide some basics. Hash functions typically take pretty long things, like long sentences, and encode them into something much smaller. For example, encoding the sentence “The quick brown fox jumps over the lazy dog” into something like “x3G5.” This is useful because “x3G5” is shorter, which is useful for transportation or storage. Also, it is useful because “x3G5” effectively hides the value “The quick brown fox jumps over the lazy dog.” If I was the one encoding that info, I could reveal at a later time that “The quick brown fox jumps over the lazy dog” was my sentence and you can conduct the same encoding and say, “huh, it really did encode to ‘x3G5’.”
This is significantly less valuable if lots of other long sentences also resulted in the encoding “x3G5.” Compressing and identifying items is a lot less useful if something that’s supposed to be “unique” really isn’t. And if a bunch of sentences result in the encoding, I could lie and later say I was using a different sentence, if it benefitted me.
You might be thinking, okay, well why don’t we just use an encoding that is always guaranteed to have different encoded values? If we want the encoded version to be smaller, this is impossible. Why is that? Well, our original sentence used over 30 letters, but our encoding used only 4. Intuitively, we know that the number of “possibilities” for the sentence is a lot more than the “possibilities” for just 4 letters. This means we can’t uniquely encode all the sentences, some of them will have to result in the same encoding (if you want to see ).
So, to get around this, a lot of smart people came up with an encoding where it’s “really, really, really hard” to find two things that result in the same encoding. That’s what “infeasible” in the above definition means – “yes, some stuff results in the same encoding, but I bet you can’t find one!” And you know what, they are right, it is really hard to find two values with the same encoding for these functions.
This was an adjustment for me and my mental model. I was used to the concreteness and definitiveness of the proofs we learned at school, but nothing in the real world is concrete or definitive. Everything is probabilistic, whether you believe it or not. When you walk outside, there is a non-zero chance that you will come back home, yet we still go outside. I came to understand the probabilistic nature of the physical world and the risks I take on a daily basis, but I still thought that there was concreteness in math and computers! So it took another set of mental adjustments to get used to hash functions and collisions. So, given their probabilistic nature, how confident should we be?
For a rough idea of how hard it is to find two colliding values, a popular Stack Overflow answer indicates, "A mass-murderer space rock happens about once every 30 million years on average. This leads to a probability of such an event occurring in the next second to about 10<sup>-15</sup>. That's 45 orders of magnitude more probable than the SHA-256 collision."
So it's practically impossible – some would even say "infeasible" to find colliding inputs, which is great because these functions are the ones that hide your passwords, encrypt your emails, and obscure national secrets. And it all works because smart people are playing a game of, “bet you can’t!”
Also published here: https://www.doubtiswelcome.com/posts/hashcollision