Dev team at work

Yo I heard you got mathematics in your hashes


I’m going to walk you through an incredibly useful concept called “a hash”. When talking about software, it’s something you bump into very often. If you like to understand how programs work under the hood or to understand algorithms more deeply, this article may be useful for you.

What are Hashes?

Hash functions are functions that give a unique, fixed size identifier for any amount of data. The produced hash looks like a random string of a fixed length. Even a tiny change in the data produces huge changes in the end result.

The hash function used in this case is MD5 which is a one-way hash function based on the Merkle-Damgård construction. One-way means the original input of the function (the data being hashed) can’t be deduced or reverse-engineered from the resulting hash.

For example, the string “Hello Sujuwa!” produces the hash 4c6bd58f0490f85360f4e9842c3fec89. The string “Hello sujuwa!” produces 5894b2085f159b358cca4f6b6967f2ab which looks completely different (except for length and that it doesn’t use any special characters). From these hashes there’s practically no way to calculate the original string that the hash is based upon. The only theoretical way is for the attacker to guess or go through strings, hash them, and see if they match the target hash. As you can imagine, this takes X^64 iterations, where X is the length of the string. In practice, the number of needed iterations grows so fast it will take a very long time (think more years than hours or days) to find matches, with some caveats (long enough string, modern hash algorithm, ..).

Collision Resistance

Collision resistance of a hash function means how unlikely it is for two different sets of data to produce the same hash output. Of course if there are more combinations for input than output, there will be some theoretical amount of collisions. In any good hash algorithm the results for hashing are quite evenly distributed, meaning collisions are rare (as in 1 against 64^64 kind of rare).

The famous birthday problem is related to collision resistance. When you have a group of 23 people, the probability of 2 of them being born on the same date (ignoring the year) is around 50%. When you have 70 people, that likelihood is 99.9%. So to actually avoid collisions efficiently hashes have to be long enough. This is one of the reasons many modern hashes are at least 32-256 characters.

Use Cases for Hashes

Hashes have many uses but some of the more common ones are cryptography, checksums, digital fingerprints and randomization. In this example, authentication of origin (digital fingerprinting) is the one we’re interested in.

Verifying Data Integrity

How can we use hashes to check that some piece of data hasn’t been tampered with? This could be done for example by making an API request that contains both the transmitted data, and a hash that’s created with the data + a secret API-key authenticating the sender. If the reciver of the data also knows the API-key, they can also hash the data with the key, and verify if the data has been changed after the hash was created (for example by a man-in-the-middle attack).

The same principle can be used to check a data was transmitted correctly, by transmitting also the hash of the file. This way the receiver of the data can just hash it and check that the generated hash matches the one sent to them.

Managing user passwords in a database

A common use case is saving passwords. You know when you press “forget password” you are almost never sent the actual password, but just a reset link? Have you wondered why this is? It’s because most services don’t actually know your password, they just know its hash.

Actual passwords are rarely saved in a database, they’re almost always saved in hashed form (including a so-called “salt” - a random string that’s combined with the password when checking it). That way when the user enters their password, the password is hashed with the salt, and that hash is verified against the saved hash. If they match, the password is correct. If there is a breach and somebody steals the information from the database, the attacker doesn’t get the users’ passwords, just the hashed form.

This will make it much more difficult for people to try to use the information to test for users using the same password, as it is extremely difficult to try to guess the password just based on a hash. If the password is very common (let’s say “password”), then the attacker might try to hash the common passwords with the salts from the stolen database, and try to see if some of the user accounts use extremely common passwords. Still, this makes the process way more difficult than if the database just contained the passwords or non-salted hashes (in which case they could be compared against a dictionary of hashes of common passwords).

So now you should know that if some service can email your password, it’s a red flag, that they almost certainly store your password in a non-hashed form. If somebody hacks them, it is likely that your password will be compromised.

Recap / TLDR;

Now you should understand the basics of what hashes are, and something about how, why and when they are created. Understanding them is crucial for understanding in a more detailed level how blockchains work and how it’s verified that some information is not tampered with etc.

Some important things to remember:

  • Hashing can generate practically unique, fixed length identifiers from any data of any size
  • Hash collision is what happens when hash is not unique
  • Collision resistance says how rarely that happens (think birthday problem)
  • A hash can be used to verify that a given block of data has not been changed (just hash that data, and check the hash stays the same)
  • Some common uses include checksums and unique identifiers, cryptography and information security
  • They are so useful that it’s difficult to make an exhaustive list

I hope this gave you an general idea of what hashes are and that now you can see how useful they can be in many ways. Understanding hashes will make it easier to understand more complicated topics, like cryptography, blockchains, authentication and many more.