A hash function takes an entry, for example, a number, phrase, file or password and calculates a sequence from it, usually represented in hexadecimal format.
Examples of hash functions are: MD4, MD5, SHA1, SHA256, SHA512.
The first prerogative of a hash function is that, given an entry and the hash calculated, it should not be possible to calculate the original value back. In the example function, the source word could have been any combination of the letters involved or even other letters and larger or smaller words.
Another prerogative of Hash algorithms is that, given an input, the output is always the same, for example, the Hash MD5 of the letter A will always be: bf072e9119077b4e76437a93986787ef, no matter how many times we use it.
In an efficient hash algorithm, when changing 1 bit of the input, at least half of the output bits must be changed, so while the MD5 of the letter A is bf072e9119077b4e76437a93986787ef, the MD5 of the letter B, which is only 1 bit apart is 30cf3d7d133b08543cb6c8933cdf6, something totally different.
This is to avoid reversing the Hash, that is, trying to deduce what the entry was from a known Hash.
That is why Hash algorithms are good for storing passwords, a user who has set his password with the word “secret” will have stored in the database or password file the word 88dabf3d5306998c155971257440f3be equivalent to his MD5 in hexadecimal notation.
Someone who has access to this database will not know directly that the password was “secret” and in fact neither the letters that were part of it or that the password was 6 letters long.
When that same user tries to authenticate, the system should calculate the Hash MD5 of the password entered and compare it with the stored word 88dabf3d5306998c155971257440f3be, if they match, the user has set the correct password.
But there is a problem, the MD5 Hash for the word “secret” is and always will be 88dabf3d5306998c155971257440f3be, that is, a brute force attack (where we test several words and calculate the MD5) would find short words that exist in the dictionary quickly.
This is even more serious in systems that only allow numeric passwords, many users use dates of birth, birth date of childrens or marriage date, if we consider all dates from 1900 to 2100, we have only 73048 days, in a system that asks for 8 digits where the user had 100 million possible combinations, testing only 0.073% of the combinations should break close to 99% of passwords.
This is why we ask users to add symbols, numbers and vary between upper and lower case, to increase the complexity of the attack.
Brute force attacks can be optimized for dictionary attacks, for example, we can calculate the hash of all words in English, peoples names, pets names, cities and dates with 6 and 8 digits.
A few Gigabytes are enough to store all the hashes and then, finding a password from the hash would take less than a second in an ordered database. So much so that there is a website that does this for free: https://crackstation.net/ and they provide the databases.
An attack where the Hash are pre-stored, the tables are called Rainbow Tables.
In this other site we have commands to generate our own tables with varied characters using up to GPU: https://project-rainbowcrack.com/table.htm, the result can be a few Terabytes of used disk, but it causes the attack to occur in seconds .
To avoid a dictionary attack or a rainbow table attack, the Salt technique is used, in a simple example, let’s say the salt is the letter A for the first time, and the letter B for the second time, we would calculate the MD5 of “APassword “and” BPassword “, which are respectively: bfea17d7c6a92d55f77b81a50326f421 and e6870c54d31bc0519cbc243a0cc2ccdc, totally different words.
We store salt along with Hash, for example, you can see this format on some systems: $A$bfea17d7c6a92d55f77b81a50326f421 and $B$e6870c54d31bc0519cbc243a0cc2ccdc. Salt does not have to be secret, as it only serves to prevent rainbow table attacks.
To validate a user’s password, we take the stored Salt, concatenate the typed password and calculate the Hash, if it matches the stored word, the user set the correct password.
Of course, in production environments, Salt is larger than 1 byte in the example.
Salt does not prevent dictionary attacks or brute force attacks (trying all combinations of letters, numbers and symbols), but the computational cost to crack a password starts to become inviable, for example, a 10-character password where customer can choose uppercase, lowercase, numbers and some symbols may take several years to be guessed.
Another point about hash algorithms is that they allow collision, that is, two or more data entries can generate the same hash, although a good algorithm tends to generate few collisions.
It is impossible to avoid collision, since for the set of inputs there is an infinite possibility (we can have from small files, with a single byte, to files with dozens of terabytes, where the content is random), while for the output set the possibilities are finite (for example, in the MD5 algorithm, we have 128 bits of output, which would give 16 characters, or 32 characters in hexadecimal notation), a large but still limited amount.
There are practical examples of collision on MD5, at https://www.mscs.dal.ca/~selinger/md5collision/ we have two strings that generate the same MD5 (note that they are slightly different):
d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89 55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0 e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70
and
d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89 55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0 e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70
And at https://natmchugh.blogspot.com/2015/02/create-your-own-md5-collisions.html the author used small changes in the bytes of the images (changes of 1 bit per pixel, which slightly changes the color, imperceptible to the human eye) to make these two images have the same MD5:
There are other demonstration examples, such as two PDFs with different content, but the same size, having been changed in the Metadata of the files so that there was an MD5 collision, or two different executable programs with the same MD5.
To avoid receiving a malicious file, one tactic is to use two distinct hash algorithms, computationally it is very expensive to generate two files that collide with two different algorithms, for example, MD5 and SHA1.
Despite these problems, Hash is an excellent solution for storing and validating passwords and for checking data and files received (ISO downloads for example).
To avoid problems, always try to use the latest available options and libraries provided by the programming language.
IMPORTANT: MD5 was used here only as a didactic example, it is known that it is possible to calculate collision in MD5 from 2006 studies, that is, a long time ago …
IMPORTANT 2: the Hash calculation example shown in figure 1 is for didactic use only, do not use in production
IMPORTANT 3: the example of Salt presented here to store passwords, although correct, is very simplistic. If you need to store and authenticate passwords on your system, use the language’s native libraries, you won’t becoming a cryptography expert by reading an article on a blog post