Computer Security lecture notes Copyright © 2009 Mark Dermot Ryan
The University of Birmingham
Permission is granted to copy, distribute and/or modify this document
(except where stated) under the terms of the GNU Free Documentation License,

One-way secure hash functions


Secure hash functions

Purpose and usage. Secure one-way hash functions (also known as message digest functions) are intended to provide proof of data integrity, by providing a verifiable fingerprint of the data. A one-way hash function H operates on an arbitrary length input message M, returning h=H(M). The important properties are:
  1. Given M, easy to compute h=H(M)
  2. Given h, hard to compute M such that h=H(M) -- "one-way", or "pre-image resistant"
  3. Given M, hard to find M' (different from M) such that H(M)=H(M') -- "second-preimage resistant"
  4. (Not always satisfied) Hard to find M,M' such that H(M)=H(M') -- "collision resistant"
Note that 4 implies 3 (i.e. if we could solve 3 we could solve 4), but not conversely. Also, 3 implies 2, but not conversely.  The strange thing about hash functions is that there are typically billions of collisions, or perhaps infinitely many (if the hash function really does take arbitrary-length input; most have some huge limit). But it is computationally hard to find a single one.

Examples of usage:

Typically, flipping a single bit in the input flips about half the bits in the output. Example (the characters 5 and 1 differ in exactly one bit):

[mdr@hubert]$ md5sum
There is $1500 in the blue box.
05f8cfc03f4e58cbee731aa4a14b3f03  -
[mdr@hubert]$ md5sum
There is $1100 in the blue box.
d6dee11aae89661a45eb9d21e30d34cb  -
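The same experiment can be sketched in Python with the standard hashlib module. This is an illustrative sketch, not part of the original notes: the helper name bit_difference is made up here, and the trailing newline is an assumption about what md5sum read from the terminal.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Assumption: md5sum read each line from the terminal including its newline.
m1 = b"There is $1500 in the blue box.\n"
m2 = b"There is $1100 in the blue box.\n"

h1 = hashlib.md5(m1).digest()
h2 = hashlib.md5(m2).digest()

print("input bits changed: ", bit_difference(m1, m2))
print("output bits changed:", bit_difference(h1, h2), "of", 8 * len(h1))
```

For a well-behaved hash the second figure should land somewhere near 64 of the 128 output bits.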



Brute force attacks on hash functions

The Birthday "paradox" is the surprising answer to this question: how many people must I gather in a room in order to have a probability >0.5 that two of them share the same birthday? (We assume that the birthdays are distributed randomly.) The answer is lower than one might guess: 23. Compare that with this question: how many people must I gather in a room in order to have a probability >0.5 that one of them shares the same birthday as me? The answer is much greater: 253.
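Both figures can be checked with a few lines of Python. This is a minimal sketch assuming 365 equally likely birthdays and ignoring leap years; the function names are made up for this illustration.

```python
def group_size_for_any_shared_birthday(days: int = 365, threshold: float = 0.5) -> int:
    """Smallest group in which some pair shares a birthday with probability > threshold."""
    n = 0
    p_all_distinct = 1.0
    while 1.0 - p_all_distinct <= threshold:
        n += 1
        # The n-th person must avoid the n-1 birthdays already taken.
        p_all_distinct *= (days - (n - 1)) / days
    return n


def group_size_for_my_birthday(days: int = 365, threshold: float = 0.5) -> int:
    """Smallest group in which somebody shares MY birthday with probability > threshold."""
    n = 0
    p_nobody_matches = 1.0
    while 1.0 - p_nobody_matches <= threshold:
        n += 1
        # Each person independently misses my birthday with probability (days-1)/days.
        p_nobody_matches *= (days - 1) / days
    return n


print(group_size_for_any_shared_birthday())  # 23
print(group_size_for_my_birthday())          # 253
```

The first question corresponds to finding any collision, the second to matching one fixed target, which is why the second number is so much larger.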

Suppose there are N possible hash values, and suppose that the output of a hash function is randomly distributed in this space. Now take a set of n strings.
To try to put these numbers into perspective: 10^19 microseconds is 317 000 years, while 10^38 microseconds is 10^24 years.
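The arithmetic behind these figures can be sketched as follows. The sketch assumes the standard birthday approximation (about sqrt(2 ln 2 * N) random strings give an even chance of a collision among N possible values) and one hash evaluation per microsecond. Both are assumptions made for illustration, and the function names are hypothetical. The figures 10^19 and 10^38 presumably correspond to 2^64 and 2^128 for a 128-bit hash.

```python
import math

MICROSECONDS_PER_YEAR = 365.25 * 24 * 3600 * 1e6


def strings_for_even_chance_of_collision(num_values: float) -> float:
    """Birthday approximation: number of random strings giving P(collision) of about 0.5."""
    return math.sqrt(2 * math.log(2) * num_values)


def years_at_one_per_microsecond(operations: float) -> float:
    """Time needed for the given number of operations at one operation per microsecond."""
    return operations / MICROSECONDS_PER_YEAR


# Sanity check against the birthday problem: N = 365 gives about 22.5, i.e. 23 people.
print(strings_for_even_chance_of_collision(365))

# For a 128-bit hash, a collision search needs on the order of 2^64 trials ...
print(f"{strings_for_even_chance_of_collision(2.0 ** 128):.3g}")

# ... while 10^19 and 10^38 microseconds translate into years as follows.
print(f"{years_at_one_per_microsecond(1e19):.3g}")
print(f"{years_at_one_per_microsecond(1e38):.3g}")
```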

The MD5 secure hash algorithm

MD5 was designed in 1991 by Ron Rivest, and is widely used; for example, it is used by Red Hat so that users can verify that packages they download have not been tampered with (e.g., by introducing trapdoors); PGP uses it for message signatures. MD5 takes an input of up to 2^64 bits (approx 10^9 gigabytes), and produces a 128-bit hash (how many collisions do you think there are?). Here is how it works.

Define four bit-munging functions on 32 bit variables:
F( X, Y, Z ) = (X & Y) | (~X & Z)        where &, | and ~ are bitwise AND, OR and NOT
and similar for functions G, H, I. Define four subroutines like this:
FF( a, b, c, d, M, s, t ) is  a := b + ( (a + F(b,c,d) + M + t) <<< s )
where <<< s means left-circular shift of s bits; and similar subroutines GG, HH, II. The algorithm works as follows.
  1. Pad the input so that its length is just 64 bits short of being a multiple of 512 bits. Then, append a  64-bit representation of the input's length (before padding).
  2. Initialise 32-bit variables A, B, C, D to some specific constants.
  3. For each 512-bit block M
    {
    (M0, M1, ..., M15) := M;
    (a,b,c,d) := (A,B,C,D);
    /* Round 1 */
    FF( a, b, c, d, M0, 7, 0xd76..... );
    FF( d, a, b, c, M1, 12, 0x..... );
    /* a further 14 calls to FF like these ones, totalling 16 calls */
    /* Rounds 2, 3, 4  are similar, consisting of 16 calls to GG, HH, II */
    (A,B,C,D) :+= (a,b,c,d)
    }

  4. Output the concatenation (A,B,C,D).
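
The steps above can be sketched in runnable Python. This is an illustrative sketch rather than the notes' own code: the constants the notes leave elided (shift amounts, the sine-derived table of additive constants, the initial values of A, B, C, D, and the little-endian byte order) are taken from RFC 1321, and the 64 calls to FF, GG, HH, II are written as one loop instead of being unrolled. The names md5_sketch and rotl are made up here.

```python
import hashlib
import math
import struct

# Per-step shift amounts and additive constants, as specified in RFC 1321.
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
T = [int(abs(math.sin(i + 1)) * 2 ** 32) & 0xFFFFFFFF for i in range(64)]
MASK = 0xFFFFFFFF


def rotl(x: int, s: int) -> int:
    """Left-circular shift of a 32-bit value (the <<< operator)."""
    x &= MASK
    return ((x << s) | (x >> (32 - s))) & MASK


def md5_sketch(message: bytes) -> str:
    # Step 1: pad to 64 bits short of a multiple of 512 bits, then append the length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack("<Q", (8 * len(message)) & 0xFFFFFFFFFFFFFFFF)

    # Step 2: initialise the four 32-bit chaining variables.
    A, B, C, D = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476

    # Step 3: process each 512-bit block.
    for offset in range(0, len(padded), 64):
        M = struct.unpack("<16I", padded[offset:offset + 64])
        a, b, c, d = A, B, C, D
        for i in range(64):
            if i < 16:    # round 1, function F
                f, g = (b & c) | (~b & d), i
            elif i < 32:  # round 2, function G
                f, g = (b & d) | (c & ~d), (5 * i + 1) % 16
            elif i < 48:  # round 3, function H
                f, g = b ^ c ^ d, (3 * i + 5) % 16
            else:         # round 4, function I
                f, g = c ^ (b | ~d), (7 * i) % 16
            new_b = (b + rotl(a + (f & MASK) + M[g] + T[i], S[i])) & MASK
            # Rotate the roles of the variables, as the successive FF calls do.
            a, b, c, d = d, new_b, b, c
        A, B, C, D = (A + a) & MASK, (B + b) & MASK, (C + c) & MASK, (D + d) & MASK

    # Step 4: output the concatenation of A, B, C, D.
    return struct.pack("<4I", A, B, C, D).hex()


print(md5_sketch(b"abc") == hashlib.md5(b"abc").hexdigest())
```

Comparing against hashlib.md5 on a few inputs is a convenient way to check that the sketch agrees with a reference implementation.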
The answer to the question about how many collisions there are in MD5 is this: each input collides with 2^(2^64) / 2^128 others, on average; this number is approximately 2^(2^64), which is approximately 10^(10^18). It seems paradoxical that collisions are so hard to find.
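
The size of that number can be estimated without ever writing it out, by working with its exponent. This is a rough sketch that treats the number of possible inputs as about 2^(2^64); the variable names are made up for illustration.

```python
import math

# Inputs: about 2^(2^64) messages; outputs: 2^128 hash values.
exponent_base_2 = 2 ** 64 - 128          # collisions per input is about 2^(2^64 - 128)
exponent_base_10 = exponent_base_2 * math.log10(2)

print(f"about 10^({exponent_base_10:.2e}) colliding inputs per hash value")
```

Dividing by 2^128 barely changes the exponent, and the result is a number whose decimal exponent itself has about 19 digits, the same ballpark as the estimate in the text.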

MD5 was invented by Rivest in response to partial cryptanalyses of its predecessor MD4 (also by Rivest). The improvements MD5 has over MD4 are a matter of quantity rather than style (four rounds instead of three, etc). SHA is also an MD4 derivative, but outputs 160 bits rather than 128. MD5 was thought to be secure until 2004, even though prior to that its security had been partly eroded; e.g., a paper by Hans Dobbertin in 1996 succeeded in producing a collision for a variant of MD5.

MD5 broken

Collisions for the full MD5 were announced on 17 August 2004 at CRYPTO 2004, by Xiaoyun Wang, Dengguo Feng, Xuejia Lai and Hongbo Yu. Their work combines theoretical analysis (based on differential cryptanalysis), which reduces the computational problem, with a brute-force attack on the reduced problem. They received a standing ovation for their work. Later, on 1 March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne de Weger demonstrated the construction of two X.509 certificates with different public keys and the same MD5 hash, a demonstrably practical collision. The construction included private keys for both public keys.

Do we need collision-resistance?

These are still only collisions, i.e. breaks of the strongest property in the list 1-4 above. One might argue that MD5 is still secure for use where only properties 1-3 are required. Many of the applications that use cryptographic hashes, such as password storage or document signing, are in principle only minimally affected by a collision attack. In the case of document signing, for example, an attacker could not simply fake a signature from an existing document -- the attacker would have to fool the private key holder into signing a preselected document. Reversing password hashing (e.g. to obtain a password to try against a user's account elsewhere) does not require collision resistance. Constructing a password that hashes to a given value requires a preimage attack.

Here are two reasons for believing that collision-resistance is important:


Colliding postscript files

Because MD5 makes only one pass over the data, if two prefixes with the same hash can be constructed, a common suffix can be added to both to make the collision more reasonable. And because the current collision-finding techniques allow the preceding hash state to be specified arbitrarily, a collision can be found for any desired prefix. All that is required to generate two colliding files is a template file, with a 128-byte block of data aligned on a 64-byte boundary, that can be changed freely by the collision-finding algorithm. Thus, it is possible to construct two colliding PostScript files of the same size with arbitrary contents.
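
The common-suffix property can be demonstrated on a toy single-pass hash. The sketch below is not an attack on MD5: it builds a deliberately weak iterated hash with a 16-bit state (MD5 truncated to two bytes serves as the compression function), for which colliding prefixes are found by brute force in a fraction of a second. All names and the 8-byte block size are hypothetical choices for illustration.

```python
import hashlib

BLOCK = 8  # toy block size in bytes (an arbitrary choice for this sketch)


def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function with a 16-bit state."""
    return hashlib.md5(state + block).digest()[:2]


def toy_hash(data: bytes) -> bytes:
    """Single pass over the data, one block at a time (padding omitted for brevity)."""
    assert len(data) % BLOCK == 0
    state = b"\x00\x00"
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state


def find_colliding_prefixes():
    """Brute-force two distinct one-block prefixes with the same toy hash."""
    seen = {}
    counter = 0
    while True:
        prefix = counter.to_bytes(BLOCK, "big")
        digest = toy_hash(prefix)
        if digest in seen:
            return seen[digest], prefix
        seen[digest] = prefix
        counter += 1


p1, p2 = find_colliding_prefixes()
suffix = b"common suffix".ljust(2 * BLOCK, b".")

print(p1 != p2, toy_hash(p1) == toy_hash(p2))
print(toy_hash(p1 + suffix) == toy_hash(p2 + suffix))
```

Once the two prefixes leave the hash in the same internal state, every block processed afterwards sees identical inputs, so any common suffix preserves the collision.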

Other hash functions

The literature contains a big family of hash functions, including MD{2-5}, SHA-{1,256,384,512} and RIPEMD{,-160}. They all work on similar principles to MD5, but they differ in how much munging they do and how many bits they produce.

SHA-1 is a direct competitor with MD5, since both of them were developed at the same time in an effort to strengthen MD4. SHA-1 produces 160 bits; thus its "collision difficulty" is much higher, 2^80 (about 10^24) compared to 2^64 (about 10^19) for MD5. So it is 2^16, or about 65 000, times as secure against brute force attacks.


                      SHA-1     SHA-256   SHA-384   SHA-512
Message digest size   160       256       384       512
Message size          <2^64     <2^64     <2^128    <2^128
Block size            512       512       1024      1024
Word size             32        32        64        64
Number of steps       80        64        80        80

(All sizes are in bits.)
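
The digest and block sizes can be read off Python's hashlib module, which reports them in bytes. This is a small sketch; the helper name sizes_in_bits is made up for this illustration.

```python
import hashlib


def sizes_in_bits(name: str) -> tuple:
    """Return (digest size, block size) in bits for a hashlib algorithm name."""
    h = hashlib.new(name)
    return 8 * h.digest_size, 8 * h.block_size


for name in ("sha1", "sha256", "sha384", "sha512"):
    digest_bits, block_bits = sizes_in_bits(name)
    print(f"{name:8s} digest {digest_bits:4d} bits, block {block_bits:5d} bits")
```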

Cryptanalysis techniques have already begun eroding the security of SHA-1. Although a full collision has not yet been obtained, the same group in China (Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu) announced in February 2005 that collisions of the full version of SHA-1 could be found requiring fewer than 2^69 operations (a brute-force search would require 2^80). Later, they reduced this to 2^63. Any analysis which reduces the complexity of a brute force attack is considered to be a break of the cryptography. In an interview, Yin states that, "Roughly, we exploit the following two weaknesses: One is that the file preprocessing step is not complicated enough; another is that certain math operations in the first 20 rounds have unexpected security problems." [5]

So what now? Migration to stronger hashes...


Keyed hashes

Keyed hashes (also known as MACs, for "message authentication codes") are hash functions parameterised by a key. They can be used to provide authenticity (since a message accompanied by a keyed hash may be considered signed by a person who has the key).

Ordinary hashes can be used to make keyed hashes. A simple idea is to set HMAC_K(M) = h( K, M ), but this is insecure: (depending on the hash function) the attacker may be able to compute h(K,M') out of h(K,M) where M' is a superstring of M. An alternative is HMAC_K(M) = h( M, K ), but if h is one-way but not collision-free, this could be broken as well. The possibility HMAC_K(M) = h( K, M, K ) has been considered, but the preferred one appears to be HMAC_K(M) = h( K1, h( K2, M ) ) where K1 and K2 are keys derived from K:

Let h be a hash. Let K be the key, padded with zeros to the length of h's block size, b bits. Let

ipad = 00110110 (=0x36), repeated b/8 times
opad = 01011100 (=0x5C), repeated b/8 times
K1 = K + opad
K2 = K + ipad, where + is XOR

Then HMAC_K(M) = h( K1 , h( K2, M ) )
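
The construction can be sketched in Python and compared against the standard library's hmac module. This is a minimal sketch, with two assumptions not stated in the notes: SHA-256 is used as h purely for illustration, and keys longer than one block are first hashed, as RFC 2104 prescribes. The name hmac_sketch is made up here.

```python
import hashlib
import hmac


def hmac_sketch(key: bytes, message: bytes, hash_name: str = "sha256") -> bytes:
    """HMAC_K(M) = h(K1, h(K2, M)) with K1 = K xor opad and K2 = K xor ipad."""

    def h(data: bytes) -> bytes:
        return hashlib.new(hash_name, data).digest()

    block_bytes = hashlib.new(hash_name).block_size  # b/8 in the notation above

    # Assumption from RFC 2104: over-long keys are hashed before padding.
    if len(key) > block_bytes:
        key = h(key)
    key = key.ljust(block_bytes, b"\x00")  # pad K with zeros to the block size

    k1 = bytes(byte ^ 0x5C for byte in key)  # K xor opad
    k2 = bytes(byte ^ 0x36 for byte in key)  # K xor ipad
    return h(k1 + h(k2 + message))


key = b"secret key"
msg = b"There is $1500 in the blue box."
print(hmac_sketch(key, msg) == hmac.new(key, msg, "sha256").digest())
```

When verifying a received tag in practice, hmac.compare_digest is preferable to ==, since it avoids leaking information through comparison timing.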


References

[1] Bruce Schneier, Applied Cryptography. Second Edition, J. Wiley and Sons, 1996.
[2] William Stallings, Cryptography and Network Security, Principles and Practice,  Prentice Hall, 1999. Third Edition, 2003.
[3] The University of British Columbia Theoretical Physics department has a web page on cryptography, with lots of interesting remarks and links.
[4] Wikipedia articles on MD5 and SHA1.