Computer Security lecture notes Copyright © 2004 Mark Dermot Ryan
The University of Birmingham
Permission is granted to copy, distribute and/or modify this document
(except where stated) under the terms of the GNU Free Documentation License,

One-way secure hash functions


Recap on computer security

The main concerns of computer security are:
Symmetric key encryption helps provide the first one, and a bit the second.

Secure hash functions

Purpose and usage. Secure one-way hash functions (also known as message digest functions) are intended to provide proof of data integrity, by providing a verifiable fingerprint, or signature, of the data. A one-way hash function H operates on an arbitrary length input message M, returning h=H(M). The important properties are:
  1. Given M, easy to compute h=H(M)
  2. Given h, hard to compute M such that h=H(M) -- "one-way"
  3. Given M, hard to find M' (different from M) such that H(M)=H(M')
  4. (Not always satisfied) Hard to find M,M' such that H(M)=H(M') -- "collision resistant"
Note that 4 implies 3 (i.e. if we could solve 3 we could solve 4), but not conversely. The strange thing about hash functions is that there are typically billions of collisions, or perhaps infinitely many (if the hash function really does take arbitrary-length input; most have some huge limit). But it is computationally hard to find a single one.

Examples of usage: unix passwords; integrity of downloaded files; fingerprints of PGP keys; signed fingerprints; etc.

Typically, a single bit swap in the input swaps about half the bits in the output. Example:

[mdr@hubert]$ md5sum
There is $1500 in the blue box.
05f8cfc03f4e58cbee731aa4a14b3f03  -
[mdr@hubert]$ md5sum
There is $1100 in the blue box.
d6dee11aae89661a45eb9d21e30d34cb  -

Brute force attacks on hash functions

The Birthday "paradox" is the surprising answer to this question: how many people must I gather in a room in order to have a probability >0.5 that two of them share the same birthday? (We assume that the birthdays are distributed randomly.) The answer is lower than one might guess: 23. A similar question is: how many 128-bit hashes must I compute before obtaining a clash? (Assume again that the hash function behaves randomly.) The answer is perhaps again surprising.

Let {x1, ..., xn} be n random numbers in the range 1 <= xi <= N. The probability of them all being different from a given number in the range is ((N-1)/N)^n. For this to be <0.5, n must be at least n = ln 2 / ( ln N - ln(N-1) ), which for large N is approximately (ln 2) N. Thus you need to test (ln 2) 2^128 inputs, approximately 10^38, to get a 0.5 chance of reversing the hash.

The more interesting question is: how large should n be in order to give a probability of 0.5 of a duplicate. The probability of no duplicate is
(N-1)/N * (N-2)/N * . . . * (N-n+1)/N
=
(1-(1/N)) * (1-(2/N)) * . . . * (1-((n-1)/N)

<
e-1/N * e-2/N * . . . * e-(n-1)/N

=
e-n(n-1)/2N
The middle inequality comes from 1-x<e-x. Setting this to be 0.5, approximating n(n-1) as n² and solving for N gives n=2 ln 2 sqrt(N). Thus for N=2128 , even n=1019 inputs gives us less than 0.5 chance of a clash. (1019 microseconds is 317 000 years.)

The MD5 secure hash algorithm

MD5 is widely used; for example, it is used by Red Hat so that users can verify that packages they download have not been tampered with (e.g., by introducing trapdoors); PGP uses it for message signatures. MD5 takes an input of up to 264 bits (approx 109 Gigabytes), and produces a 128-bit hash (how many collisions do you think there are?). Here is how it works.

Define four bit-munging functions on 32 bit variables:
F( X, Y, Z ) = (X && Y) || (!X && Z)
and similar for functions G, H, I. Define four subroutines like this:
FF( a, b, c, d, M, s, t ) is  a := b + ( (a + F(b,c,d) + M + t) <<< s )
where <<< s means left-circular shift of s bits; and similar subroutines GG, HH, II. The algorithm works as follows.
  1. Pad the input so that its length is just 64 bits short of being a multiple of 512 bits. Then, append a  64-bit representation of the input's length (before padding).
  2. Initialise 32-bit variables A, B, C, D to some specific constants.
  3. For each 512-bit block M
    {
    (M0, M1, ..., M15) := M;
    (a,b,c,d) := (A,B,C,D);
    /* Round 1 */
    FF( a, b, c, d, M0, 7, 0xd76..... );
    FF( d, a, b, c, M1, 12, 0x..... );
    /* a further 14 calls to FF like these ones, totalling 16 calls */
    /* Rounds 2, 3, 4  are similar, consisting of 16 calls to GG, HH, II */
    (A,B,C,D) :+= (a,b,c,d)
    }

  4. Output the concatenation (A,B,C,D).
The answer to the question about how many collisions there are in MD5 is this: each input collides with 2264 = 101018 others, on average. It seems paradoxical that collisions are so hard to find.

MD5 was invented by Ron Rivest in 1991 in response to partial cryptanalyses of its predecessor MD4 (also by Rivest). The improvements MD5 has over MD4 are just quantity rather than style (four rounds instead of three, etc).  SHA is also an MD4 derivative, but outputs 160 bits rather than 128.  MD5 is still thought to be secure, though at least one cryptanalysis by B. den Boer and A. Bosselaers in 1993 suggested some weaknesses, while a paper by Hans Dobbertin in 1996 succeeded in producing a collision to a variant of MD5. ((Look at Schneier p.430 for the sort of attack you could mount if you can find collisions in the hash function.))

Other hash functions

The literature contains a big family of has functions, including MD{2-5}, SHA-{1,256,384,512} and RIPEMD{,-160}. They all work on similar principles to MD5, but they differ in how much munging they do and how many bits they produce.

SHA-1 is a direct competitor with MD5, since both of them were developed at the same time in an effort to strengthen MD4. SHA-1 produces 160 bits; thus its "collision difficulty" is much higher, 2^80 = 10^24 compared to 2^64 = 10^19 for MD5. So it is 10000 times as secure against brute force attacks. What about its security against cryptanalysis? None known. Stallings: "little is known about its design criteria, so strength more difficult to judge". SHA-1 is slower, because it has to do more rounds and process more data. I guess it's between 2 and 10 times slower.


SHA-1
SHA-256
SHA-384
SHA-512
Message digest size
160
256
384
512
Message size
<2^64
<2^64
<2^128
<2^128
Block size
512
512
1024
1024
Word size
32
32
64
64
Number of steps
80
80
80
80

Keyed hashes

Keyed hashes (also known as MACs, for "message authentication codes") are hash functions parameterised by a key. They can be used to provide authenticity (since a message accompanied by a keyed hash may be considered signed by a person who has the key).

Ordinary hashes can be used to make keyed hashes. Let h be a hash. Let K be the key, padded it with zeros to make it the length of h's block size, b. Let

ipad = 00110110 (=0x36), repeated b/8 times
opad = 01011100 (=0x5C), repeated b/8 times
K1 = K + opad
K2 = K + ipad, where + is XOR

Then HMACK(M) = h( K1 , h( K2, M ) )



References

[1] Bruce Schneier, Applied Cryptography. Second Edition, J. Wiley and Sons, 1996.
[2] William Stallings, Cryptography and Network Security, Principles and Practice,  Prentice Hall, 1999. Third Edition, 2003.
[3] The University of British Columbia Theoretical Physics department has a web page on cryptography, with lots of interesting remarks and links.