[hashcash] Re: Sample code for Dworkian memory-bound POW

  • From: Anonymous <cripto@xxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx, camram-spam@xxxxxxxxxx
  • Date: Mon, 24 May 2004 21:43:40 +0200 (CEST)

Sorry, I forgot to put a note in there: I wrote it and it is public
domain.  A little late now, unfortunately...

Mostly I was hoping that people could try it out on a few different
machines and see if it lives up to the claim of relatively even computing
times.  Maybe try creating a 14 bit collision with:

./mempow -g 14

and run it ten times or so to average the timing, then report the results.

You need to create a file mempow.dat of 16 MiB + 1 KiB in size.
As described in the comments, I concatenated two 10 MB files from
random.org, but for timing purposes you can probably use anything.
Just copy a file of at least 16 MiB + 1 KiB to mempow.dat.
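
If you just want something to time against, a throwaway C program along
these lines would do (a sketch; only the size matters, the bytes are
merely pseudo-random):

        #include <stdio.h>
        #include <stdlib.h>

        /* Sketch: write 16 MiB + 1 KiB of pseudo-random bytes to
           mempow.dat.  Fine for timing tests; for real use T should be
           trustably random (see the back door discussion below). */
        int main(void)
        {
                FILE *f = fopen("mempow.dat", "wb");
                long i, n = 16L * 1024 * 1024 + 1024;

                if (f == NULL) {
                        perror("mempow.dat");
                        return 1;
                }
                srand(12345);
                for (i = 0; i < n; i++)
                        putc(rand() & 0xff, f);
                fclose(f);
                return 0;
        }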

A few comments about the Dwork algorithm.  Having implemented it I
understand it better.  Basically it is an RC4 variant which uses this
large, fixed 16 MB array T to perturb the RC4 a bit.  You run the RC4
variant for 2048 iterations, then hash the A array (the RC4 state array)
to get the output, in which you look for trailing zero bits in the
normal way.  You keep trying this with a different seed for the RC4
initialization until you get an output with the required number of zeros.

Regular RC4 looks like, in pseudocode:

        byte A[256];

        (Initialize A)

        i = j = 0;
        Loop:
                i = (i + 1) % 256;
                j = (j + A[i]) % 256;
                swap (A[i], A[j])
                output A[ (A[i] + A[j]) % 256 ]
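
In C the output loop looks something like this (just a sketch: I skip the
real key setup and seed A with the identity permutation, which of course
is not an actual keying):

        #include <stdio.h>

        int main(void)
        {
                unsigned char A[256], t;
                int i = 0, j = 0, n;

                for (n = 0; n < 256; n++)       /* stand-in for key setup */
                        A[n] = n;
                for (n = 0; n < 16; n++) {      /* emit 16 keystream bytes */
                        i = (i + 1) & 0xff;
                        j = (j + A[i]) & 0xff;
                        t = A[i]; A[i] = A[j]; A[j] = t;        /* swap */
                        printf("%02x", A[(A[i] + A[j]) & 0xff]);
                }
                printf("\n");
                return 0;
        }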

They wanted to incorporate the 16 MB T array into this, to perturb the
A modifications.  They will use the "output" of the RC4 loop to update
the index into T.  The problem is that RC4 only outputs 8 bits at a time.
So they changed to a 32-bit variant of RC4, still with a 256 element array
A:

*       long A[256];

        (Initialize A)

        i = j = 0;
        Loop:
                i = (i + 1) % 256;
                j = (j + A[i]) % 256;
                swap (A[i], A[j])
                output A[ (A[i] + A[j]) % 256 ]

The only difference here is that A is composed of longs, so the output
is 32 bits at a time.  Now the problem is that all of the index
calculations only use the low byte of each A entry, because everything is
mod 256.  So they added a cyclic shift (rotate) in there:

        long A[256];

        (Initialize A)

        i = j = 0;
        Loop:
                i = (i + 1) % 256;
                j = (j + A[i]) % 256;
*               cyclic shift (i.e. rotate) A[i] right by 11 bits
                swap (A[i], A[j])
                output A[ (A[i] + A[j]) % 256 ]
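
In C, using uint32_t rather than long to stay exactly 32 bits wide on
64-bit machines, the rotate is a one-liner (the helper name is mine):

        #include <stdint.h>

        /* Cyclic right rotate of a 32-bit word by 11 bits. */
        static uint32_t rotr11(uint32_t x)
        {
                return (x >> 11) | (x << 21);
        }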

This way the other bits of A come into play eventually.  Now they want
to incorporate T.  What they will do is have an index c into the T array,
and use that to perturb this algorithm.  Then they will use the "output"
of the RC4 loop to modify c:

        long A[256];

        (Initialize A)

*       c = A[255];
        i = j = 0;
        Loop:
                i = (i + 1) % 256;
                j = (j + A[i]) % 256;
*               A[i] = A[i] + T[c]
                cyclic shift (i.e. rotate) A[i] right by 11 bits
                swap (A[i], A[j])
*               c = T[c] xor A[ (A[i] + A[j]) % 256 ]

The first new line initializes c; the second perturbs A[i] based on T[c];
and the third calculates the new c by xoring T[c] with the usual "output"
of RC4.

That's the algorithm.  You run this loop 2048 times, then SHA-hash the A
array to get the output (which they truncate to 128 bits for some reason).
It's essentially the 32-bit RC4 variant above (the one with the rotate)
except for that one line where A[i] gets T[c] added.  This will gradually
alter all the elements
of the A array, affecting its eventual hash.
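
Putting the pieces together, here is how I would sketch the inner loop
in C.  This is my reconstruction, not the actual mempow source; in
particular I assume T holds 2^22 32-bit words (16 MiB) and that c gets
masked down to a valid index:

        #include <stdint.h>

        #define TMASK ((1u << 22) - 1)  /* 2^22 words of T: an assumption */

        /* Sketch of the inner loop described above.  The caller
           initializes A from the seed, runs this once, then SHA-hashes
           A and counts trailing zero bits. */
        void inner_loop(uint32_t A[256], const uint32_t *T)
        {
                uint32_t c = A[255] & TMASK;
                uint32_t i = 0, j = 0, t;
                int n;

                for (n = 0; n < 2048; n++) {
                        uint32_t tc = T[c];     /* the one memory fetch */
                        i = (i + 1) & 0xff;
                        j = (j + A[i]) & 0xff;
                        A[i] += tc;
                        A[i] = (A[i] >> 11) | (A[i] << 21);     /* rotate right 11 */
                        t = A[i]; A[i] = A[j]; A[j] = t;        /* swap */
                        c = (tc ^ A[(A[i] + A[j]) & 0xff]) & TMASK;
                }
        }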

T is accessed via the c index, which jumps around unpredictably as a
function of T[c] and the RC4 output.  This makes it impossible to
anticipate which c will be used next, prevents pre-fetching, and keeps
the memory accesses as slow as possible.

An annoying problem with this algorithm is its endian dependence.
I first implemented it on a big-endian machine and then when I tried
it on a little-endian machine it didn't get the same results.  The reason is
that when you go to SHA-hash A at the end, the natural way is to do
the whole array in one call.  But that treats A as bytes, while it is
defined as 32 bit words.  It therefore is sensitive to how the bytes of
a word are ordered in memory, which depends on endianness.  I wanted to
follow big-endian conventions, so on a little-endian machine it has to
do an extra step to correct the endianness before hashing.
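
The extra step amounts to serializing each word most-significant byte
first before hashing.  A sketch like this produces the same bytes on
either architecture, since it never looks at the in-memory layout:

        #include <stdint.h>

        /* Serialize A big-endian style into a byte buffer for SHA. */
        void serialize_be(const uint32_t A[256], unsigned char out[1024])
        {
                int n;

                for (n = 0; n < 256; n++) {
                        out[4*n]     = (A[n] >> 24) & 0xff;
                        out[4*n + 1] = (A[n] >> 16) & 0xff;
                        out[4*n + 2] = (A[n] >>  8) & 0xff;
                        out[4*n + 3] =  A[n]        & 0xff;
                }
        }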

A possible way to fix the endian dependence would be to go back to
the original form with byte-wide RC4, and to use some other trick to
get 3 bytes of output at a time.  I was thinking of using 3 consecutive
bytes of the A array:

        byte A[256];

        (Initialize A)

*       c = A[253 through 255];
        i = j = 0;
        Loop:
                i = (i + 1) % 256;
                j = (j + A[i]) % 256;
*               A[i] = A[i] + T[c]
                swap (A[i], A[j])
*               c = T[c] ^ A[ (A[i]+A[j]) through (A[i]+A[j]+2) ]

In other words, update c using 3 consecutive bytes of the A array to form
a 24 bit value, and xor with T[c].  I don't see any way to shortcut or
anticipate or pre-fetch for the new c, because you won't know which 3
bytes are going to be used until you calculate the new A[i] and A[j].
Here we are only using 1 byte of T to perturb the A state but that
should be sufficient; since we're going to SHA-hash A in the end, even
the smallest change is enough.
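
As a C sketch (T is now a plain byte array of 2^24 entries, exactly
16 MiB, and the function name is mine):

        #include <stdint.h>

        /* Sketch of the byte-wide variant: c is a 24-bit index built
           from three consecutive bytes of A (indices wrap mod 256). */
        void inner_loop_bytes(unsigned char A[256], const unsigned char *T)
        {
                uint32_t c = ((uint32_t)A[253] << 16) | (A[254] << 8) | A[255];
                unsigned char i = 0, j = 0, t, k;
                int n;

                for (n = 0; n < 2048; n++) {
                        unsigned char tc = T[c];        /* the one memory fetch */
                        i = i + 1;              /* unsigned char wraps mod 256 */
                        j = j + A[i];
                        A[i] = A[i] + tc;
                        t = A[i]; A[i] = A[j]; A[j] = t;        /* swap */
                        k = A[i] + A[j];
                        c = (((uint32_t)A[k] << 16)
                           | ((uint32_t)A[(unsigned char)(k + 1)] << 8)
                           |   (uint32_t)A[(unsigned char)(k + 2)]) ^ tc;
                }
        }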

Another variant would be to use T[c] to perturb j rather than A[i]:

        byte A[256];

        (Initialize A)

        c = A[253 through 255];
        i = j = 0;
        Loop:
                i = (i + 1) % 256;
*               j = (j + A[i] + T[c]) % 256;
                swap (A[i], A[j])
                c = T[c] ^ A[ (A[i]+A[j]) through (A[i]+A[j]+2) ]

This would keep A as a permutation of 0..255 as in the usual RC4, which may
be beneficial in terms of mixing.  And it would still be impossible to
calculate the new A state (for SHA hashing purposes) without doing all of
the T fetches in precisely the right sequence.
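
The sketch is the same as the previous one, except that T[c] now feeds
into j and nothing gets added into A[i]:

        #include <stdint.h>

        /* Sketch of the permutation-preserving variant: T[c] perturbs
           j, so A stays a permutation of 0..255 throughout. */
        void inner_loop_perm(unsigned char A[256], const unsigned char *T)
        {
                uint32_t c = ((uint32_t)A[253] << 16) | (A[254] << 8) | A[255];
                unsigned char i = 0, j = 0, t, k;
                int n;

                for (n = 0; n < 2048; n++) {
                        unsigned char tc = T[c];        /* the one memory fetch */
                        i = i + 1;
                        j = j + A[i] + tc;      /* the one changed line */
                        t = A[i]; A[i] = A[j]; A[j] = t;        /* swap */
                        k = A[i] + A[j];
                        c = (((uint32_t)A[k] << 16)
                           | ((uint32_t)A[(unsigned char)(k + 1)] << 8)
                           |   (uint32_t)A[(unsigned char)(k + 2)]) ^ tc;
                }
        }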

I rather like this one because it goes back to the usual RC4 approach
of keeping A as a permutation.  Also, the addition of T[c] to j is
similar to how RC4 does its usual keying, so there is some precedent
for it.  Using 32 bit values for A in the RC4 structure seems a little
questionable to me.  Certainly I wouldn't trust it as an RNG, although
in this application it is probably good enough; it just needs to make c
jump around unpredictably enough to prevent cache prefetching.  But the
endian business is a pain for an algorithm designed to minimize speed
differences on different architectures.  If I were going to actually
use this method I might try something along these lines.

One additional problem with this general approach is that the large, fixed
T array has a security issue.  The problem is that there could be a back
door in T which lets you compute T[i] without fetching it from memory.
It could be some kind of simple function of i.  If someone knew this
back door they could compute collisions much faster than anyone else.
Special spam software could be sold to take advantage of this.  So it's
important that T be trustably random.  I'm not sure how to solve this,
other than to assume that random.org had their tables up for a long time
so they wouldn't have put any back doors in them.  I did find the "one
million random digits" online, which were published in book form back
in the 60s or whenever, but that's only half a megabyte and we need 16.

It might be sufficient to produce the T values pseudo-randomly, in such
a way that it is obviously slower to recompute an entry than to fetch it
from memory, such as by iterating SHA a bunch of times.
As long as you were sure that no new architectures could speed up that
calculation to be faster than a memory fetch, it should be OK.
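
For instance, something like this (a sketch using OpenSSL's SHA1; the
iteration count and the chaining are arbitrary choices of mine):

        #include <openssl/sha.h>
        #include <string.h>

        /* Sketch: fill T from a long SHA-1 chain, emitting 20 bytes
           per 1000 iterations.  Recomputing any given T[i] from
           scratch means walking the chain up to that point, which is
           far slower than one memory fetch. */
        void make_table(unsigned char *T, size_t len)
        {
                unsigned char h[SHA_DIGEST_LENGTH];
                size_t off = 0, n;
                unsigned k;

                memset(h, 0, sizeof h);
                while (off < len) {
                        for (k = 0; k < 1000; k++)
                                SHA1(h, sizeof h, h);
                        n = len - off < sizeof h ? len - off : sizeof h;
                        memcpy(T + off, h, n);
                        off += n;
                }
        }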
