*From*: Anonymous <cripto@xxxxxxx>*To*: hashcash@xxxxxxxxxxxxx, camram-spam@xxxxxxxxxx*Date*: Mon, 24 May 2004 21:43:40 +0200 (CEST)

Sorry, I forgot to put a note in there, I wrote it and it is public domain. A little late now, unfortunately... Mostly I was hoping that people could try it out on a few different machines and see if it lives up to the claim of relatively even computing times. Maybe try creating a 14 bit collision with: ./mempow -g 14 and do that like 10 times to average the timing, then report the results. You need to create a file mempow.dat of 16 MiB + 1 KiB in size. As described in the comments I concatenated two 10 MB files from random.org, but for timing purposes you can probably use anything. Just copy a file that is > 16 MB in size to mempow.dat. A few comments about the Dwork algorithm. Having implemented it I understand it better. Basically it is an RC4 variant which uses this large, fixed 16 MB array T to perturb the RC4 a bit. You run the RC4 variant for 2048 iterations, then hash the A array (the RC4 state array) to get the output of the hash, where you will look for trailing zeros in the normal way. You keep trying this with a different seed for the RC4 initialization until you get an output with the required number of zeros. Regular RC4 looks like, in pseudocode: byte A[256]; (Initialize A) i = j = 0; Loop: i = (i + 1) % 256; j = (j + A[i]) % 256; swap (A[i], A[j]) output A[ (A[i] + A[j]) % 256 ] They wanted to incorporate the 16 MB T array into this, to perturb the A modifications. They will use the "output" of the RC4 loop to update the index into T. The problem is that RC4 only outputs 8 bits at a time. So they changed to a 32-bit variant of RC4, still with a 256 element array A: * long A[256]; (Initialize A) i = j = 0; Loop: i = (i + 1) % 256; j = (j + A[i]) % 256; swap (A[i], A[j]) output A[ (A[i] + A[j]) % 256 ] The only difference here is that A is composed of longs, so the output is 32 bits at a time. Now the problem is that all of the functions only use the low byte of A, because everything is mod 256. So they added a cyclic shift (rotate) in there: long A[256]; (Initialize A) i = j = 0; Loop: i = (i + 1) % 256; j = (j + A[i]) % 256; * cyclic shift (i.e. rotate) A[i] right by 11 bits swap (A[i], A[j]) output A[ (A[i] + A[j]) % 256 ] This way the other bits of A come into play eventually. Now they want to incorporate T. What they will do is have an index c into the T array, and use that to perturb this algorithm. Then they will use the "output" of the RC4 loop to modify c: long A[256]; (Initialize A) * c = A[255]; i = j = 0; Loop: i = (i + 1) % 256; j = (j + A[i]) % 256; * A[i] = A[i] + T[c] cyclic shift (i.e. rotate) A[i] right by 11 bits swap (A[i], A[j]) * c = T[c] xor A[ (A[i] + A[j]) % 256 ] The first new line initializes c; the second perturbs A[i] based on T[c]; and the third calculates the new c by xoring T[c] with the usual "output" of RC4. That's the algorithm. You run this loop 2048 times, then SHA-hash the A array to get the output (which they truncate to 128 bits for some reason). It's essentially the 256-bit variant of RC4 except for that one line where A[i] gets T[c] added. This will gradually alter all the elements of the A array, affecting its eventual hash. T is accessed via the c index, which jumps around unpredictably as a function of T[c] and the RC4 output. This makes anticipating which c will be used impossible, prevents pre-fetching, and makes the memory accesses be as slow as possible. An annoying problem with this algorithm is its endian dependence. I first implemented it on a big-endian machine and then when I tried it on a little-endian it didn't get the same results. The reason is that when you go to SHA-hash A at the end, the natural way is to do the whole array in one call. But that treats A as bytes, while it is defined as 32 bit words. It therefore is sensitive to how the bytes of a word are ordered in memory, which depends on endianness. I wanted to follow bigendian conventions, so on a little-endian machine it has to do an extra step to correct the endianness before hashing. A possible way to fix the endian dependence would be to go back to the original form with byte-wide RC4, and to use some other trick to get 3 bytes of output at a time. I was thinking of using 3 consecutive bytes of the A array: byte A[256]; (Initialize A) * c = A[253 through 255]; i = j = 0; Loop: i = (i + 1) % 256; j = (j + A[i]) % 256; * A[i] = A[i] + T[c] swap (A[i], A[j]) * c = T[c] ^ A[ (A[i]+A[j]) through (A[i]+A[j]+3) ] In other words, update c using 3 consecutive bytes of the A array to form a 24 bit value, and xor with T[c]. I don't see any way to shortcut or anticipate or pre-fetch for the new c, because you won't know which 3 bytes are going to be used until you calculate the new A[i] and A[j]. Here we are only using 1 byte of T to perturb the A state but that should be sufficient; since we're going to SHA-hash A in the end, even the smallest change is enough. Another variant would be to use T[c] to perturb j rather than A[i]: byte A[256]; (Initialize A) c = A[253 through 255]; i = j = 0; Loop: i = (i + 1) % 256; * j = (j + A[i] + T[c]) % 256; swap (A[i], A[j]) c = T[c] ^ A[ (A[i]+A[j]) through (A[i]+A[j]+3) ] This would keep A as a permutation of 0..255 as in the usual RC4, which may be beneficial in terms of mixing. And it would still be impossible to calculate the new A state (for SHA hashing purposes) without doing all of the T fetches in precisely the right sequence. I rather like this one because of going back to the usual RC4 approach of keeping A as a permutation. Also the addition of T[c] to j is similar to how RC4 does its usual keying so there is some precedent for it. Using 32 bit values for A in the RC4 structure seems a little questionable to me. Certainly I wouldn't trust it as an RNG although in this application it is probably good enough, it just needs to make c jump around unpredictably enough to prevent cache prefetching. But the endian business is a pain for an algorithm designed to minimize speed differences on different architectures. If I were going to actually use this method I might try something along these lines. One additional problem with this general approach is that the large, fixed T array has a security issue. The problem is that there could be a back door in T which let you compute T[i] without fetching it from memory. It could be some kind of simple function of i. If someone knew this back door they could compute collisions much faster than anyone else. Special spam software could be sold to take advantage of this. So it's important that T be trustably random. I'm not sure how to solve this, other than to assume that random.org had their tables up for a long time so they wouldn't have put any back doors in them. I did find the "one million random digits" online, which were published in book form back in the 60s or whenever, but that's only half a megabyte and we need 16. It might be sufficient to produce the T random values pseudo-randomly in such a way that it was obvious that doing so would be slower than just fetching from memory, such as by iterating SHA a bunch of times. As long as you were sure that no new architectures could speed up that calculation to be faster than a memory fetch, it should be OK.

**Follow-Ups**:**[hashcash] Re: Sample code for Dworkian memory-bound POW***From:*Jonathan Morton

- » [hashcash] Sample code for Dworkian memory-bound POW
- » [hashcash] Re: Sample code for Dworkian memory-bound POW
- » [hashcash] Re: Sample code for Dworkian memory-bound POW
- » [hashcash] Re: Sample code for Dworkian memory-bound POW