[hashcash] Re: Hashcash performance improvements

  • From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Mon, 31 May 2004 09:37:17 +0100

Well there are the guidelines. In any case, a limit should be named.
Standards normally have arbitrary limits written into them, as without
them no computer with finite memory could ever claim to be standards

I have a feeling the 512 bytes of SMTP should be the guideline, minus
some reasonable amount for the command itself. That's over 6 80-column
lines, BTW.

Sounds reasonable. People might chew into it a bit because eg CAMRAM was proposing to transport a public key. Now these are eg 1024-bit + range, that's only 2-3 lines, but they have to be careful what other cruft they put in there.

The way I see it, if you really need to transport a lot of data for authentication, you should use your own header to do it. If you want it to be covered by hashcash, you can do a hash of your header and put the hash in the extension field, with equivalent results.

I'd like to rewind this whole discussion a bit though and point out that I don't see the need for the sub-puzzles. Is the only reason that it makes minting time more predictable by reducing the variance? I can't see much value in that, although maybe I will when I'm desperate to send an email quickly :-)

I think that's the primary goal, yes.

I believe that was Izzy's motivation. I'm kind of hung on this one, I did consider it also earlier (well back in 97 already I was thinking whether 2-way or k-way collisions would be better technically, but rejected for complexity). I hate adding complexity, and sub-stamps add complexity. (I like the fact that you can verify stamps with sha1sum binary plus a bit of sh script).

Yes, I buy this argument, at least a bit. Adding complexity also makes us less likely to get MUA vendors on board, especially the ones that want to make their own implementation (for whatever corporate reason, which may not be entirely rational).

We'd need to think carefully about how to limit this damage, and
whether the benefits outweigh the additional complexity in the spec.


An obvious solution is to limit the number of microstamps allowed per
macrostamp - this automatically puts a lower limit on the value of each
microstamp. Processing of a macrostamp should stop as soon as an
invalid microstamp is found, but valid microstamps should be kept so
that the same stamp can't be reused in the same attack.

I presume you mean would have to avoid:


being considered valid with the over-writing semantics.

Well, that and sending a batch of mails with the same set of stamps attached.

The down-side of the approach I proposed earlier is speed-optimized
versions might literally try padding out to another 64-9 boundary for
each sub-stamp, which would eat into extension space and make ugly
less human readable-stamps.

Indeed. But if you use a known pad character and shorthand, then the stamp appears small in the header and is only expanded for verification. But again, this rules out the old "pipe into SHA-1" ad-hoc verification method.

OK, I'm just sounding off here, it's not something I'm actually
interested in pushing.  Since we're already at the stage where all
"respectable" processors take only a few seconds on average to produce
a full-value stamp, is the variance really a problem?

Problem would be the 366Mhz laptop that I use quite a bit because it weights about 1/2 what my work 3Ghz laptop weighs, coupled with the variance examples anonymous gave -- eg say it takes 20sec instead of 1sec, and you get unlucky and it takes 4x longer than expected...

...then your mail is delayed for a couple of minutes before being sent. Big deal. It also implies, at present minting speeds, that you're using something like a 26-bit stamp, which is a bit bigger than most proposals to date. :)

Yes, I know I'm arguing both sides of this story. This is a relatively rare case for me - I don't actually have a strong opinion either way. Yet.

There is a slight difference in the real minting routine, which
actually assumes it's going to be fed a v1 stamp and therefore puts a
separator between the random and count fields.  I still don't see the
value in that separator, especially if the double-spend database is
going to contain the hash rather than the random field.  :)

Yes the reason for the separator if I recall was to recognize where the random part ended. There are issues with keeping just the stamp output:

- if you do sub-stamps you have to avoid re-use of sub-stamps

- crypto conservatively you want a fresh computation -- storing the
output does not ensure that, someone could make another computation
with the same random field

And finally its safer I think to have people not try (mentally)
combine randomness and counters.  If they are told ok, put 96-bit
randomness here, and then count as required here, less likely to be
security critical coding errors.

OK, I think I see your point. You can avoid the database storage and re-use problems by storing exactly 16 characters (96 bits) of the random field, along with the hash of the whole stamp. The final point is what really motivates the separator IMO, the rest can be done by just concatenating the random and counter.

So some musings towards retaining clean, short stamps (aka taking the
performance advantage out of padding):

- define max padding (eg 96 to 128-bits)

- define max counter-width (per sub-stamp counter) -- careful there is
no max, but practically eg 16 digits of base64 should be "enough"
future-proofing because takes forever to count to 2^96

I think these together should be enough to prevent utter stupidity - but only if they're also accompanied by a stipulation on the size of the random field to match, and if they apply across the entire stamp, not just the primary (otherwise we'll get people using maximum padding on each substamp). If we're going to store 16 characters from the random field in our database, perhaps that field should always be 16 characters wide, no more, no less.

alternative: just give up; how much perf-difference does it make?


OK, let's run estimates based on:

- The real MMX code I wrote yesterday, on an idealised Pentium-MMX that can execute two instructions on every cycle.

- An idealised processor which offers maximum efficiency (and looks suspiciously like a scalar PowerPC). This processor is capable of issuing and executing 3 simple integer ops per cycle, with single-cycle latency, has hardware rotate capability, and plenty of registers for both temporaries and a circular W buffer.

- The same instructions as the 3-issue idealised processor, but only single-issue - simulating a multi-pipe implementation, or one using Altivec.

The brute-force SHA-1 computation (standard, method-B version) requires:
- 16 rounds of F1()
-  4 rounds of F1() with XOR stage
- 40 rounds of F2() with XOR stage (F2() and F4() are identical)
- 20 rounds of F3() with XOR stage

Alternatively (this is the "compact" version):
- 64 rounds of XOR stage, filling W buffer
- 20 rounds of F1()
- 40 rounds of F2()
- 20 rounds of F3()

On the Pentium-MMX, the two versions are almost the same speed - the standard version is about 4% faster than the compact version. I'll use the standard version for these estimates, because it's faster on all machines capable of running it as efficiently as the compact version, but there really isn't a great deal in it.

Instruction counts (cycle counts):
            MMX     Ideal
F1        17  (9)   9 (4)
F1+XOR    27 (14)  13 (5)
F2+XOR    26 (13)  12 (5)
F3+XOR    29 (15)  14 (6)

Note that the "Ideal" times above don't take into account ILP between the end of one round and the beginning of the next, so there could be as much as 1 cycle less on each round involving an XOR stage. The MMX times, however, don't really have this possibility. Let's add up the totals for the brute-force algorithm:

MMX: 1020 cycles (NB: real CPU takes about 1300 cycles - L1 cache latency?)
Ideal: 404 cycles
Altivec: 956 cycles

Using Malcolm's work-reduction techniques and modifying only word 12 (bytes 48-51), we can shave off 12 F1 rounds and 11 XOR stages. This saves:

           F1   XOR  Both
MMX:      108    55   163 cycles
Ideal:     48    11    59 cycles (except that XOR *might* be "free")
Altivec:  108    44   152 cycles

The theoretical percentage saving is:
MMX:     16.0%
Ideal:   14.6% (if XOR has a cost)
Ideal:   14.1% (if it doesn't)
Altivec: 15.9%

These are all in the same ballpark as the 15% that Malcolm originally suggested.

A 15% boost is enough to make my 7450 running unoptimised Altivec, or Athlon running optimised MMX, reach 3.0 million hashes a second, rather than the 2.5-2.6 they do now. It's worthwhile, but not spectacular. The Pentium-MMX would reach 345000, taking it down to 3 seconds on average for a 20-bit token. Given that it's already below 3.5 seconds, that's not spectacular either.

It looks slightly more impressive when you consider it takes a 24-bit token down to 48 seconds from 56, on the latter machine. But that means, from the geometric distribution discussed earlier, the probability of completing within 1 minute is now about 70%, instead of 66% - again, not a huge improvement.

So could have an option to do it, or do it by default and turn it off
automatically if you start using large extensions.  Or if advantage is
smallish (eg 1% range) just not do it (at least by default) and
concede that to the spammers.

Problem is, I just checked the length of my own e-mail address. An unpadded v1 stamp for "chromi@xxxxxxxxxxxxxxxxxxxxx" ends up as 65 characters, which makes for a lot of padding to the optimal case. Unless recipients have very short addresses - which most don't - they'll get huge amounts of padding, which will certainly wrap to the next line and look ugly.

Beware, however, that spammers will presumably be able to insert junk into the extension field, so they get as much padding as they want, so it doesn't really make sense to artificially limit the allowed padding too much. So they get an extra 15% performance? Big deal, recipients can wipe that out manifold with a 1-bit depreciation, and senders can decide the ugliness is justified to get the performance themselves. So, make it optional.

Large extensions don't really affect the picture too badly - the padding is still to the next 64-byte boundary (minus some constant), which actually looks smaller next to a big extension. It's possible that a carefully-designed extension might reduce the padding needed, or even provide it automatically - the plain stamp with an average-sized address is actually near the worst case.

Another way to handle it would be to automatically pad the counter to the optimal position internally, but again this adds complexity and eliminates an alternative verification route.

Finally, the padding doesn't have to be to the "optimal" position. Counting in word 7 rather than word 12 makes you do 3 extra XOR stages and 5 extra F1 rounds per iteration, which still nets you better than half the boost - perhaps near 10%. The padding turns out to be to the 96+N*64 character mark, which is 29 characters for my own address and an 8-character count field. It still wraps to the next line on an 80-column screen, but it's not excessively long any more.

Food for thought, anyway.

from:     Jonathan "Chromatix" Morton
mail:     chromi@xxxxxxxxxxxxxxxxxxxxx
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.

Other related posts: