[hashcash] Re: Hashcash performance improvements
- From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
- To: hashcash@xxxxxxxxxxxxx
- Date: Mon, 31 May 2004 09:37:17 +0100
Well, there are the guidelines. In any case, a limit should be named.
Standards normally have arbitrary limits written into them, as
otherwise no computer with finite memory could ever claim to be
standards-compliant.
I have a feeling the 512-byte SMTP line limit should be the guideline,
minus some reasonable amount for the command itself. That's over 6
lines' worth.
Sounds reasonable. People might chew into it a bit because eg CAMRAM
was proposing to transport a public key. Now that these are eg in the
1024-bit+ range, that's only 2-3 lines, but they have to be careful
what other cruft they put in there.
The way I see it, if you really need to transport a lot of data for
authentication, you should use your own header to do it. If you want
it to be covered by hashcash, you can do a hash of your header and put
the hash in the extension field, with equivalent results.
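As a rough sketch of that idea (the header name and helper here are my own illustration, not anything from the spec):

```python
import hashlib

def extension_digest(header_value: str) -> str:
    """Hash an arbitrary (possibly large) authentication header down
    to a short hex digest that fits in the stamp's extension field."""
    return hashlib.sha1(header_value.encode()).hexdigest()

# Hypothetical oversized header carrying e.g. a public key:
big_header = "X-My-Auth: " + "A" * 300
digest = extension_digest(big_header)
# The 40-hex-char digest goes into the extension field; hashcash then
# covers the big header indirectly, with equivalent results.
print(len(digest))  # 40
```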
I'd like to rewind this whole discussion a bit though and point out
that I don't see the need for the sub-puzzles. Is the only reason
that it makes minting time more predictable by reducing the
variance? I can't see much value in that, although maybe I will
when I'm desperate to send an email quickly :-)
I think that's the primary goal, yes.
I believe that was Izzy's motivation. I'm kind of hung up on this one;
I did consider it earlier too (well, back in '97 I was already
thinking about whether 2-way or k-way collisions would be better
technically, but rejected the idea for complexity). I hate adding
complexity, and sub-stamps add complexity. (I like the fact that you
can verify stamps with the sha1sum binary plus a bit of sh script.)
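That ad-hoc check is easy to reproduce; here is a minimal sketch (in Python rather than sh, purely for illustration) of reading a stamp's value off its SHA-1:

```python
import hashlib

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # Count the zero bits at the top of the first non-zero byte.
        while byte & 0x80 == 0:
            bits += 1
            byte <<= 1
        break
    return bits

def stamp_value(stamp: bytes) -> int:
    """A stamp is valid at value v if stamp_value(stamp) >= v."""
    return leading_zero_bits(hashlib.sha1(stamp).digest())
```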
Yes, I buy this argument, at least a bit. Adding complexity also makes
us less likely to get MUA vendors on board, especially the ones that
want to make their own implementation (for whatever corporate reason,
which may not be entirely rational).
We'd need to think carefully about how to limit this damage, and
whether the benefits outweigh the additional complexity in the spec.
An obvious solution is to limit the number of microstamps allowed per
macrostamp - this automatically puts a lower limit on the value of
microstamp. Processing of a macrostamp should stop as soon as an
invalid microstamp is found, but valid microstamps should be kept so
that the same stamp can't be reused in the same attack.
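A sketch of that verification policy (the limit, the check function, and the stamp representation here are all hypothetical, just to pin down the semantics):

```python
MAX_MICROSTAMPS = 8  # assumed per-macrostamp limit, not a settled number

def check_microstamp(ms: str) -> bool:
    """Placeholder for the real validity check (hash value, date,
    resource); stand-in logic purely for illustration."""
    return ms.startswith("ok")

def verify_macrostamp(microstamps, spent_db: set) -> int:
    """Return the accepted count; stop at the first invalid (or
    already-spent) microstamp, but record every valid one so it
    can't be replayed in the same attack."""
    value = 0
    for ms in microstamps[:MAX_MICROSTAMPS]:
        if not check_microstamp(ms) or ms in spent_db:
            break            # stop processing on the first bad microstamp
        spent_db.add(ms)     # keep valid ones: no re-use across attempts
        value += 1
    return value

db = set()
print(verify_macrostamp(["ok1", "ok2", "bad", "ok3"], db))  # 2
print(verify_macrostamp(["ok1", "ok4"], db))                # 0: ok1 already spent
```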
I presume you mean we would have to avoid a reused sub-stamp
being considered valid with the over-writing semantics.
Well, that and sending a batch of mails with the same set of stamps
The down-side of the approach I proposed earlier is that
speed-optimized versions might literally try padding out to another
64-9 boundary for each sub-stamp, which would eat into extension space
and make ugly, less human-readable stamps.
Indeed. But if you use a known pad character and shorthand, then the
stamp appears small in the header and is only expanded for
verification. But again, this rules out the old "pipe into SHA-1"
ad-hoc verification method.
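For reference, the "64-9 boundary" comes from SHA-1's padding: the hash appends at least one 0x80 byte plus an 8-byte length, so a message of length 64n-9 fills its blocks exactly and the counter loop never touches an extra block. A sketch of the pad calculation (my own helper, not library code):

```python
def pad_to_sha1_boundary(stamp: str, pad_char: str = "A") -> str:
    """Pad a stamp so its length hits the next 64n-9 mark, i.e. so
    SHA-1's mandatory 9 bytes of padding exactly complete a block."""
    need = (55 - len(stamp)) % 64   # 55 == 64 - 9
    return stamp + pad_char * need

s = pad_to_sha1_boundary("1:20:040531:someone@example.com::abcdef:")
print((len(s) + 9) % 64)  # 0 -- no partial block left over
```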
OK, I'm just sounding off here, it's not something I'm actually
interested in pushing. Since we're already at the stage where all
"respectable" processors take only a few seconds on average to produce
a full-value stamp, is the variance really a problem?
The problem would be the 366MHz laptop that I use quite a bit because
it weighs about half what my 3GHz work laptop weighs, coupled with the
variance examples anonymous gave -- eg say it takes 20sec instead of
1sec, and you get unlucky and it takes 4x longer than expected...
...then your mail is delayed for a couple of minutes before being sent.
Big deal. It also implies, at present minting speeds, that you're
using something like a 26-bit stamp, which is a bit bigger than most
proposals to date. :)
Yes, I know I'm arguing both sides of this story. This is a relatively
rare case for me - I don't actually have a strong opinion either way.
There is a slight difference in the real minting routine, which
actually assumes it's going to be fed a v1 stamp and therefore puts a
separator between the random and count fields. I still don't see the
value in that separator, especially if the double-spend database is
going to contain the hash rather than the random field. :)
Yes, the reason for the separator, if I recall, was to recognize where
the random part ended. There are issues with keeping just the stamp
hash:
- if you do sub-stamps you have to avoid re-use of sub-stamps
- crypto-conservatively you want a fresh computation -- storing the
output does not ensure that, someone could make another computation
with the same random field
And finally, it's safer I think to have people not try to (mentally)
combine randomness and counters. If they are told: OK, put 96 bits of
randomness here, and then count as required here, there are less
likely to be security-critical coding errors.
OK, I think I see your point. You can avoid the database storage and
re-use problems by storing exactly 16 characters (96 bits) of the
random field, along with the hash of the whole stamp. The final point
is what really motivates the separator IMO, the rest can be done by
just concatenating the random and counter.
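A sketch of that storage scheme (the entry layout is my own illustration):

```python
import hashlib

RANDOM_WIDTH = 16  # 16 base64 characters == 96 bits

def db_entry(stamp: str, random_field: str):
    """Store the full 96-bit random field plus the hash of the whole
    stamp: the random field catches re-use of the randomness, while
    the hash ties the entry to one specific computation."""
    assert len(random_field) == RANDOM_WIDTH
    return (random_field, hashlib.sha1(stamp.encode()).digest())

seen = set()
entry = db_entry("1:20:040531:a@example.com::McMybZIhxKXu57jd:ckvi",
                 "McMybZIhxKXu57jd")
fresh = entry not in seen   # True the first time, False on a replay
seen.add(entry)
print(fresh)  # True
```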
So some musings towards retaining clean, short stamps (aka taking the
performance advantage out of padding):
- define max padding (eg 96 to 128-bits)
- define max counter-width (per sub-stamp counter) -- careful, there is
no max in principle, but practically eg 16 digits of base64 should be
"enough" future-proofing, because it takes forever to count to 2^96
I think these together should be enough to prevent utter stupidity -
but only if they're also accompanied by a stipulation on the size of
the random field to match, and if they apply across the entire stamp,
not just the primary (otherwise we'll get people using maximum padding
on each substamp). If we're going to store 16 characters from the
random field in our database, perhaps that field should always be 16
characters wide, no more, no less.
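Those stipulations might look something like this as a validity check (all three limits are the tentative values from the discussion, not settled numbers):

```python
MAX_PAD = 22          # assumed: ~128 bits of base64, top of the 96-128 bit range
MAX_COUNTER = 16      # assumed: 16 base64 digits per (sub-)stamp counter
RANDOM_WIDTH = 16     # assumed: exactly 96 bits, no more, no less

def fields_ok(random_field: str, counter: str, padding: str) -> bool:
    """Reject stamps whose field sizes fall outside the agreed limits;
    this would be applied to every sub-stamp, not just the primary."""
    return (len(random_field) == RANDOM_WIDTH
            and 0 < len(counter) <= MAX_COUNTER
            and len(padding) <= MAX_PAD)

print(fields_ok("McMybZIhxKXu57jd", "ckvi", ""))    # True
print(fields_ok("McMybZIhxKXu57jd", "x" * 40, ""))  # False: counter too wide
```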
alternative: just give up; how much perf-difference does it make?
OK, let's run estimates based on:
- The real MMX code I wrote yesterday, on an idealised Pentium-MMX that
can execute two instructions on every cycle.
- An idealised processor which offers maximum efficiency (and looks
suspiciously like a scalar PowerPC). This processor is capable of
issuing and executing 3 simple integer ops per cycle, with single-cycle
latency, has hardware rotate capability, and plenty of registers for
both temporaries and a circular W buffer.
- The same instructions as the 3-issue idealised processor, but only
single-issue - simulating a multi-pipe implementation, or one using
The brute-force SHA-1 computation (standard, method-B version) requires:
- 16 rounds of F1()
- 4 rounds of F1() with XOR stage
- 40 rounds of F2() with XOR stage (F2() and F4() are identical)
- 20 rounds of F3() with XOR stage
Alternatively (this is the "compact" version):
- 64 rounds of XOR stage, filling W buffer
- 20 rounds of F1()
- 40 rounds of F2()
- 20 rounds of F3()
On the Pentium-MMX, the two versions are almost the same speed - the
standard version is about 4% faster than the compact version. I'll use
the standard version for these estimates, because it's faster on all
machines capable of running it as efficiently as the compact version,
but there really isn't a great deal in it.
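For concreteness, the "compact" scheduling above (fill the whole W buffer with the 64 XOR stages first, then run the 20/40/20 rounds) looks like this in plain Python; this is a reference sketch for checking the round structure, nothing like the optimised MMX code:

```python
import hashlib
import struct

MASK = 0xFFFFFFFF

def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def compress(h, block):
    # 64 XOR stages up front: fill the whole W buffer ("compact" version).
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    a, b, c, d, e = h
    for t in range(80):  # 20 rounds F1, 40 of F2 (F4 == F2), 20 of F3
        if t < 20:
            f, k = (b & c) | (~b & d & MASK), 0x5A827999           # F1
        elif 40 <= t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC         # F3
        else:
            f, k = b ^ c ^ d, 0x6ED9EBA1 if t < 40 else 0xCA62C1D6 # F2/F4
        a, b, c, d, e = (rol(a, 5) + f + e + k + w[t]) & MASK, \
                        a, rol(b, 30), c, d
    return [(x + y) & MASK for x, y in zip(h, (a, b, c, d, e))]

def sha1_hex(msg: bytes) -> str:
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    # Standard SHA-1 padding: 0x80, zeros to 64n-8, then bit length.
    msg = msg + b"\x80" + b"\x00" * ((55 - len(msg)) % 64) \
              + struct.pack(">Q", 8 * len(msg))
    for i in range(0, len(msg), 64):
        h = compress(h, msg[i:i + 64])
    return "".join("%08x" % x for x in h)

print(sha1_hex(b"abc") == hashlib.sha1(b"abc").hexdigest())  # True
```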
Instruction counts (cycle counts):

              MMX        Ideal
F1          17  (9)     9  (4)
F1+XOR      27 (14)    13  (5)
F2+XOR      26 (13)    12  (5)
F3+XOR      29 (15)    14  (6)
Note that the "Ideal" times above don't take into account ILP between
the end of one round and the beginning of the next, so there could be
as much as 1 cycle less on each round involving an XOR stage. The MMX
times, however, don't really have this possibility. Let's add up the
totals for the brute-force algorithm:
MMX: 1020 cycles (NB: the real CPU takes about 1300 cycles - L1 cache
effects)
Ideal: 404 cycles
Altivec: 956 cycles
Using Malcolm's work-reduction techniques and modifying only word 12
(bytes 48-51), we can shave off 12 F1 rounds and 11 XOR stages. This
saves:
           F1    XOR    Both
MMX:      108     55    163 cycles
Ideal:     48     11     59 cycles (except that XOR *might* be "free")
Altivec:  108     44    152 cycles
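The totals follow directly from the per-round cycle costs in the instruction-count table, where an XOR stage costs the difference between Fn+XOR and plain Fn; a quick arithmetic check:

```python
# Per-round cycle costs taken from the instruction-count table:
f1_mmx, xor_mmx = 9, 14 - 9      # MMX: F1 = 9 cycles, XOR stage = 5
f1_ideal, xor_ideal = 4, 5 - 4   # Ideal: F1 = 4 cycles, XOR stage = 1

saved_f1, saved_xor = 12, 11     # rounds removed by the word-12 trick

print(saved_f1 * f1_mmx, saved_xor * xor_mmx,
      saved_f1 * f1_mmx + saved_xor * xor_mmx)      # 108 55 163
print(saved_f1 * f1_ideal, saved_xor * xor_ideal,
      saved_f1 * f1_ideal + saved_xor * xor_ideal)  # 48 11 59
```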
The theoretical percentage saving is:
Ideal: 14.6% (if XOR has a cost)
Ideal: 14.1% (if it doesn't)
These are all in the same ballpark as the 15% that Malcolm originally
reported.
A 15% boost is enough to make my 7450 running unoptimised Altivec, or
Athlon running optimised MMX, reach 3.0 million hashes a second, rather
than the 2.5-2.6 they do now. It's worthwhile, but not spectacular.
The Pentium-MMX would reach 345000, taking it down to 3 seconds on
average for a 20-bit token. Given that it's already below 3.5 seconds,
that's not spectacular either.
It looks slightly more impressive when you consider it takes a 24-bit
token down to 48 seconds from 56, on the latter machine. But that
means, from the geometric distribution discussed earlier, the
probability of completing within 1 minute is now about 70%, instead of
66% - again, not a huge improvement.
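Those percentages come straight out of the geometric model; a quick sketch of the arithmetic, using the hash rates quoted above:

```python
def p_done_within(bits: int, rate_hps: float, seconds: float) -> float:
    """Probability that a stamp of the given bit value is found within
    the time budget: each hash succeeds independently with
    probability 2**-bits (geometric distribution)."""
    tries = rate_hps * seconds
    return 1.0 - (1.0 - 2.0 ** -bits) ** tries

# 24-bit token on the Pentium-MMX, within one minute:
print(round(p_done_within(24, 345_000, 60), 2))  # 0.71 with work reduction
print(round(p_done_within(24, 300_000, 60), 2))  # 0.66 without
```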
So we could have an option to do it, or do it by default and turn it
off automatically if you start using large extensions. Or, if the
advantage is smallish (eg in the 1% range), just not do it (at least
by default) and concede that to the spammers.
Problem is, I just checked the length of my own e-mail address. An
unpadded v1 stamp for "chromi@xxxxxxxxxxxxxxxxxxxxx" ends up as 65
characters, which makes for a lot of padding to reach the optimal case.
Unless recipients have very short addresses - which most don't -
they'll get huge amounts of padding, which will certainly wrap to the
next line and look ugly.
Beware, however, that spammers will presumably be able to insert junk
into the extension field, so they get as much padding as they want,
and it doesn't really make sense to artificially limit the allowed
padding too much. So they get an extra 15% performance? Big deal:
recipients can wipe that out many times over with a 1-bit
depreciation, and senders can decide the ugliness is justified to get
the performance themselves.
So, make it optional.
Large extensions don't really affect the picture too badly - the
padding is still to the next 64-byte boundary (minus some constant),
which actually looks smaller next to a big extension. It's possible
that a carefully-designed extension might reduce the padding needed, or
even provide it automatically - the plain stamp with an average-sized
address is actually near the worst case.
Another way to handle it would be to automatically pad the counter to
the optimal position internally, but again this adds complexity and
eliminates an alternative verification route.
Finally, the padding doesn't have to be to the "optimal" position.
Counting in word 7 rather than word 12 makes you do 3 extra XOR stages
and 5 extra F1 rounds per iteration, which still nets you better than
half the boost - perhaps near 10%. The padding turns out to be to the
96+N*64 character mark, which is 29 characters for my own address and
an 8-character count field. It still wraps to the next line on an
80-column screen, but it's not excessively long any more.
Food for thought, anyway.
from: Jonathan "Chromatix" Morton
tagline: The key to knowledge is not to rely on people to teach you it.