[hashcash] Re: Hashcash performance improvements

  • From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Sun, 30 May 2004 03:47:37 +0100

>> This makes a slight conflict between the sender's interests and the
>> receiver's (a hostile sender may send a massive stamp which costs
>> practically nothing extra to compute).  If we cared about this we
>> would ideally want to enforce small stamps on the sender (measured in
>> SHA1 blocks) to make the receiver's verification fast.  Simple
>> "enforcement" would be if it cost the sender disproportionately more.
> As Jonathan has shown, the time taken to check a stamp is negligible,
> however there is still a matter of storage space.  Quoting Adam from
> 29th April:
> | > Does the recipient even need to care about the random field?
> |
> | He might care to enable him to store something compact in his double
> | spend database.  He need only store the date, recipient, and random
> | field.
> A maliciously large random field could then be used to fill up
> somebody's double spend database, or at least hurt performance if the
> DB is designed on the assumption that it contains small data.  I
> think it might be fair to place some arbitrary limit on stamp size to
> protect against extremes - 5000 bytes for example.  This should be
> much more than enough to satisfy any conceivable future expansion,
> but small enough to restrict processing time and storage space to
> manageable (negligible?) levels.

I think this is fair.  Is there already a size limit on e-mail headers, 
though?  If so, we should use that as a guide.
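
For what it's worth, a compact record along those lines might look
like this in C - the field sizes are my own illustrative guesses, not
anything from a spec:

```c
#include <stdint.h>

/* Hypothetical compact double-spend record, per the quoted suggestion:
 * store only the date, recipient (resource), and random field.  Capping
 * the random field is what keeps this record small and fixed-size. */
struct ds_record {
    uint32_t date;          /* YYMMDD packed into an integer */
    char resource[64];      /* recipient address, NUL-padded */
    char rand[16];          /* sender's random field, capped */
};
```

With a fixed-size record like this, a malicious oversized random field
simply can't bloat the database - it gets rejected or truncated at the
parsing stage instead.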

>> One way to do this would be to put the counter at the beginning,
>> however this is typically dangerous because then work can be
>> pre-computed and shared across multiple recipients.
> I don't think there's a significant problem here.  Carrying out some
> precomputation once for many recipients instead of doing it once for
> each recipient will only save around a second per many million
> recipients.

Also, storing enough precomputed counters to make the job worthwhile 
would cause memory bandwidth to be a limiting factor, which is going to 
be a lot slower than just computing the thing in the first place.
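
The arithmetic behind that "second per many million recipients"
estimate is simple enough to write down - taking roughly one
microsecond per SHA-1 compression as an illustrative figure (an
assumption, not a measurement):

```c
/* Back-of-envelope arithmetic: sharing one precomputed block across
 * recipients saves one SHA-1 compression per recipient.  The
 * microsecond-per-compression figure is an assumption for
 * illustration, not a benchmark result. */
double seconds_saved(double recipients, double usec_per_compression) {
    return recipients * usec_per_compression / 1e6;
}
```

So a million recipients at ~1 us per block saves about one second in
total - negligible against the minting cost itself.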

> All in all I think there should be a limited set of safe characters
> which will never change.  Hex digits is one option, base64 is
> another.  That applies to both the random and counter fields, by the
> way.

Strongly agree.  The base64 alphabet is well-established and relatively 
easy to encode, because we can easily extract a set of 6-bit fields 
from any number and use them in a lookup table.  Why make life hard for 
ourselves?
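
The 6-bit extraction trick is about as simple as encoding gets - a
sketch:

```c
#include <stdint.h>

/* Encode a 24-bit group as four base64 characters by masking out
 * successive 6-bit fields and indexing a lookup table. */
static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void encode24(uint32_t group, char out[4]) {
    out[0] = b64[(group >> 18) & 0x3f];
    out[1] = b64[(group >> 12) & 0x3f];
    out[2] = b64[(group >>  6) & 0x3f];
    out[3] = b64[ group        & 0x3f];
}
```

No multiplies, no divides, no per-character branching - it's hard to
imagine a codec that's cheaper to verify or harder to get wrong.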

>> It might also help in some sense if we chose things which are simple
>> to code efficiently, with no unrolled pre-computed tricks available,
>> to keep code simple -- if that imposes the same overhead on spammer
>> and normal user, it results in simpler code, with fewer unrolled,
>> pre-computed, funny-order-counted, aligned-field tricks and so on.
>> (So think about that also in your optimization -- what would fairly
>> prevent this optimization and keep things simpler -- we could adapt
>> the v1 spec.)
> Interesting, I never thought of making optimisations impossible, only
> taking advantage of them.  My opinion now is that some optimisation
> will always be available to those who are willing to find tricks and
> create the required spaghetti code.  Creating a system which cannot
> be optimised would be about as difficult as creating a secure hash
> function in the first place.
> I think the stamp format is fine as it is and would not benefit from
> any rearrangement simply to enforce minting times.  Remember we are
> only talking about 10-15% improvements over an efficiently
> implemented SHA-1 function.  If hashcash is too easy to compute, this
> is equivalent to saying that SHA-1 is broken.

Also agreed.  I'd rather have a format where the potential 
optimisations are relatively obvious, and get within a few percent of 
that ideal, than one where we've eliminated all the obvious 
optimisations but in which an attacker might find others (which he 
isn't going to share with us!).

>> Izzy Kindred was also suggesting earlier that version 1 should support
>> multiple sub-puzzles; if we do this it will make aligning at the end
>> of the SHA1 input block work less well, as there will be multiple
>> counters.
> Simply align every counter by padding out to the next boundary:
> 1:10,4:040525:foo::c82a215126afb2c2:0000000000000000416,
> 000000000000000000000000000000000000000000000000000000000005970,
> 000000000000000000000000000000000000000000000000000000000006119,
> 000000000000000000000000000000000000000000000000000000000007023
> Line breaks added for readability.  The first portion (not including
> the comma) is 55 bytes, then each counter adds 64 bytes == 1 block.
> Best computation time, horrible to look at though.  The extra puzzles
> just added an incentive to add more padding.

How about if the secondary counters are the same length as the primary? 
Then they can be overlaid on the primary field, so there's less 
padding involved - and leading zeroes could be omitted from the 
secondaries.  The problem then changes to "find the first N counter 
values that produce an X-bit value", which is on average equivalent to 
"find the first counter value that produces an (X+log2(N))-bit value", 
and remains easy to implement.
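
A sketch of that search loop - using a toy mixing function purely as a
stand-in for SHA-1, to show the control flow rather than real minting:

```c
#include <stdint.h>

/* Toy stand-in for SHA-1 (the splitmix64 finisher); used here only to
 * illustrate the multi-sub-puzzle search, not for real stamps. */
static uint32_t toy_hash(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return (uint32_t)(x >> 32);
}

/* Find the first n counter values whose hash has its top `bits` bits
 * clear.  Expected trials ~= n * 2^bits = 2^(bits + log2(n)), the
 * equivalence claimed above. */
uint64_t find_counters(int n, int bits, uint64_t out[]) {
    uint64_t c = 0, trials = 0;
    for (int found = 0; found < n; c++) {
        trials++;
        if ((toy_hash(c) >> (32 - bits)) == 0)
            out[found++] = c;
    }
    return trials;
}
```

Because one linear scan serves all N sub-puzzles, the counters come out
in increasing order - which is also what lets the overlay trick drop
leading zeroes from the secondaries.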

The above example could be recoded:

1:10,4:040525:foo::c82a215126afb2c2:0000000000000000416,5970,6119,7023

>> It would be easy enough to change the second condition to something
>> like "...will be aligned at the end of the block".
> Having considered how to tie these fast minting routines together, and
> looked over some other issues, here is what I think:
> Padding should be added to the counter, not to the random field.  Where
> you have defined the setup to be
>       - Common code assembles stamp string,
>       - Efficient mint function appends random and calculates counter,
> a better way would be
>       - Common code assembles stamp including random,
>       - Efficient mint function calculates counter.

I already do something like this, actually.  The random field is added 
and the SHA-1 block is assembled by a central routine, which then looks 
up a function pointer to the optimised minter in a table and calls it.  
The optimised minter then overwrites part of the pre-assembled SHA-1 
block with the counter.
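
In outline it looks something like this - the names here are
hypothetical, not lifted from the actual source:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the dispatch described above: common code assembles the
 * whole SHA-1 input block (random field included), then the minter
 * selected from the table rewrites only the counter region in place. */
typedef void (*mint_fn)(uint8_t block[64], size_t off, uint64_t counter);

static void mint_portable(uint8_t block[64], size_t off, uint64_t c) {
    /* overwrite 16 bytes of counter as lowercase hex, MSB first */
    static const char hex[] = "0123456789abcdef";
    for (int i = 15; i >= 0; i--, c >>= 4)
        block[off + i] = hex[c & 0xf];
}

/* index 0 = portable C; a real build would add MMX/AltiVec entries */
static const mint_fn minters[] = { mint_portable };

void set_counter(int which, uint8_t block[64], size_t off, uint64_t c) {
    minters[which](block, off, c);
}
```

The point of the split is that the hot loop only ever touches the
counter bytes; everything else in the block is assembled once, outside
the optimised code.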

Now, admittedly, padding with random data does potentially consume more 
entropy from the system's RNG, of which there is often a limited 
supply, but I don't know if that's significant or not, considering 
there's a whole lot of computation going on in-between each set of RNG 
calls.  For systems that are specifically built to bulk-generate 
hashcash, I suppose any such limitation could be circumvented by adding 
a hardware entropy generator - which is already present in a 
significant fraction of x86 systems.

>> That means we could potentially be looking
>> at a line-full (63 bytes) of extra padding near the end of the stamp,
>> depending on the size of the resource and extension fields.  Does this
>> matter?
> I guess we could just provide a selection of minting functions, some
> "fast", others "small".  There's no one right answer to the question of
> which is more important.

True.  I think most people, at least at the beginning, will have 
relatively small stamps, so padding them to the (55+N*64)-byte mark 
will still result in a 1-line stamp, which is quite acceptable.  One 
wonders, however, about the minority of people who have unusually large 
e-mail addresses, or need to use the extension field for something.
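
The padding arithmetic is trivial either way - SHA-1 appends 9 bytes
(the 0x80 terminator plus a 64-bit length), so a stamp whose length is
congruent to 55 mod 64 ends exactly on a block boundary:

```c
#include <stddef.h>

/* Smallest total length >= len that hits one of the (55 + N*64)-byte
 * marks discussed above, i.e. is congruent to 55 mod 64. */
size_t padded_len(size_t len) {
    size_t r = len % 64;
    return len + (r <= 55 ? 55 - r : 64 + 55 - r);
}
```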

>> At the moment, I'm working on functions which try to be generically
>> efficient on any compatible processor, rather than trying to
>> micro-optimise for any particular model.  I think I'll leave that part
>> up to you.  :)
> I agree, and we're making good progress here at finding speed tricks
> which can apply to all processors.  Now put your thinking caps on,
> because here's some more potential speedups:
> <snip>

> So W[12], which is bytes 48..51 of the block, does *not* affect the
> values of 11 of the W values which are calculated.  This means that
> these 11 values can be cached for all blocks which only vary in those
> bytes.  This also fits in very conveniently with the precalculating
> trick suggested before.

Bear in mind that for a single-pipe routine running on a superscalar 
processor, the XOR and shift calculations are essentially free, because 
they can very easily be overlapped with the other half of each round's 
computation.

Even for a double-pipe routine, omitting the XOR calculations only 
begins to make sense when less than four bitwise operations can be done 
in parallel, which is still only true for a minority of modern 
processors (which happens to include the G3).

The XOR stage can also be done one or two rounds in advance without 
consuming any extra load-store bandwidth or more than one additional 
register, which is particularly helpful on deeply pipelined machines.
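
For reference, the schedule in question - W[12] first enters the
expansion at t = 20, via the W[t-8] tap:

```c
#include <stdint.h>

static uint32_t rotl1(uint32_t x) { return (x << 1) | (x >> 31); }

/* The SHA-1 message schedule:
 *   W[t] = ROTL1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16])  for t = 16..79.
 * W[12] first feeds an expanded word at t = 20 (t-8 = 12), so the
 * expanded words before that point can be cached across blocks that
 * only vary in bytes 48..51. */
void sha1_expand(const uint32_t block[16], uint32_t W[80]) {
    for (int t = 0; t < 16; t++) W[t] = block[t];
    for (int t = 16; t < 80; t++)
        W[t] = rotl1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16]);
}
```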

At the moment, the single-pipe routines I have are still faster than 
the double-pipe ones (except, paradoxically, on the P4), though mainly 
because neither GCC nor CodeWarrior seems capable of keeping track of 
the huge number of operations and temporaries in a double-pipe routine. 
I've done all kinds of juggling to try and make the compiler behave, 
but it's been only slightly more effective than headbutting a brick 
wall.

I believe I can still get more performance out of a double-pipe 
algorithm, but I'll have to do it by writing assembly (or, preferably, 
generating it using a custom macro processor) rather than relying on 
the compiler.  I think the time to do that is after implementing the 
single-pipe MMX minter, which I plan to do today.

The custom macro processor would also allow special improvements to the 
single-pipe routines, including storing the entire 16-entry circular W 
buffer in registers (on the PowerPC, that is - it's the only popular 
machine with enough registers).  Regrettably, I don't know of any 
architecture (besides the Itanic, which doesn't really count) which has 
enough registers to store two versions of the W buffer at once!

Just to recap, I'd like to make my goals plain:

- Good performance on the vast majority of consumer hardware, so few 
consumers need to "buy" stamps rather than generating them.  The 
improvements on x86 so far, and the potential performance using 
carefully optimised MMX, combined with the fairly spectacular results 
on scalar PowerPC, all look promising for this.  If possible, extension 
of tests to 68K and ARM hardware would be helpful, as these 
architectures are in wide use in the mobile arena.

- At the very least, I want low-value stamps (which I term 
"signature-grade") to be available to all reasonable hardware (meaning 
"anything that can run a TCP/IP stack") within a few seconds.  I 
already know this is practical, because I pseudo-implemented a 
generator for the 6502 - it turned out to be capable of generating an 
8-bit token in about 5 seconds.  :)

- Excellent to spectacular performance on commodity server-grade 
hardware, so that the cost of buying stamps is low for those who still 
need to.  The good first-generation results using Altivec, and the 
potential for similar performance using SSE2 on AMD64, seem very 
promising on this front.

- Better performance on conventional hardware also makes the economics 
of building custom hardware less attractive.  This should at least 
delay the "arms race", even if it doesn't eliminate it.

from:     Jonathan "Chromatix" Morton
mail:     chromi@xxxxxxxxxxxxxxxxxxxxx
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.
