[hashcash] Re: Hashcash performance improvements

  • From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Sat, 29 May 2004 01:46:33 +0100

>> I recently sent Adam an updated version of libsha1.c, which is about
>> 20% faster on x86 machines.  After some more thought in this vein, I'd
>> like to help write a set of "fast minting" routines which maximise
>> sustained minting performance on a given class of machines.
>
> In that case, I'd like to help out if I may.  Optimising assembly 
> language is
> a hobby of mine.  Although I mostly have experience of PPC so it seems 
> we
> overlap a bit there.
>
>> The minter routine would then append the random salt string, which
>> would be at least 64 (or 96) bits worth but extended (if necessary) to
>> allow minting to progress on a single SHA-1 block - which is much more
>> efficient than calculating SHA-1 afresh for the entire message.
>
> A couple of further suggestions: first, some of the hash computation 
> can be
> calculated just once.  For example, in the stamp
>
>       1:20:040428:foo::0298b26853c8ebb3:beca0
>
> there are 34 bytes up to and including the last ':'.  This means that 
> the
> first eight words of the block are constant, so the first eight steps 
> of the
> 80 can be calculated once and then re-used.  This could mean a 10%
> improvement, roughly.
>
> Second, the random salt string could even be extended to take up more 
> of the
> block.  Suppose the stamp above is extended to have 48 bytes up to the 
> last
> ':', then 12 words remain constant and hence 12 steps out of 80 don't 
> need to
> be repeated, a possible 15% speedup.

Good catch.  This could fairly easily be implemented by adding an inner 
loop to each routine, although it does make the code slightly messier.  
I'll keep it in mind.

At the moment, I'm filling in random data until the following 
conditions are satisfied:

- There are at least 16 characters of random (96 bits, as it's in 
base-64).

- After a separator, an 8-character count field (enough for a 32-bit 
count in base-16) won't straddle a block boundary.

It would be easy enough to change the second condition to something 
like "...will be aligned at the end of the block".

> Obviously this leaves us looking at a trade-off between stamp size and 
> minting time.

Indeed.  In this scheme, the possible maximum-efficiency stamp sizes 
are 55, 55+64, 55+128, etc. (55 is the maximum byte count in the final 
block - there needs to be space for the end-of-message marker and the 
8-byte message bit count).  That means we could potentially be looking 
at a line-full (63 bytes) of extra padding near the end of the stamp, 
depending on the size of the resource and extension fields.  Does this 
matter?

>> Hexadecimal and base64 strings are both feasible, though the tail of
>> the salt string may be more efficiently coded in some subset of 
>> base64,
>> eg. the first 16 letters of the alphabet.
>
> I agree, this would make advancing through the counter values cleaner 
> and faster.

My point exactly.  At the moment, however, I'm just using an index into 
a base-64 alphabet string, which appears to be fast enough for now.  
When we get down to the really fine detail, we'll be able to see 
whether this overhead is significant or not.

The compiler seems to boil it down to an indexed byte load and byte 
store, which is nearly trivial.  It is also easily overlapped during 
the beginning of a multi-pipe algorithm.  If the tail will be in a 
fixed position, it gets even cheaper.

> This could also depend on the particular mint function, for example,
> an optimised PPC G4 function might use one set of vector registers and 
> two
> sets of integer registers to implement a six pipeline minter.  Then it 
> might
> be convenient to use the first 24 letters of the alphabet, and perhaps 
> a
> simple matter to use both upper and lower case for a total of 48 
> different
> values per byte.  (The G4 can carry out one vector and two integer ops 
> per
> clock cycle, so this is a likely scenario).

That depends on the particular G4 you're talking about (the 7400 and 
745x are noticeably different), and the G5 is different again, but yes, 
the principle is valid.

At the moment, I'm working on functions which try to be generically 
efficient on any compatible processor, rather than trying to 
micro-optimise for any particular model.  I think I'll leave that part 
up to you.  :)

--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@xxxxxxxxxxxxxxxxxxxxx
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.


Other related posts: