I recently sent Adam an updated version of libsha1.c, which is about 20% faster on x86 machines. After some more thought in this vein, I'd like to help write a set of "fast minting" routines which maximise sustained minting performance on a given class of machines. I think this is in our best interests, as determined spammers will do this kind of optimisation as soon as it becomes profitable for them, and we want to stay abreast of them at worst. Optimising the software available to the "good guys" limits the advantage spammers can gain over us, at least until they decide to use dedicated hardware. I've already determined that the vanilla x86 instruction set is incapable of computing SHA-1 hashes without spilling working variables to memory - the revised routine reduces this problem but doesn't eliminate it. There are 5 explicit working variables, 2 temporaries, and one frequently-accessed constant needed, but the x86 only has 8 GPRs, some of which are needed for system purposes. It should, however, be possible to compute SHA-1 more efficiently using MMX, which has a set of 8 registers separate from the GPRs - just enough to store all the needed values. Additionally, MMX would be able to compute two hashes simultaneously, potentially doubling the collision rate. I estimate that an efficient MMX routine could be as much as 4 times as fast as the present implementation - that's potentially 4 million collision tests per second from a recent Athlon. One disadvantage, though, is that MMX has no rotate instruction - it takes 4 instructions (move, shift left, shift right, or) to emulate that operation. Even more efficient routines could possibly be written using SSE2, but I don't have hardware capable of running such code. Someone else might want to implement an SSE2 extension once the MMX and Altivec routines are mature, so I'll try to ensure such extensions are easy to add. PowerPC machines (meaning Macs) have a rich set of registers and are not limited by spilling. I still think a faster implementation is possible, by running multiple "pipes" alongside each other and thus taking advantage of pipeline bubbles. Other register-rich architectures, including perhaps ARM, could benefit from the same code. On PowerPC machines supporting Altivec (G4 and G5 at present), there are 32 128-bit vector registers available, so it is theoretically possible to run a 16-pipe hash routine, consisting of 4 pipes each operating on 4-element vectors. This would probably achieve a rate of 1.6 million collision tests per second, or better, on a 400MHz G4. To allow multi-pipe implementations and reduce overhead, it's necessary to move away from the standard libsha1-style implementation, towards a dedicated minting routine. The standard libsha1 routines can be used to verify the token after minting and upon receipt. I suggest decoupling the token formatting from the hash minting, and simply passing a string and a bit-count request to the minter. The formatting only needs to be done once per token, so doesn't need to be specially optimised. The minter routine would then append the random salt string, which would be at least 64 (or 96) bits worth but extended (if necessary) to allow minting to progress on a single SHA-1 block - which is much more efficient than calculating SHA-1 afresh for the entire message. Hexadecimal and base64 strings are both feasible, though the tail of the salt string may be more efficiently coded in some subset of base64, eg. the first 16 letters of the alphabet. I'd like this to work within the official hashcash distribution, but if that is not possible, I hope there is no objection to my implementing a standalone minter. -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@xxxxxxxxxxxxxxxxxxxxx website: http://www.chromatix.uklinux.net/ tagline: The key to knowledge is not to rely on people to teach you it.