[hashcash] Re: Hashcash performance improvements

  • From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Sun, 30 May 2004 18:52:39 +0100

> btw it dumps core with -O3 -funroll-loops -march=pentium4 -mmmx before
> outputting anything on P4.

Hmm. I suspect a bug in the cpuid test, which I've corrected just now - the cpuid instruction clobbers several registers, which need to be accounted for along with the actual result code.
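To illustrate the kind of fix I mean, here is a minimal sketch using GCC-style inline assembly. It is not the code in the attachment; the names cpuid_regs, cpuid_query and cpu_has_mmx are made up for this example. The point is that all four of EAX, EBX, ECX and EDX are declared as outputs, so the compiler knows cpuid overwrites them. It assumes a GCC or Clang compiler and a CPU that actually implements cpuid; on anything else it just reports failure.

```c
#include <stdint.h>

/* Hypothetical helper type for this sketch: the four registers cpuid writes. */
typedef struct { uint32_t eax, ebx, ecx, edx; } cpuid_regs;

/* Run cpuid for the given leaf. Returns 1 on success, 0 if unsupported here.
 * Assumption: GCC/Clang extended asm on x86 or x86-64. Very old GCC versions
 * building 32-bit PIC code reserve EBX and would need a save/restore instead. */
static int cpuid_query(uint32_t leaf, cpuid_regs *out)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t a, b, c, d;
    /* Every register cpuid writes is listed as an output. If only EDX were
     * listed, the compiler could keep live values in EAX/EBX/ECX across the
     * instruction and they would be silently destroyed. */
    __asm__ volatile ("cpuid"
                      : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                      : "a"(leaf), "c"(0u));
    out->eax = a; out->ebx = b; out->ecx = c; out->edx = d;
    return 1;
#else
    out->eax = out->ebx = out->ecx = out->edx = 0;
    return 0;
#endif
}

/* MMX support is reported in bit 23 of EDX for leaf 1. */
static int cpu_has_mmx(void)
{
    cpuid_regs r;
    if (!cpuid_query(0, &r) || r.eax < 1)
        return 0;               /* leaf 1 not available */
    cpuid_query(1, &r);
    return (int)((r.edx >> 23) & 1u);
}
```

A caller would check cpu_has_mmx() once at startup and only then pick an MMX routine.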


I did a minor optimisation that might remove a data-dependency stall between each round - it appears to be slightly faster on the Athlon, but no change on the Pentium-MMX. I also updated the "MMX Compact" routine to use optimised assembly in the right places. The latter is still a little slower than the MMX Standard routine on both my PCs, so I've moved it out of the "preferred" position in the list.

These updates are attached.

>   3515196 AMD64/x86 MMX Standard 1x2-pipe
>   1432117 AMD64/x86 MMX Compact 1x2-pipe *
>
> The P4 likes the mmx assembler!

Indeed it does. It's still not quite as fast per-clock as my Athlon, but it obviously does respond well to code optimisation.


I think what we're seeing here is the Athlon having a preponderance of decode and scheduling logic, when compared to its back-end execution resources. This is a tradeoff that makes a lot of sense, given that the x86 ISA has an unusually small amount of instruction-level parallelism available, as a direct result of the tiny register file. The Athlon-64 keeps a similar decode engine (though expanded for the AMD64 support), but optimises the execution units to be more efficient.

The P4 has higher back-end throughput from its deep pipelines and high clock, but it pays for the high clock with a far less flexible decode engine. That's why the P4 works so well with SIMD instructions, which need few instructions (and so less decode bandwidth) relative to the amount of work they do. Based on this insight, we can extrapolate that if SSE2 has a similar range of integer instructions to MMX, we might see as much as twice the performance over MMX, since its registers are 128 bits wide rather than 64 - assuming it doesn't run out of back-end resources first.
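As a rough sketch of what that would look like, here is one rotate-and-add step (the sort of operation a SHA-1 round is built from) done on four 32-bit lanes at once with SSE2 integer intrinsics, where MMX would handle two. This is not taken from hashcash; the function names and the choice of step are only for illustration, and it falls back to plain C when the compiler isn't targeting SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>          /* SSE2 integer intrinsics */
#endif

/* Scalar reference: rotate one 32-bit word left by 5 bits and add k. */
static uint32_t rotl5_add(uint32_t x, uint32_t k)
{
    return ((x << 5) | (x >> 27)) + k;
}

/* Hypothetical 4-lane version: the same step on four independent words.
 * With SSE2 this is a handful of instructions for four results. */
static void rotl5_add_x4(const uint32_t in[4], uint32_t k, uint32_t out[4])
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* There is no packed rotate in SSE2, so build it from two shifts and an OR. */
    __m128i r = _mm_or_si128(_mm_slli_epi32(v, 5), _mm_srli_epi32(v, 27));
    r = _mm_add_epi32(r, _mm_set1_epi32((int)k));
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Portable fallback with the same results, one lane at a time. */
    for (int i = 0; i < 4; i++)
        out[i] = rotl5_add(in[i], k);
#endif
}
```

Whether this really gives double the MMX rate depends on how the back end executes 128-bit operations, which is exactly the caveat above.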

The G4 also has this kind of property, since the RISC design allows an exceptionally simple decode unit to have high throughput, but most of its execution resources are dedicated to the Altivec engine, and are completely unavailable to scalar programs. Both the P4 and Athlon use the normal scalar execution units for most SIMD work, so they have less peak throughput per clock, but have more available to scalar programs.

I think the G5 is a bit more balanced than the G4. It has a heavier decode engine, despite the RISC background, which allows it to effectively address more scalar units. The Altivec engine in the G5 is relatively weak, though it still offers a throughput advantage over the scalar units. However, I don't have one to hand, so I can't tell you how fast it does hashcash. =)

> btw I was thinking it would be useful to have a selection of hardware
> with linux shell accounts for people who are working on this.

Don't OSDL, or people like them, have a facility for that?


--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@xxxxxxxxxxxxxxxxxxxxx
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.
