[hashcash] Re: Hashcash performance improvements
- From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
- To: hashcash@xxxxxxxxxxxxx
- Date: Sun, 30 May 2004 18:52:39 +0100
> btw it dumps core with -O3 -funroll-loops -march=pentium4 -mmmx before
> outputting anything on P4.
Hmm. I suspect a bug in the cpuid test, which I've corrected just now
- the cpuid instruction clobbers several registers, which need to be
accounted for along with the actual result code.
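
For anyone following along, here is a minimal sketch of what "accounting for the clobbers" can look like with GCC-style inline assembly. This is not the actual hashcash code: cpuid_regs, cpuid_query and cpu_has_mmx are hypothetical names, and it assumes a GCC-compatible compiler and a CPU that implements CPUID. The point is that all four of EAX, EBX, ECX and EDX are declared as outputs, so the compiler knows they are overwritten even when only EDX is of interest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four registers CPUID writes. */
struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/* Run CPUID for one leaf. Returns 1 if executed, 0 on non-x86 builds. */
static int cpuid_query(uint32_t leaf, struct cpuid_regs *out)
{
    assert(out != NULL);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t a, b, c, d;
    /* All four registers are listed as outputs: CPUID overwrites every
     * one of them, and the compiler must not keep live values there. */
    __asm__ __volatile__("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "0"(leaf), "2"(0u));
    out->eax = a; out->ebx = b; out->ecx = c; out->edx = d;
    return 1;
#else
    out->eax = out->ebx = out->ecx = out->edx = 0;
    return 0;
#endif
}

/* MMX support is reported in bit 23 of EDX for CPUID leaf 1. */
static int cpu_has_mmx(void)
{
    struct cpuid_regs r;
    if (!cpuid_query(0, &r) || r.eax < 1)
        return 0;               /* no CPUID here, or leaf 1 unavailable */
    cpuid_query(1, &r);
    return (int)((r.edx >> 23) & 1u);
}
```

A caller would check cpu_has_mmx() once at startup and only then select an MMX routine. Leaving EBX or ECX out of the operand lists is the kind of mistake that can pass at -O0 and crash at -O3, once the optimiser starts keeping values in those registers.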
I did a minor optimisation that might remove a data-dependency stall
between each round - it appears to be slightly faster on the Athlon,
but no change on the Pentium-MMX. I also updated the "MMX Compact"
routine to use optimised assembly in the right places. The latter is
still a little slower than the MMX Standard routine on both my PCs, so
I've moved it out of the "preferred" position in the list.
These updates are attached.
> 3515196 AMD64/x86 MMX Standard 1x2-pipe
> 1432117 AMD64/x86 MMX Compact 1x2-pipe *
> The P4 likes the mmx assembler!
Indeed it does. It's still not quite as fast per-clock as my Athlon,
but it obviously does respond well to code optimisation.
I think what we're seeing here is the Athlon having a preponderance of
decode and scheduling logic, when compared to its back-end execution
resources. This is a tradeoff that makes a lot of sense, given that
the x86 ISA has an unusually small amount of instruction-level
parallelism available, as a direct result of the tiny register file.
The Athlon-64 keeps a similar decode engine (though expanded for the
AMD64 support), but optimises the execution units to be more efficient.
The P4 has higher back-end throughput from its deep pipelines and high
clock, but it pays for the high clock with a far less flexible decode
engine. That's why the P4 works so well with SIMD instructions, which
need few instructions (taking up less decode bandwidth) for the amount
of work they do. Based on this insight, we can extrapolate
that if SSE2 has a similar range of integer instructions to MMX, we
might see as much as twice the performance over MMX - assuming it
doesn't run out of back-end resources first.
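
To make the "twice the performance" arithmetic concrete: an MMX register is 64 bits wide (two 32-bit lanes), while an SSE2 register is 128 bits wide (four lanes), so the same number of decoded instructions covers twice as many hash states. Below is a rough sketch, not code from hashcash, of the SHA-1 "choose" function for rounds 0-19 applied to four lanes with SSE2 intrinsics. The names sha1_ch and sha1_ch_x4 are made up for illustration, and a plain-C fallback is used when the compiler does not define __SSE2__.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* SHA-1 round function for rounds 0-19, one 32-bit lane. */
static uint32_t sha1_ch(uint32_t b, uint32_t c, uint32_t d)
{
    return (b & c) | (~b & d);
}

/* The same function over four independent lanes at once. */
static void sha1_ch_x4(const uint32_t b[4], const uint32_t c[4],
                       const uint32_t d[4], uint32_t out[4])
{
#if defined(__SSE2__)
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    __m128i vd = _mm_loadu_si128((const __m128i *)d);
    /* _mm_andnot_si128(x, y) computes (~x) & y */
    __m128i r = _mm_or_si128(_mm_and_si128(vb, vc),
                             _mm_andnot_si128(vb, vd));
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Portable fallback: same result, one lane at a time. */
    for (int i = 0; i < 4; i++)
        out[i] = sha1_ch(b[i], c[i], d[i]);
#endif
}
```

Three logical instructions handle four lanes here, where MMX needs the same three for two. Whether that turns into a real 2x speedup depends on how the back end executes 128-bit operations, which is exactly the caveat above.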
The G4 also has this kind of property, since the RISC design allows an
exceptionally simple decode unit to have high throughput, but most of
its execution resources are dedicated to the Altivec engine, and are
completely unavailable to scalar programs. Both the P4 and Athlon use
the normal scalar execution units for most SIMD work, so they have less
peak throughput per clock, but have more available to scalar programs.
I think the G5 is a bit more balanced than the G4. It has a heavier
decode engine, despite the RISC background, which allows it to
effectively address more scalar units. The Altivec engine in the G5 is
relatively weak, though it still offers a throughput advantage over the
scalar units. However, I don't have one to hand, so I can't tell you
how fast it does hashcash. =)
> btw I was thinking it would be useful to have a selection of hardware
> with linux shell accounts for people who are working on this.
Don't OSDL, or people like them, have a facility for that?
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: chromi@xxxxxxxxxxxxxxxxxxxxx
website: http://www.chromatix.uklinux.net/
tagline: The key to knowledge is not to rely on people to teach you it.