[hashcash] Re: 300Mhz amd k6-2
- From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
- To: hashcash@xxxxxxxxxxxxx
- Date: Sun, 6 Jun 2004 01:24:37 +0100
so 429635 for a 300Mhz cpu vs 3741983 for a 3.06Ghz cpu is a factor of
8.7 slower but a factor of 10.2 difference in clock rate. AMD k6-2 is
doing not bad for a 2 generations older 1/10th clock rate cpu.
Yup, I think the K6 architecture is decoder-heavy like the Athlon is.
However, the K6 is really only one generation older than the P4 - it's
a contemporary of the so-called "P6" architecture, aka the Pentium-Pro,
Pentium-II and Pentium-III. We tend to forget this, as the Athlon/P4
generation lasted so long.
It's actually almost amusing - of all the CPU architectures I've read
the documentation on this week, only the Intel machines seem to be
decoder-starved. Everything else is capable of saturating the integer
and load-store units using ordinary instructions, sometimes several
times over. And that includes the old 68040, another CISC design.
I've had occasion to coax a few of my older Macs back to life - and
believe me, it did take some coaxing, for various reasons:
PowerBook 5300ce (117MHz PowerPC 603e): 145000
(I had to splice together a new power cable for this one. I don't
think it had been used in about 4 years beforehand.)
PowerMac 8100/80 (80MHz PowerPC 601): 70000
(The G3 upgrade had a funny connection problem, so I had to remove it,
rather than just disabling it, to get these numbers.)
Both of these older PowerPCs have only one integer unit for "simple"
operations like those used in cryptography. The 601 is also hampered
by using the same integer unit for loads and stores as well. The 603e
also has a very limited number of simple-integer operations that can
also be executed by the System Register Unit.
I don't have a 604e to test on, but it has two simple-integer units, a
separate load-store unit, and dynamic branch prediction - and it can
issue up to four instructions per clock. This should place it in the
same general category as the G3 (see below) for per-clock efficiency on
this type of code.
Quadra 840AV (80MHz 68040): 11700
(One of the internal hard drives seemed to be suffering from
"stiction". Strangely, disconnecting the CD-ROM drive seemed to be the
tonic it needed. Ah, the days of SCSI voodoo...)
The 68040 (a 486-era and -class CPU) has a very small L1 cache (4K+4K
Harvard), and would benefit from re-rolling the loops in the SHA-1
code, unlike all the newer machines! Reading the manuals very
carefully, it appears that the '040 should be able to run nearly
(within perhaps 50% or better) as fast as the 601, provided the code
fits in the cache.
I'll need to spend some time on this, as I consider it fairly important
to avoid excluding 68K Mac owners. Watch out for cores labelled
"Ultra-Compact" in future updates. I also have an ancient 68030 Mac to
test on, though I really do expect that one to be a dog - but I also
know there are a lot of people out there who really cannot afford
anything better.
The above numbers also include implementation of the work-reduction
techniques Malcolm suggested, for final-block lengths of exactly 32 and
52 - these correspond to counting in words 7 and 12 respectively. The
benchmark itself uses position 52, which is the optimal position.
Actual performance boosts seem close to the 15% predicted.
Unfortunately, the most efficient implementation also causes a certain
amount of duplication and spaghetti-ness in the code. Not all of the
cores are updated yet, as it's taking quite some time to do so - I
haven't done the MMX routines yet, for example. I'll only transmit
an update when this is finished. However, here are some numbers from
the newer PowerPCs as well:
PowerBook G4 (667MHz PowerPC 7450): 3000000 (was 2500000)
PowerBook G3 (400MHz PowerPC 750): 700000
Finally, here's a sample of the new detailed output at the end of the
benchmark run:
Best minter: PowerPC Altivec Standard 1x4-pipe (2938959 hashes/sec)
Projected average times to mint:
8 bits: 0.000 seconds (87.1 microseconds)
10 bits: 0.000 seconds (348.4 microseconds)
16 bits: 0.022 seconds
20 bits: 0.357 seconds
22 bits: 1.427 seconds
24 bits: 5.709 seconds
26 bits: 22.834 seconds
28 bits: 91.337 seconds
30 bits: 365.348 seconds (6.1 minutes)
It will also provide "hours" and "days" estimates for particularly slow
machines if necessary - the "days" showed up when testing the 68K port
under emulation on the 8100. The list of bit-counts it provides
statistics for is easily adjusted by checking an array, though the code
presently only supports up to 30 bits for this purpose.
The cores, however, support minting up to 64 bits, and can, if you're
really that sadistic, be used for even higher counts by checking and
retrying. They do require retrying to handle more than 2^32 iterations
anyway. The provided calling routine will automatically handle
retrying, although it may eventually run out of stack space if it is
forced to recurse too far.
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: chromi@xxxxxxxxxxxxxxxxxxxxx
website: http://www.chromatix.uklinux.net/
tagline: The key to knowledge is not to rely on people to teach you it.
Other related posts: