[hashcash] Re: 300Mhz amd k6-2

  • From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Sun, 6 Jun 2004 01:24:37 +0100

so 429635 for a 300Mhz cpu vs 3741983 for a 3.06Ghz cpu is a factor of
8.7 slower but a factor of 10.2 difference in clock rate.  AMD k6-2 is
doing not bad for a 2 generations older 1/10th clock rate cpu.

Yup, I think the K6 architecture is decoder-heavy like the Athlon is. However, the K6 is really only one generation older than the P4 - it's a contemporary of the so-called "P6" architecture, aka the Pentium-Pro, Pentium-II and Pentium-III. We tend to forget this, as the Athlon/P4 generation lasted so long.

It's actually almost amusing - of all the CPU architectures I've read the documentation on this week, only the Intel machines seem to be decoder-starved. Everything else is capable of saturating the integer and load-store units using ordinary instructions, sometimes several times over. And that includes the old 68040, another CISC design.

I've had occasion to coax a few of my older Macs back to life - and believe me, it did take some coaxing, for various reasons:

PowerBook 5300ce (117MHz PowerPC 603e): 145000
(I had to splice together a new power cable for this one. I don't think it had been used in about 4 years beforehand.)

PowerMac 8100/80 (80MHz PowerPC 601): 70000
(The G3 upgrade had a funny connection problem, so I had to remove it, rather than just disabling it, to get these numbers.)

Both of these older PowerPCs have only one integer unit for "simple" operations like those used in cryptography. The 601 is also hampered by using the same integer unit for loads and stores as well. The 603e also has a very limited number of simple-integer operations that can also be executed by the System Register Unit.

I don't have a 604e to test on, but it has two simple-integer units, a separate load-store unit, and dynamic branch prediction - and it can issue up to four instructions per clock. This should place it in the same general category as the G3 (see below) for per-clock efficiency on this type of code.

Quadra 840AV (80MHz 68040): 11700
(One of the internal hard drives seemed to be suffering from "stiction". Strangely, disconnecting the CD-ROM drive seemed to be the tonic it needed. Ah, the days of SCSI voodoo...)

The 68040 (a 486-era and -class CPU) has a very small L1 cache (4K+4K Harvard), and would benefit from re-rolling the loops in the SHA-1 code, unlike all the newer machines! Reading the manuals very carefully, it appears that the '040 should be able to run nearly (within perhaps 50% or better) as fast as the 601, provided the code fits in the cache.

I'll need to spend some time on this, as I consider it fairly important to avoid excluding 68K Mac owners. Watch out for cores labelled "Ultra-Compact" in future updates. I also have an ancient 68030 Mac to test on, though I really do expect that one to be a dog - but I also know there are a lot of people out there who really cannot afford anything better.

The above numbers also include implementation of the work-reduction techniques Malcolm suggested, for final-block lengths of exactly 32 and 52 - these correspond to counting in words 7 and 12 respectively. The benchmark itself uses position 52, which is the optimal position. Actual performance boosts seem close to the 15% predicted.

Unfortunately, the most efficient implementation also causes a certain amount of duplication and spaghetti-ness in the code. Not all of the cores are updated yet, as it's taking quite some time to do so - I haven't done the MMX routines yet, for example. I'll only transmit an update when this is finished. However, here are some numbers from the newer PowerPCs as well:

PowerBook G4 (667MHz PowerPC 7450): 3000000 (was 2500000)
PowerBook G3 (400MHz PowerPC 750): 700000

Finally, here's a sample of the new detailed output at the end of the benchmark run:

Best minter: PowerPC Altivec Standard 1x4-pipe (2938959 hashes/sec)
Projected average times to mint:
  8 bits:     0.000 seconds (87.1 microseconds)
 10 bits:     0.000 seconds (348.4 microseconds)
 16 bits:     0.022 seconds
 20 bits:     0.357 seconds
 22 bits:     1.427 seconds
 24 bits:     5.709 seconds
 26 bits:    22.834 seconds
 28 bits:    91.337 seconds
 30 bits:   365.348 seconds (6.1 minutes)

It will also provide "hours" and "days" estimates for particularly slow machines if necessary - the "days" showed up when testing the 68K port under emulation on the 8100. The list of bit-counts it provides statistics for is easily adjusted by checking an array, though the code presently only supports up to 30 bits for this purpose.

The cores, however, support minting up to 64 bits, and can, if you're really that sadistic, be used for even higher counts by checking and retrying. They do require retrying to handle more than 2^32 iterations anyway. The provided calling routine will automatically handle retrying, although it may eventually run out of stack space if it is forced to recurse too far.

from:     Jonathan "Chromatix" Morton
mail:     chromi@xxxxxxxxxxxxxxxxxxxxx
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.

Other related posts: