[hashcash] Re: Speed problem with 1.03 on Mac G4

  • From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Mon, 16 Aug 2004 02:25:58 +0100

The delay problem is a combination of me mis-using Jonathan's fastlib
hashcash_benchtest function to time each implementation which runs on
the platform, to test empirically which is best. (To correctly
allocate blame: Jonathan recommended against doing this and instead
using the static table -- how do I do this btw? -- I gather it would
be a better place-holder than calling hashcash_benchtest as is in pr3.)

Just call the initialisation function - it guesstimates the "best" function to use immediately after populating the core table. The benchmark overrides this if it is used. BTW, it sounds like you're calling the benchmark at least twice - surely it would be better to cache the result?
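
To illustrate the caching suggestion, here is a minimal sketch in C. The names run_expensive_benchmark, best_core_cached and benchmark_call_count are hypothetical stand-ins, not part of the hashcash library; the stub simply pretends that core 2 won.

```c
/* Hypothetical stand-in for an expensive benchmark that returns the
 * index of the fastest core. It counts its own calls so that the
 * effect of caching is visible. Not part of the real library. */
static int benchmark_calls = 0;

static int run_expensive_benchmark(void)
{
    benchmark_calls++;
    return 2;                       /* pretend core 2 won */
}

/* Return the best core, running the benchmark at most once per process. */
int best_core_cached(void)
{
    static int cached_core = -1;    /* -1 means "not measured yet" */

    if (cached_core < 0)
        cached_core = run_expensive_benchmark();
    return cached_core;
}

/* How many times the expensive benchmark has actually run. */
int benchmark_call_count(void)
{
    return benchmark_calls;
}
```

Calling best_core_cached() any number of times runs the stub benchmark only once. A real version would probably want to persist the result between runs as well, which this sketch does not attempt.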


The other related problem is the old code assumes a function
hashcash_per_sec() can be called which measures hashcash computations
per second.

Also I think it would help if we used something similar to the old
approach, where you work on hashcash in a loop _until_ the timer changes
(i.e. measure the minimum timer resolution worth of work).

If it would help, I could whip up a relatively fast function to measure the speed of the currently-selected core. This would let you keep the present statistics while avoiding the huge startup penalty.
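
As a rough idea of what such a function might look like, here is a sketch in C that combines it with the "work until the timer changes" approach. It is not the library's code: do_one_unit is a hypothetical placeholder for one attempt on the currently-selected core, and the batch size of 256 is an arbitrary assumption.

```c
#include <time.h>

/* Hypothetical unit of work standing in for one hash attempt. */
static volatile unsigned long sink;

static void do_one_unit(void)
{
    sink = sink * 1103515245UL + 12345UL;
}

/* Estimate units of work per second of process time.
 * Spins until clock() ticks so that the measurement starts on a tick
 * edge, then counts work until at least min_ticks have elapsed.
 * Returns -1.0 if clock() is unavailable. */
double units_per_sec(clock_t min_ticks)
{
    clock_t start, now;
    unsigned long count = 0;
    int i;

    if (min_ticks < 1)
        min_ticks = 1;

    start = clock();
    if (start == (clock_t)-1)
        return -1.0;

    while ((now = clock()) == start)
        ;                           /* wait for the next tick edge */
    start = now;

    do {
        for (i = 0; i < 256; i++)   /* batch to amortise clock() overhead */
            do_one_unit();
        count += 256;
        now = clock();
    } while (now - start < min_ticks);

    return (double)count * CLOCKS_PER_SEC / (double)(now - start);
}
```

Passing min_ticks = 1 measures a single timer tick of work; passing CLOCKS_PER_SEC / 10 measures roughly 1/10 sec of process time.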


However this is elapsed / wall-clock time rather than process time,
so it is inaccurate on loaded systems. So we switched to clock();
however on Linux I think I am finding this timer is lower
resolution.

But not so low resolution that it requires many seconds to get an even approximately accurate result. Most platforms have a clock() resolution of 1/60 (Classic Mac) or 1/100 (Linux/x86) sec, sometimes better like 1/1000 sec on Linux/Alpha. It should normally be easy to get a reliable result within 1/10 sec.
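
If you want to check the granularity on a given platform rather than take my figures on trust, something like the following sketch should do. The function name clock_step_seconds is made up for illustration; it only uses the standard clock() and CLOCKS_PER_SEC.

```c
#include <time.h>

/* Return the size of one observed step of clock(), in seconds,
 * or -1.0 if clock() is unavailable. Spins to a tick edge first so
 * that a whole step is measured rather than a partial one. */
double clock_step_seconds(void)
{
    clock_t a, b;

    a = clock();
    if (a == (clock_t)-1)
        return -1.0;

    while ((b = clock()) == a)
        ;                   /* align to a tick edge */
    a = b;
    while ((b = clock()) == a)
        ;                   /* wait for exactly one more step */

    return (double)(b - a) / CLOCKS_PER_SEC;
}
```

This busy-waits, which is fine here because clock() counts process time and the spinning itself is what makes it advance.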


The core-selection benchmark takes a long time because it has to benchmark every core, not just one, and it has to use a work quantity that will take long enough to be accurate even on a very fast computer.

We need to do something about the compiler flags you mention,
e.g. make targets for different platforms, perhaps auto-detecting the
platform from environment variables accessed from make?  Or minimally,
documentation.  (I think I stomped on Jonathan's AMD or Altivec
options with P4 options... tut tut).

Here are some useful sets of compiler options:

X86 (generic, but won't work on a 386 or 486 until I fix the MMX detection code)
CFLAGS = -O3 -funroll-loops -march=i386 -mcpu=pentium -mmmx


X86 (temporary generic, disables MMX to make it work on old chips)
CFLAGS = -O3 -funroll-loops -march=i386 -mcpu=i486

Despite the huge variety of X86 chips, the above two should be sufficient to obtain near-optimal performance on any of them. The MMX cores are almost entirely custom assembler, so the compiler can't optimise them (much) further. Unfortunately, this also means the MMX cores are only available to GCC (and derivative) compilers, as the Microsoft and Borland versions use a completely incompatible assembly format.

PowerPC (for MacOS X, which requires a G3 or newer)
CFLAGS = -O3 -funroll-loops -fno-inline -mcpu=750 -faltivec

PowerPC (for Linux, which can use a 604e - will also work on earlier desktop PPCs)
CFLAGS = -O3 -funroll-loops -fno-inline -mcpu=604e -maltivec -mabi=altivec


For any other architecture, including AMD64 (which includes MMX as standard)
CFLAGS = -O3 -funroll-loops -m{arch|cpu}=<whatever>


The PPC Altivec code does require either -fno-inline or dropping back to -O2 - this is because of some heavy custom assembler in one of the Altivec cores, which the compiler (incorrectly) tries to inline when possible. Suggested solutions that don't involve compiler flags are welcome.
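
One candidate that avoids the flag is GCC's per-function noinline attribute (available in GCC 3.1 and later). The sketch below shows the idea only; I have not tested whether it cures this particular Altivec problem, and example_core is a hypothetical placeholder rather than one of the real cores.

```c
/* NOINLINE expands to GCC's noinline attribute where available,
 * and to nothing on other compilers. */
#if defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

/* Hypothetical stand-in for a core containing heavy custom assembler.
 * The attribute asks the compiler not to inline this one function,
 * leaving inlining enabled for the rest of the file. */
NOINLINE unsigned long example_core(unsigned long x)
{
    return x * 2654435761UL + 1UL;
}
```

Marking just the affected function would leave -O3 inlining in place everywhere else, which is the point of doing it this way instead of using -fno-inline globally.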

Unlike MMX, one of the Altivec cores will compile under CodeWarrior (or any other Motorola-compliant compiler) while still obtaining near-optimal performance. The other two Altivec cores use GCC-specific assembler. All of the "scalar" cores are presently implemented entirely in ANSI C, so should be available to all compilers and platforms.
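
The compiler dependence described above can be expressed at compile time with predefined macros. The following sketch assumes GCC's __MMX__ and __ALTIVEC__ macros (defined when -mmmx and -maltivec are in effect); the function name and the three-way split are made up for illustration, and the CodeWarrior-compatible Altivec core is deliberately left out because it would need its own check.

```c
/* Report which family of cores this build could include.
 * "altivec" and "mmx" are offered only under GCC-compatible compilers,
 * since those cores rely on GCC-style assembler; everything else
 * falls back to the ANSI C scalar cores. */
const char *core_family_available(void)
{
#if defined(__GNUC__) && defined(__ALTIVEC__)
    return "altivec";
#elif defined(__GNUC__) && defined(__MMX__)
    return "mmx";
#else
    return "scalar";
#endif
}
```

Note that this only says what was compiled in. Whether the CPU actually running the program supports MMX or Altivec still has to be detected at run time.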

btw, what speed is your PowerPC? It's impressively fast... a little
faster than a 3.06GHz P4.

The original poster mentioned he had a 1GHz iBook, which I believe uses a 7457 or 7455 G4 chip. The performance results are in the right ballpark, compared to my 667MHz 7450 which obtains over 3 million. I'd like to see results from a G5. :)


--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@xxxxxxxxxxxxxxxxxxxxx
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.

