[hashcash] Re: Speed problem with 1.03 on Mac G4
- From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
- To: hashcash@xxxxxxxxxxxxx
- Date: Mon, 16 Aug 2004 02:25:58 +0100
The delay problem is a combination of me mis-using Jonathan's fastlib
hashcash_benchtest function to time every implementation that runs on
the platform, to test empirically which is best. (To allocate blame
correctly: Jonathan recommended against doing this and suggested using
the static table instead -- how do I do this btw? -- I gather it would
be a better place-holder than calling hashcash_benchtest as is in pr3.)
Just call the initialisation function - it guesstimates the "best"
function to use immediately after populating the core table. The
benchmark overrides this if it is used. BTW, it sounds like you're
calling the benchmark at least twice - surely it would be better to
cache the result?
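
A minimal sketch of caching the result, so the slow benchmark runs at
most once per process. The names run_expensive_benchmark and
best_core_cached are hypothetical stand-ins, not part of the hashcash
library, and the returned core index is made up for illustration.

```c
/* Counts how often the slow path runs (for illustration only). */
static int benchmark_runs = 0;

/* Hypothetical placeholder for an expensive benchmark that returns
 * the index of the fastest core. */
static int run_expensive_benchmark(void)
{
    benchmark_runs++;
    return 1;                   /* pretend core 1 won */
}

/* Returns the chosen core index, benchmarking only on the first call. */
int best_core_cached(void)
{
    static int cached = -1;     /* -1 means "not measured yet" */

    if (cached < 0)
        cached = run_expensive_benchmark();
    return cached;
}
```

Every caller then goes through best_core_cached() instead of invoking
the benchmark directly.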
The other related problem is the old code assumes a function
hashcash_per_sec() can be called which measures hashcash computations
per second.
Also I think it would help if we used something similar to the old
approach, where you work on hashcash in a loop _until_ the timer changes
(ie measure one minimum timer resolution's worth of work).
If it would help, I could whip up a relatively fast function to measure
the speed of the currently-selected core. This would let you keep the
present statistics while avoiding the huge startup penalty.
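
The "loop until the timer changes" idea could look roughly like the
sketch below. It is not the library's code: do_work_unit is a
hypothetical stand-in for one hashcash computation on the selected
core, and work_units_per_sec is an invented name.

```c
#include <time.h>

/* Hypothetical stand-in for one unit of hashcash work. */
static volatile unsigned long sink;
static void do_work_unit(void)
{
    sink = sink * 1664525UL + 1013904223UL;
}

/* Estimate work units per second of process time by counting the work
 * done between two consecutive changes of clock().
 * Returns 0.0 if processor time is unavailable. */
double work_units_per_sec(void)
{
    clock_t start, now;
    unsigned long count = 0;

    start = clock();
    if (start == (clock_t)-1)
        return 0.0;

    /* Spin to a tick boundary so a whole tick is measured. */
    while ((now = clock()) == start)
        ;
    start = now;

    /* Work until the timer changes again. */
    do {
        do_work_unit();
        count++;
    } while ((now = clock()) == start);

    return (double)count * CLOCKS_PER_SEC / (double)(now - start);
}
```

Calling clock() after every unit adds overhead of its own, so a real
version would probably do a batch of work between timer checks.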
However this measured elapsed / wall-clock time rather than process
time, so it was inaccurate on loaded systems. We switched to clock(),
but on linux I think I am finding that this timer has lower
resolution.
But not so low resolution that it requires many seconds to get an even
approximately accurate result. Most platforms have a clock()
resolution of 1/60 (Classic Mac) or 1/100 (Linux/x86) sec, sometimes
better like 1/1000 sec on Linux/Alpha. It should normally be easy to
get a reliable result within 1/10 sec.
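
To see what resolution clock() has on a given platform, something
like the following sketch can be used. clock_step_seconds is an
invented name; the function reports one observed step of clock(), which
is only an estimate of the timer's granularity.

```c
#include <time.h>

/* Returns one observed step of clock(), in seconds,
 * or -1.0 if processor time is unavailable. */
double clock_step_seconds(void)
{
    clock_t a, b;

    a = clock();
    if (a == (clock_t)-1)
        return -1.0;

    while ((b = clock()) == a)      /* spin to the first change */
        ;
    a = b;
    while ((b = clock()) == a)      /* spin to the next change */
        ;

    return (double)(b - a) / CLOCKS_PER_SEC;
}
```

With a step of 1/100 sec, a 1/10 sec measurement spans ten timer
ticks; with 1/1000 sec it spans a hundred.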
The core-selection benchmark takes a long time because it has to
benchmark every core, not just one, and it has to use a work quantity
that will take long enough to be accurate even on a very fast computer.
We need to do something about the compiler flags you mention,
eg. make targets for different platforms, perhaps auto-detecting the
platform from environment variables accessed from make? Or, minimally,
documentation. (I think I stomped on Jonathan's amd or altivec
options with p4 options... tut tut).
Here are some useful sets of compiler options:
X86 (generic, but won't work on 386/486 until I fix the MMX detection
code)
CFLAGS = -O3 -funroll-loops -march=i386 -mcpu=pentium -mmmx
X86 (temporary generic, disables MMX to make it work on old chips)
CFLAGS = -O3 -funroll-loops -march=i386 -mcpu=i486
Despite the huge variety of X86 chips, the above two should be
sufficient to obtain near-optimal performance on any of them. The MMX
cores are almost entirely custom assembler, so the compiler can't
optimise them (much) further. Unfortunately, this also means the MMX
cores are only available to GCC (and derivative) compilers, as the
Microsoft and Borland versions use a completely incompatible assembly
format.
PowerPC (for MacOS X, which requires a G3 or newer)
CFLAGS = -O3 -funroll-loops -fno-inline -mcpu=750 -faltivec
PowerPC (for Linux, which can use a 604e - will also work on earlier
desktop PPCs)
CFLAGS = -O3 -funroll-loops -fno-inline -mcpu=604e -maltivec -mabi=altivec
For any other architecture, including AMD64 (which includes MMX as
standard)
CFLAGS = -O3 -funroll-loops -m{arch|cpu}=<whatever>
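
Whichever option set is used, the source can check at compile time
which vector extension the flags enabled. The sketch below assumes the
__ALTIVEC__ and __MMX__ macros that GCC predefines when the respective
extension is switched on; other compilers may not define them.
simd_target is an invented name, not a library function.

```c
/* Returns a short label for the vector extension enabled at compile
 * time, based on GCC's predefined macros (assumed; see above). */
const char *simd_target(void)
{
#if defined(__ALTIVEC__)
    return "altivec";
#elif defined(__MMX__)
    return "mmx";
#else
    return "scalar";
#endif
}
```

This only reports what the compiler was told to target. It says
nothing about whether the CPU the binary runs on has the extension,
which still needs run-time detection.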
The PPC Altivec code does require either -fno-inline or dropping back
to -O2 - this is because of some heavy custom assembler in one of the
Altivec cores, which the compiler (incorrectly) tries to inline when
possible. Suggested solutions that don't involve compiler flags are
welcome.
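
One possibility that avoids the flag, sketched here and untested
against the Altivec core in question, is GCC's noinline function
attribute applied to just the affected function. HC_NOINLINE and
asm_heavy_core are hypothetical names, and the function body is a
placeholder for the real assembler.

```c
/* Expands to GCC's noinline attribute where available, else nothing. */
#if defined(__GNUC__)
#define HC_NOINLINE __attribute__((noinline))
#else
#define HC_NOINLINE
#endif

/* Hypothetical stand-in for a core containing custom assembler. */
HC_NOINLINE unsigned long asm_heavy_core(unsigned long x)
{
    return x * 2654435761UL;
}

/* The call below stays a real call even at -O3 under GCC. */
unsigned long call_core(unsigned long x)
{
    return asm_heavy_core(x) + 1UL;
}
```

That would leave inlining enabled for the rest of the file at -O3.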
Unlike MMX, one of the Altivec cores will compile under CodeWarrior (or
any other Motorola-compliant compiler) while still obtaining
near-optimal performance. The other two Altivec cores use GCC-specific
assembler. All of the "scalar" cores are presently implemented
entirely in ANSI C, so should be available to all compilers and
platforms.
btw, what speed is your PowerPC? It's impressively fast: a little
faster than a 3.06GHz P4.
The original mail mentioned he had a 1GHz iBook, which I believe uses a
7457 or 7455 G4 chip. The performance results are in the right
ballpark, compared to my 667MHz 7450 which obtains over 3 million. I'd
like to see results from a G5. :)
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: chromi@xxxxxxxxxxxxxxxxxxxxx
website: http://www.chromatix.uklinux.net/
tagline: The key to knowledge is not to rely on people to teach you it.