[hashcash] Re: Speed problem with 1.03 on Mac G4

  • From: Adam Back <adam@xxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Mon, 16 Aug 2004 15:43:12 -0400

On Mon, Aug 16, 2004 at 02:25:58AM +0100, Jonathan Morton wrote:
> Just call the initialisation function - it guesstimates the "best" 
> function to use immediately after populating the core table.  

OK done.

> BTW, it sounds like you're calling the benchmark at least twice -

Yes, the code is supposed to use hc_per_second, which calls
hashcash_per_second and caches the result so the work is only done
once.  A call directly to hashcash_per_second crept in; fixed.
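The wrapper amounts to something like this (a minimal sketch; the
static cache variable is illustrative, not necessarily what is in the
tree):

long hc_per_second( void )
{
    static long cached = 0;     /* illustrative: cache across calls */
    /* only run the (expensive) benchmark the first time through */
    if ( cached == 0 ) { cached = hashcash_per_second(); }
    return cached;
}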

> >Also I think it would help if we used similar to the old approach
> >where you work on hashcash in a loop _until_ the timer changes (ie
> >measure the minimum timer resolution worth of work).
> 
> If it would help, I could whip up a relatively fast function to measure 
> the speed of the currently-selected core.  This would let you keep the 
> present statistics while avoiding the huge startup penalty.

Yes, that would help a lot; to use the current method I need a way to
do a known quantity of work.  I have put up pr4 (it still has
problems, though fewer of them, but it is up for discussion purposes):

        http://www.hashcash.org/source/current/hashcash-1.03/

long hashcash_per_sec( void )
{
    clock_t t1, t2, t1c, t2c, tmp;
    double elapsed;
    unsigned long n_collisions = 0;
    char token[ MAX_TOK+1 ];
    word32 step = 100;

#if defined( CHROMATIX )
    if ( !hashcash_init ) { hashcash_select_minter(); hashcash_init = 1; }
#endif

    /* wait for start of tick */

    t2 = clock();
    do {
        t1 = clock();
    } while ( t1 == t2 );
    t1c = t1;

    /* do computations for next tick */

    do {
        n_collisions += step;
        find_collision( "000101", "flame", 40, token, step, "", "", "" );
        t2c = t2 = clock();
        if ( t2c < t1c ) { tmp = t2c; t2c = t1c; t1c = tmp; }
    } while ( (t2c - t1c) == 0 );

    if ( t2 < t1 ) { tmp = t2; t2 = t1; t1 = tmp; }

    fprintf( stderr, "collisions = %lu\n", n_collisions );

    /* see how many seconds the tick took */
    elapsed = (double)( t2 - t1 ) / CLOCKS_PER_SEC;
    return (long)( 1.0 / elapsed * (double)n_collisions + 0.499999999 );
}

So this code is broken in that it always times the legacy minter (not
libfastmint).  Under #if defined( CHROMATIX ) it should instead call
something that does a known chunk of work with the selected fast core.

(Note: the actual hashcash_mint() function calls hashcash_fastmint()
when CHROMATIX is defined).
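What I have in mind is something like this inside the timing loop (a
sketch only; fastmint_work_chunk() is a hypothetical placeholder for
whatever libfastmint call does a fixed number of tests with the
currently selected core):

    do {
        n_collisions += step;
#if defined( CHROMATIX )
        /* hypothetical: run `step' tests with the selected fast core */
        fastmint_work_chunk( step );
#else
        find_collision( "000101", "flame", 40, token, step, "", "", "" );
#endif
        t2c = t2 = clock();
        if ( t2c < t1c ) { tmp = t2c; t2c = t1c; t1c = tmp; }
    } while ( (t2c - t1c) == 0 );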

> >However this elapsed / wall-clock time
> >rather than process time so inaccurate on loaded systems, so we
> >switched to clock() however on linux I think I am finding this timer
> >is lower resolution.
> 
> But not so low resolution that it requires many seconds to get an even 
> approximately accurate result.  Most platforms have a clock() 
> resolution of 1/60 (Classic Mac) or 1/100 (Linux/x86) sec, sometimes 
> better like 1/1000 sec on Linux/Alpha.  It should normally be easy to 
> get a reliable result within 1/10 sec.

So in the above code I did work for one clock resolution interval.

The output is like this:

collisions = 4400
speed: 440000 collision tests per second
collisions = 5800
speed: 580000 collision tests per second
collisions = 7400
speed: 740000 collision tests per second
collisions = 7200
speed: 720000 collision tests per second
collisions = 5000
speed: 500000 collision tests per second
collisions = 1400
speed: 140000 collision tests per second
collisions = 7300
speed: 730000 collision tests per second
collisions = 4400
speed: 440000 collision tests per second
collisions = 7300
speed: 730000 collision tests per second
collisions = 8100
speed: 810000 collision tests per second

Accuracy is +/- 50%!  (As you can see, the resolution is 1/100 sec.)

I intentionally busy-wait for a timer tick before starting, so that
the measurement begins at the start of an interval.

So I think the only way to explain this variance is that the counter
is not only reported in 1/100th-second units but is also massively
erratic, and yet this is supposed to be CPU work accounting on a
machine with a clock rate 30 million times faster than that.  The
work inside the loop is purely deterministic; all that is left is
process scheduling (which I would have thought is discounted from CPU
work accounting) and processor cache effects, which I do not think
can account for a 50% speed difference over such a large chunk of CPU
cycles.

Weird.

I also tried timing for 1/10 sec instead of the minimum timer
interval, i.e. instead of this line:

    } while ( (t2c - t1c) == 0 );

using:

    } while ( (t2c - t1c) < CLOCKS_PER_SEC/10 );

and there the output is:

collisions = 74500
speed: 745000 collision tests per second
collisions = 73600
speed: 736000 collision tests per second
collisions = 66800
speed: 668000 collision tests per second
collisions = 66700
speed: 667000 collision tests per second
collisions = 67400
speed: 674000 collision tests per second
collisions = 74800
speed: 748000 collision tests per second
collisions = 74400
speed: 744000 collision tests per second
collisions = 70700
speed: 707000 collision tests per second
collisions = 70200
speed: 702000 collision tests per second
collisions = 70200
speed: 702000 collision tests per second

ie still pretty variable.

Actually the wall-clock time code was also pretty variable:

repeat 10 ./hashcash -s
speed: 393701 collision tests per second
speed: 555556 collision tests per second
speed: 366300 collision tests per second
speed: 378788 collision tests per second
speed: 529101 collision tests per second
speed: 555556 collision tests per second
speed: 375940 collision tests per second
speed: 170940 collision tests per second
speed: 510204 collision tests per second
speed: 534759 collision tests per second

(from hashcash-0.32)

The gettimeofday() call seems to give a wall-clock resolution of 5
usec or maybe less (measured with a tight loop calling gettimeofday()
until the timestamps differ).
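Something like this tight loop, for reference (a self-contained
sketch, not the exact code I used):

#include <stdio.h>
#include <sys/time.h>

int main( void )
{
    struct timeval a, b;
    /* spin until the wall-clock timestamp changes, then print the step */
    gettimeofday( &a, NULL );
    do { gettimeofday( &b, NULL ); }
    while ( a.tv_sec == b.tv_sec && a.tv_usec == b.tv_usec );
    printf( "gettimeofday step: %ld usec\n",
            (long)( b.tv_sec - a.tv_sec ) * 1000000L
            + (long)( b.tv_usec - a.tv_usec ) );
    return 0;
}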

> The core-selection benchmark takes a long time because it has to 
> benchmark every core, not just one, and it has to use a work quantity 
> that will take long enough to be accurate even on a very fast computer.

Yes, this is why I use the idiom of doing small chunks of work
repeatedly until the next interval starts -- the work chunk just
needs to reasonably outweigh the function-call overhead (which is
negligible compared to even 100 hashcash tests, I would think), and
the rest copes with arbitrary processor speeds.

At this point I am thinking /proc/cpuinfo and the Pentium tick
counter!  This is crazy.
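For the record, the tick-counter part would be something like this on
x86 (a sketch only; GCC-style inline asm assumed, and obviously not
portable, which is part of why it is crazy):

#include <stdio.h>

/* read the x86 time-stamp counter (GCC inline asm assumed) */
static unsigned long long rdtsc( void )
{
    unsigned int lo, hi;
    __asm__ __volatile__ ( "rdtsc" : "=a" (lo), "=d" (hi) );
    return ( (unsigned long long)hi << 32 ) | lo;
}

int main( void )
{
    unsigned long long start = rdtsc();
    /* ... a fixed chunk of minting work would go here ... */
    printf( "cycles: %llu\n", rdtsc() - start );
    return 0;
}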

> Here are some useful sets of compiler options:

I also made some targets for these (by calling make recursively with
the chosen macros).

BTW, in enabling the static selection I seem to have re-exposed this
problem:

./hashcash -mb 16 -r foo
collisions = 63800
time estimate: 0 seconds (103 milli-seconds)
ERROR: requested 16 bits, reported 16 bits, got 1 bits using AMD64/x86 MMX 
Standard 1x2-pipe minter: "1:16:040816:foo::09bQn59OYq6RRjso:00000XtE"

Adam
