[hashcash] Statistics of hashcash collisions

  • From: Anonymous <cripto@xxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx, mail2news@xxxxxxxxx
  • Date: Tue, 25 May 2004 07:01:40 +0200 (CEST)

One of the surprising properties of hashcash type collisions is the
variation in how long it takes to find them.  A collision which takes
a nominal or expected time of 1 minute sometimes seems to happen in
just a few seconds, and other times drags on for several minutes.
Here is some information on the statistical nature of this variation.
(Note: these are not actually collisions, but rather outputs of the hash
function which are distinguished by having several bits in predefined
states, for example a certain number of leading zero bits.)

The time for hashcash collisions of a substantial number of bits,
10 or more, can be represented by a Poisson distribution.  This is a
distribution characteristic of a low-probability event which has many
independent attempts at success.  For such collision sizes, this is
exactly how hashcash works.  Each individual hash has a low probability
of success, but the program calculates many hashes until it finds a hit.

Call v the product of the probability of a success on each trial times
the number of trials.  For a 20 bit hashcash collision, v of 1 would
correspond to approximately 1 million trials.  Then the chance of having
exactly n hits in time v is:

        P_v(n) = v^n * exp(-v) / n!

Now, with hashcash we don't run past our first hit.  So for this analysis
I will calculate the probability of 1 - P_v(0) for various values of v.
This will tell us the probability that we will have seen a hit by that
value of v.  By taking the difference of successive values this will
tell us the probability of a hit in various intervals of v.

Without further ado, here is the table:

v range     Probability
-----------------------
  0  - 1/16      6%
1/16 - 1/8       6%
1/8  - 1/4      10%
1/4  - 1/2      17%
1/2  - 1        24%
  1  - 2        23%
  2  - 4        12%
  4  - 8         1.8%
  8 or over      0.03%

To use an example, suppose you have a computer and a collision size which
you calculate (based on the counts needed and the speed of the iteration
in counts/second) will take 1 minute on average to create a collision.
Then there is a 6% chance it will take less than 4 seconds; 6% for 4-7
seconds; 10% for 7-15 seconds; 17% for 15-30 seconds; 24% for 30-60
seconds; 23% for 1-2 minutes; 12% for 2-4 minutes, 1.8% for 4-8 minutes,
and roughly 0 for any time over that.

This was pretty surprising to me, as it seems like I often spend more
time waiting for a collision to happen in my tests.  Another peculiarity
is that if we run for the time corresponding to v=1, there is actually
a greater than 1/2 chance of having seen a hit.  The chance is 1 - 1/e
or 63%.  So the median time to a collision is somewhat less, about 69%
of the nominal time.  In the one minute nominal trial case (calculated
on the basis of trials/second), the median time would be 42 seconds.

One way to smooth these values would be instead of calculating, say,
a single 23-bit collision, calculate 8 20-bit collisions.  You'd have
to make the hashcash field bigger to hold 8 collisions rather than
just one.  But this would even out the compute time by averaging over
several attempts, making the time more predictable.  My calculations
indicate that with 8 collisions, there is a 5% chance of it taking half
as long as it "should", and a 1% chance of it taking twice as long,
compared to 39% and 14% respectively with a single collision.

Other related posts: