[hashcash] Re: Statistics of hashcash collisions

  • From: Adam Back <adam@xxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Tue, 25 May 2004 15:30:22 -0400

On Tue, May 25, 2004 at 07:01:40AM +0200, Anonymous wrote:
> Without further ado, here is the table:
> 
> v range     Probability
> -----------------------
>   0  - 1/16      6%
> 1/16 - 1/8       6%
> 1/8  - 1/4      10%
> 1/4  - 1/2      17%
> 1/2  - 1        24%
>   1  - 2        23%
>   2  - 4        12%
>   4  - 8         1.8%
>   8 or over      0.03%

This distribution is borne out by empirical tests also.
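
For anyone who wants to reproduce it, here is a rough simulation in
Python (my own sketch, not the hashcash code: a hashcash-style counter
search with SHA-1, with b kept small so it finishes quickly; the
bucket boundaries are the ones from the table):

# Simulate the normalised time v = tries / 2^b to find a b-bit
# collision and bucket it into the ranges from the table above.
import hashlib, os

def tries_for_collision(b, resource="foo"):
    # hashcash-style: hash resource:salt:counter until the top b bits are 0
    salt = os.urandom(8).hex()
    counter = 0
    while True:
        counter += 1
        s = "%s:%s:%d" % (resource, salt, counter)
        h = hashlib.sha1(s.encode()).digest()
        if int.from_bytes(h, 'big') >> (160 - b) == 0:
            return counter

def print_table(b=10, runs=3000):
    edges = [0, 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, float('inf')]
    counts = [0] * (len(edges) - 1)
    for _ in range(runs):
        v = tries_for_collision(b) / 2**b
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    for i, c in enumerate(counts):
        print("%7s - %-7s %5.1f%%" % (edges[i], edges[i + 1], 100.0 * c / runs))

print_table()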

> This was pretty surprising to me, as it seems like I often spend more
> time waiting for a collision to happen in my tests.  Another peculiarity
> is that if we run for the time corresponding to v=1, there is actually
> a greater than 1/2 chance of having seen a hit.  The chance is 1 - 1/e
> or 63%.  

Nevertheless the mean or expected time is actually the nominal.  I.e.
as given by the (discrete) geometric distribution where
P(n) = q^{n-1}.p, mean = 1/p, variance = q/(p^2) and p = 1/2^b.  So
mean = 2^b.  (Where b is the number of bits.)
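
A quick numeric check of those figures (plain arithmetic; b = 20 is
arbitrary):

# Geometric distribution with p = 1/2^b: mean = 1/p = 2^b tries,
# variance = q/p^2, and the chance of a hit within the nominal 2^b
# tries is 1 - q^(2^b), which tends to 1 - 1/e ~= 63%.
import math

b = 20
p = 1.0 / 2**b
q = 1.0 - p

print("mean tries    :", 1 / p)                    # = 2^b
print("stddev / mean :", math.sqrt(q / p**2) * p)  # ~ 1, i.e. high variance
print("P(hit by 2^b) :", 1 - q**(2**b), "vs 1 - 1/e =", 1 - 1 / math.e)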

> One way to smooth these values would be instead of calculating, say,
> a single 23-bit collision, calculate 8 20-bit collisions.

Correct.  This was proposed by Juels & Brainard.  Earlier I also
explored using 2-way and k-way partial collisions (such as those used
by Rivest and Shamir in MicroMint).  The MicroMint collision function
is to set z bits to zero, and put the outputs into buckets until you
find k of them that collide, constituting a k-way partial collision
(k inputs which give partially matching outputs).
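
A rough sketch of that bucket search (my paraphrase of the idea,
bucketing on the first z bits of the output; z and k here are
illustrative, and MicroMint's actual criterion differs in detail):

# k-way partial collision search: hash random inputs and bucket them
# by the first z output bits; a bucket that reaches k entries gives
# k inputs with partially matching outputs.
import hashlib, os
from collections import defaultdict

def k_way_collision(z=16, k=4):
    buckets = defaultdict(list)           # z-bit output prefix -> inputs
    while True:
        x = os.urandom(16)
        h = hashlib.sha1(x).digest()
        prefix = int.from_bytes(h, 'big') >> (160 - z)
        buckets[prefix].append(x)
        if len(buckets[prefix]) == k:
            return prefix, buckets[prefix]

prefix, inputs = k_way_collision()
print("%d-way collision on output prefix %04x" % (len(inputs), prefix))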

However there isn't much difference in variance between multiple
sub-puzzles and k-way (k-way is a little lower, but more complex to
code).  Also multiple sub-puzzles don't have any memory requirement,
whereas k-way does.
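
To put rough numbers on the sub-puzzle side of that (a quick
simulation using the exponential approximation to the geometric; the
k-way case is not simulated here), comparing a single 23-bit collision
against 8 x 20-bit sub-puzzles as in the quoted example:

# Same expected work either way (2^23 tries), but the relative spread
# drops by about 1/sqrt(8) with eight sub-puzzles.
import random, statistics

def collision_time(b):
    # exponential approximation to the geometric with mean 2^b
    return random.expovariate(1.0) * 2**b

runs = 20000
single = [collision_time(23) for _ in range(runs)]
multi = [sum(collision_time(20) for _ in range(8)) for _ in range(runs)]

for name, data in (("single 23-bit", single), ("8 x 20-bit", multi)):
    m = statistics.mean(data)
    print("%-14s mean %.3g   stddev/mean %.2f"
          % (name, m, statistics.stdev(data) / m))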


The downside, or trade-off, with either of these approaches is that
a) it makes the verification k times more expensive (for k sub-puzzles
or a k-way collision); and b) the stamps are a constant plus k times
larger.

Wrt verification time you can deter time-wasting (an attacker sending
a stamp with only one valid sub-puzzle, but making the verifier check
all of them) by making the sub-puzzles dependent in some way.  E.g. an
approach I was considering for the v1 format:

1:10,4:040525:foo::c82a215126afb2c2:16,970,119,23

(for 4 sub-puzzles each of size 10 bits)

where each collision is on the string including the previous ones, so
the 1st is on

1:10,4:040525:foo::c82a215126afb2c2:

2nd on:

1:10,4:040525:foo::c82a215126afb2c2:16

and so forth.  This means the verifier only has to check one invalid
sub-puzzle in the worst case.
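
In code the chained check might look something like this (my own
sketch, assuming SHA-1 and guessing at how the comma-separated
counters are folded into the hashed string; not the actual hashcash
implementation):

# Chained sub-puzzle verification: each counter must give a b-bit
# collision on the stamp prefix *including* all previous counters,
# so the verifier can stop at the first sub-puzzle that fails.
import hashlib

def leading_zero_bits(s):
    n = int.from_bytes(hashlib.sha1(s.encode()).digest(), 'big')
    return 160 - n.bit_length()

def verify_chained(prefix, counters, bits):
    s = prefix
    for i, c in enumerate(counters):
        attempt = (s if i == 0 else s + ",") + c
        if leading_zero_bits(attempt) < bits:
            return False      # at most one invalid sub-puzzle is hashed
        s = attempt           # the next collision covers this one too
    return True

# For the example stamp above the call would be (the counters there
# are only illustrative, so this particular stamp will not verify):
# verify_chained("1:10,4:040525:foo::c82a215126afb2c2:",
#                ["16", "970", "119", "23"], bits=10)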


Clearly it would be desirable if we could have a fixed cost (the
exact same computation on each try), or at least a
bounded-probabilistic cost (as provided by Juels & Brainard's
client-puzzles approach).

The problem with the client-puzzles approach is that it is
interactive.  (The scheme is to send a challenge with a known solution
and a fixed solution space to explore: H( x || 1234 ) = 5678, and the
user must find x such that the equality is true.  So worst case it
takes the user 2^|x| operations.)

Unfortunately it seems that most problems with self-evident solutions
(solutions that can be verified much more quickly than computed) and
fixed solution spaces are public key style operations.  So I give a
few examples of those in the paper, one being cube-root (which is the
non-interactive version of the square-root scheme Dwork and Naor
proposed in their original paper).  The downsides of these approaches
are:

- larger stamps (16kbit operations are needed to slow things down
  much)
- more complex code (bignum operations)
- (relatively) more expensive verification (raising a 16kbit number to
  the 3rd power costs more)
- scales badly over time (verification cost increases as computation
  cost increases; stamp size increases as computation cost increases)
- note there is still some small variance due to the addition-chain,
  sliding-window, etc. approaches that provide efficient modular
  exponentiation

However there is a range of computation sizes where cube-root
competes with multiple sub-puzzles on variance, size and verification
cost.
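
For concreteness, a minimal sketch of a cube-root style stamp (my own
toy construction along the lines described, not necessarily the exact
scheme in the paper; sympy is used only to find a suitable prime, and
the modulus here is far below the ~16kbit mentioned above):

# Cube-root fixed-cost stamp: minting extracts a cube root mod p (one
# full modular exponentiation, ~|p| squarings), verification just
# cubes the result.  Needs p = 2 (mod 3) so that gcd(3, p-1) = 1 and
# cube roots are unique.
import hashlib
from sympy import nextprime        # only used to find a prime

p = nextprime(2**1024)
while p % 3 != 2:
    p = nextprime(p)
d = pow(3, -1, p - 1)              # cube-root exponent (Python 3.8+)

def h(stamp):
    return int.from_bytes(hashlib.sha1(stamp.encode()).digest(), 'big') % p

def mint(stamp):
    return pow(h(stamp), d, p)     # expensive: ~1024-bit exponent

def verify(stamp, x):
    return pow(x, 3, p) == h(stamp)   # cheap: exponent is just 3

s = "1:040525:foo:"                # illustrative stamp string
print(verify(s, mint(s)))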


If you or anyone has any insights into achieving fixed cost, or even
bounded probabilistic cost non-interactively without resorting to
public key operations, let me know.  I've been looking at it
on-and-off and have not found anything.

Adam
