[hashcash] Re: Hashcash performance improvements

  • From: Jonathan Morton <chromi@xxxxxxxxxxxxxxxxxxxxx>
  • To: hashcash@xxxxxxxxxxxxx
  • Date: Sun, 30 May 2004 21:26:13 +0100

> In case hashcash ever finds its way into an SMTP extension, there is a limit
> of 512 characters on SMTP command lines although this can be extended by
> "SMTP extensions", which I assume means some sort of negotiation between
> client and server.

Well, there are the guidelines. In any case, a limit should be named:
standards normally have arbitrary limits written into them, since without
them no computer with finite memory could ever claim to be standards
compliant.

I have a feeling the 512 bytes of SMTP should be the guideline, minus some
reasonable amount for the command itself. That's over six 80-column lines, BTW.

OTOH, just on the off-chance that there's a really good reason why we need to put over half a kilobyte of data in a secured e-mail header (*cough*), perhaps we might word it as SHOULD be less than 500 bytes and MUST be less than 5000. The latter is to discourage attacks, the former is for interoperability.
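If that wording were adopted, the verifier's length check would be trivial. A minimal sketch, where the limit names, the return convention and the function itself are my own invention:

```c
#include <string.h>

/* Hypothetical limits matching the wording above:
 * SHOULD be less than 500 bytes, MUST be less than 5000. */
#define STAMP_SOFT_LIMIT  500
#define STAMP_HARD_LIMIT 5000

/* 0 = fine, 1 = over the SHOULD limit (interoperability warning),
 * 2 = over the MUST limit (reject outright, possible attack). */
static int check_stamp_length(const char *stamp)
{
    size_t len = strlen(stamp);
    if (len >= STAMP_HARD_LIMIT) return 2;
    if (len >= STAMP_SOFT_LIMIT) return 1;
    return 0;
}
```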

>> How about if the secondary counters are the same length as the primary?
>> Then they can be overlaid on the primary field, so there's less
>> padding involved - and leading zeroes could be omitted from the

> I don't get what you're saying here. How can the counters be the same
> length and also be optimally aligned within blocks? That is the sole reason
> for padding in the first place.

The above example could be recoded:


Also, you seem to be reading meaning into the leading zeros, which could
just as easily have been leading 'A's had I used base64 for the counter.

Perhaps I should manually decode it for you - it's a kind of shorthand recording only the differences between four similar stamps:


Each stamp in the group has its own validity, and these are added together like postage stamps - you can put three 10p stamps on an envelope to cover a 30p charge, at least in the UK. Because hashcash value is counted in bits, however, it only makes sense to attach power-of-two numbers of stamps to this particular envelope.

The "leading zeroes" is misleading - it doesn't matter what fills the space before the secondary counters, only that it's the same for each. The secondary counters simply overwrite the tail of the primary stamp.

> I'd like to rewind this whole discussion a bit though and point out that I
> don't see the need for the sub-puzzles. Is the only reason that it makes
> minting time more predictable by reducing the variance? I can't see much
> value in that, although maybe I will when I'm desperate to send an email
> quickly :-)

I think that's the primary goal, yes. However, it does open up a potential resource-consumption hole at the receiver that wasn't there before. Each sub-stamp has to be tracked as if it were a normal one, and you can specify a rather large number of stamps using a relatively small amount of data - thus weakening my earlier argument about the DoS risks.

We'd need to think carefully about how to limit this damage, and whether the benefits outweigh the additional complexity in the spec.

An obvious solution is to limit the number of microstamps allowed per macrostamp - this automatically puts a lower limit on the value of each microstamp. Processing of a macrostamp should stop as soon as an invalid microstamp is found, but valid microstamps should be kept so that the same stamp can't be reused in the same attack.

OK, I'm just sounding off here, it's not something I'm actually interested in pushing. Since we're already at the stage where all "respectable" processors take only a few seconds on average to produce a full-value stamp, is the variance really a problem?

But here's yet another perspective: the Apple Human Interface Guidelines. These state that, where possible, you should provide some indication of progress while performing a time-consuming task (that is, one likely to take more than 1 second). With a monolithic stamp, this is near impossible - the best you can do is a beachball/hourglass pointer and a barber-pole, along with a blithe estimate of the average-case time. But with microstamps, you can show how many you've done so far, and actually make a reasonable prediction of when it'll finish, which is much more helpful to the ordinary user.

>> The optimised minter then overwrites part of the pre-assembled SHA-1
>> block with the counter.

> It's the overwriting of random that I find distasteful. The sole purpose
> of the minter should be to append a counter that works. It doesn't even
> need to care about the format of the string it's been given to work on.

Mmmm... I think I disagree here. I consider the minter to be the "value add" part of the program. The random field and the counter work in concert to provide value.

If the minter only handled the counter, and it was presented with the same "random" field, it would produce exactly the same stamp - which would then be considered invalid and therefore worthless by the recipient.

Now, the counting part of the minter doesn't actually overwrite the random data. It overwrites a blank field, which is provided after the random data for that purpose. So don't worry, there's no "waste" of entropy going on!

>> Now, admittedly, padding with random data does potentially consume more
>> entropy from the system's RNG,

> Ah, this was one of my original points against, but I removed it as it
> wasn't very convincing. The random field can easily be padded with
> non-random data, and who's going to prove you didn't generate it randomly?

Entirely true - you could take 96 bits of data from /dev/urandom (making a round 16 characters of base64) and repeat it as much as necessary.

> I'll sum up my position by saying that the minting function's job is to
> append a counter to some string, without parsing it in any way, within the
> restrictions of the character set and total length limit (both of which are
> being discussed elsewhere). Tricks such as padding which can be applied to
> the counter string are fair game, modifying the string passed in is
> definitely bad form.

Ah, another misunderstanding I think. I'm only appending things to the string, not overwriting any of it. The random stuff goes first, of course.

Here's an example of input and output, using the benchmark harness:

static const unsigned int test_bits = 20;
static const char *test_string = "0:040404:foo@xxxxxxxx:0123456789abcdef";
static const int test_tail = 55;  /* must be less than 56 */

for(i = 0; i < num_minters; i++) {
	/* set up SHA-1 block */
	strncpy(block, test_string, SHA1_INPUT_BYTES);
	block[test_tail] = 0x80;
	memset(block+test_tail+1, 0, 59-test_tail);
	PUT_WORD(block+60, test_tail << 3);

	/* run minter, with clock running: spin until the clock ticks
	   over, so that timing starts on a tick boundary */
	end = clock();
	while((begin = clock()) == end)
		;
	got_bits = minters[i].func(test_bits, block, SHA1_IV, test_tail, 1 << 30);
	end = clock();
	elapsed = (end-begin) / (double) CLOCKS_PER_SEC;
	/* ... report results ... */
}

The above is actually simulating an input of "0:040404:foo@xxxxxxxx:" to the regular minting routine, except that it appends a fixed "random" string rather than a real one in order to provide a fair comparison between the cores. And the output?

  2621997 AMD64/x86 MMX Standard 1x2-pipe
    Solution:   0:040404:foo@xxxxxxxx:0123456789abcdef0000000000001HDvz
    Iterations: 21290621
    Time taken: 8.120

There is a slight difference in the real minting routine, which actually assumes it's going to be fed a v1 stamp and therefore puts a separator between the random and count fields. I still don't see the value in that separator, especially if the double-spend database is going to contain the hash rather than the random field. :)

>> Bear in mind that for a single-pipe routine running on a superscalar
>> processor, the XOR and shift calculations are essentially free, because
>> they can very easily be overlapped with the other half of each round

> A fair point. I haven't implemented any of this yet myself so I'm not
> familiar with what overlaps well. Still, eliminating operations is
> generally good as sometimes those "free" instructions can be used to get
> another pipeline in. And not every processor is superscalar, the vector
> unit on a G4 (7450) effectively isn't as it only does one simple integer op
> per cycle.

Perhaps, and some machines (grrr, MMX) don't have a rotate instruction, which makes the XOR stage about twice as expensive in both instructions and registers. I ran into that problem *after* writing that post. GCC decided to be braindead about it, too, which made it a real problem and not just an annoyance.

What I did find was that I could calculate any block of four W values with only one internal data dependency, which I was able to overlap using two-way parallelism; the first three of each block can be computed entirely independently of one another. I also found that W, K, S(5,A) and E can be summed together while waiting for the result of F(B,C,D), and these two processes seem to have similar latencies.

If I'd had more than 8 registers to work with, I'd have put the W computation in parallel with the sum and function as well, making a three-way parallel operation that should saturate most modern machines even without double-piping. Indeed, I plan to do exactly that with PowerPC scalar and Altivec optimisations, complete with a circular W buffer that lives entirely in core ... perhaps tomorrow. Who knows, my TiBook may yet regain the lead over Adam's P4.

from:     Jonathan "Chromatix" Morton
mail:     chromi@xxxxxxxxxxxxxxxxxxxxx
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.
