On 2004-08-10 14:52:32 [+0200], Alexander G. M. Smith wrote:
> Christian Packmann wrote on Mon, 09 Aug 2004 23:03:49 +0200:
>> Even for non-cacheable data and simple operations, SIMD processing (and
>> use of data prefetch instructions) can give more than decisive
>> advantages.

> Looks like somewhere between 2 and 3 times speedup for large data.

On my system with its slow RAM; P4s with fast RAM are a different breed,
and the same goes for Athlon64s. So on modern systems a speedup of 4
times seems more likely.

> Sure are lots of shift instructions in the C code - that's what MMX does
> do all in one operation.

Not quite: MMX can access the bytes as single operands and perform the
addition on all 8 values in a register at once; it has no need to shift
any values - this is a huge advantage. Additionally, it can do saturated
additions, i.e. all values >255 are automatically clipped to 255; in C
you need a separate step for that, clamping each sum with a compare
(masking with value & 0xff would just wrap the result, not clip it).

> I wonder if it would be faster or slower with byte pointers and math
> rather than shift operations to extract the bytes.

Good idea about the byte pointers - I just tested this, and while it
gives a marginal +2% improvement for RAM data, it's +30% for cached
data. You can't use plain byte arithmetic though, as x86 has no
saturated integer addition; any overflow would give garbage results.
But ADDs are usually heavily optimized nowadays and should execute in
1 cycle regardless of operand width.

> I'd also check the generated code to make sure *src was not being
> reloaded for every operation (copy it to a local variable first in that
> case) and compile with optimization.

The byte-pointer version uses a local variable, so this shouldn't be a
problem. And I had opt=full from the beginning. I'll clean up the
program a bit and upload a new version, hopefully by tomorrow.

> Anyway, it's nice to see those actual numbers!

A pleasure! :)

Bye,
Chris