Christian Packmann wrote on Mon, 09 Aug 2004 23:03:49 +0200:

> Even for non-cacheable data and simple operations, SIMD processing (and use
> of data prefetch instructions) can give more than decisive advantages.

Looks like somewhere between 2 and 3 times speedup for large data. There sure are a lot of shift instructions in the C code; that is the work MMX does all in one operation.

I wonder if it would be faster or slower with byte pointers and math rather than shift operations to extract the bytes. I'd also check the generated code to make sure *src is not being reloaded for every operation (copy it to a local variable first in that case), and compile with optimization.

Anyway, it's nice to see those actual numbers!

- Alex
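P.S. To make the byte-pointer idea concrete, here is a rough sketch of the two approaches. It is not the benchmark code from the thread: the operation (summing every byte of a buffer of 32-bit words) and the names sum_bytes_shift and sum_bytes_ptr are made up purely for illustration. The first variant also shows the "copy *src to a local" habit, which matters most when the loop stores through another pointer that the compiler has to assume might alias src.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example operation: sum every byte in a buffer of 32-bit words.
 * Neither function comes from the original benchmark. */

/* Variant 1: load each word once into a local, then pull the bytes out
 * with shifts and masks. */
uint32_t sum_bytes_shift(const uint32_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t w = src[i];      /* one load; w can stay in a register */
        sum += w & 0xFF;
        sum += (w >> 8) & 0xFF;
        sum += (w >> 16) & 0xFF;
        sum += w >> 24;
    }
    return sum;
}

/* Variant 2: walk the same memory through a byte pointer, so no shifts
 * are needed. Reading any object through unsigned char * is allowed in C.
 * The sum does not depend on byte order, so both variants agree on
 * little- and big-endian machines. */
uint32_t sum_bytes_ptr(const uint32_t *src, size_t n)
{
    const unsigned char *p = (const unsigned char *)src;
    uint32_t sum = 0;
    for (size_t i = 0; i < n * sizeof(uint32_t); i++)
        sum += p[i];
    return sum;
}
```

Which one wins probably depends on the compiler and CPU, so I would time both on the real data and look at the assembly (for example with gcc -O2 -S) rather than guess.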