Hi,
Christian Packmann wrote:
I've got a blur routine (3x3 matrix) for B_RGB32 bitmaps, which gives the following results on my Athlon XP 2100+ (1733MHz) with DDR266 memory:
Code         Bitmap 640x480, 1200 KB   Bitmap 100x100, 9.76 KB
             (MegaPixels/second)       (MegaPixels/second)
C integer    33                        35
MMX          80                        125
3DNow!       110                       134
Wow! I was sure there was at least a 2x difference. Thanks for these tests.
The MMX routine is faster by virtue of processing multiple values with one instruction. The 3DNow! routine adds data prefetching, so that the CPU preloads the next chunk of data while the current chunk is being processed. The C version could be improved slightly by using loop unrolling, which both the MMX and 3DNow! routines use; but this would give a 10-20% increase at best.
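To illustrate what "processing multiple values with one instruction" means, here is a minimal sketch in plain C. It is not the blur routine measured above and it uses no MMX at all; it only imitates the idea inside an ordinary 32-bit register by averaging all four 8-bit channels of two pixels at once. The function name avg_pixels_packed is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the routines discussed in this thread.
 * Averages two packed 32-bit pixels channel by channel (rounding down)
 * without unpacking them first. */
static uint32_t avg_pixels_packed(uint32_t a, uint32_t b)
{
    /* (a & b) holds the bits both pixels share. ((a ^ b) >> 1) is half of
     * the bits that differ. Clearing the lowest bit of every byte before
     * the shift keeps one channel from leaking into its neighbour. */
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}
```

A real MMX version would presumably do the same kind of thing on 64 bits at a time, with dedicated packed instructions instead of the masking trick.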
Similar speedups are likely for many bitmap operations which use alpha or blending. In some extreme cases the improvements might be far more spectacular, especially on the P4. The P4 design made many compromises in the integer engine in order to achieve high clock speeds - shifts and multiplies are very slow compared to other architectures (PIII, K7/8). This will hurt performance of integer code using these instructions; and especially in graphics processing you need shifts all the time to isolate and join color components. By using SIMD you can alleviate this problem, as the P4 delivers very good SIMD performance.
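As a rough picture of the shifts needed "to isolate and join color components", here is a small scalar C sketch. It assumes a B_RGB32 pixel read as a little-endian 32-bit value has blue in the lowest byte, then green, red, and the unused/alpha byte on top; that layout, the channels_t type and all function names are assumptions made for the example, not code from the Interface Kit.

```c
#include <stdint.h>

/* Hypothetical types and helpers for illustration only.
 * Assumed layout of a 32-bit pixel value: 0xAARRGGBB. */
typedef struct {
    uint8_t b, g, r, a;
} channels_t;

/* Isolate the four components: one shift and one mask per channel. */
static channels_t unpack_rgb32(uint32_t px)
{
    channels_t c;
    c.b = (uint8_t)(px & 0xFF);
    c.g = (uint8_t)((px >> 8) & 0xFF);
    c.r = (uint8_t)((px >> 16) & 0xFF);
    c.a = (uint8_t)((px >> 24) & 0xFF);
    return c;
}

/* Join the components again: three more shifts plus ORs. */
static uint32_t pack_rgb32(channels_t c)
{
    return (uint32_t)c.b
         | ((uint32_t)c.g << 8)
         | ((uint32_t)c.r << 16)
         | ((uint32_t)c.a << 24);
}

/* 50/50 blend of two pixels done the scalar way: unpack, average each
 * channel separately (rounding down), pack. */
static uint32_t blend_half_scalar(uint32_t p, uint32_t q)
{
    channels_t x = unpack_rgb32(p);
    channels_t y = unpack_rgb32(q);
    channels_t out;
    out.b = (uint8_t)((x.b + y.b) >> 1);
    out.g = (uint8_t)((x.g + y.g) >> 1);
    out.r = (uint8_t)((x.r + y.r) >> 1);
    out.a = (uint8_t)((x.a + y.a) >> 1);
    return pack_rgb32(out);
}
```

Even this trivial blend costs about a dozen shifts per output pixel, which is the kind of integer work described above as slow on the P4.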
I have a P4. I am curious about the results. :-P
I'm not really a SIMD pro, but I'll gladly help with whatever I know. And I already have a few suggestions about data alignment of bitmaps, which would help SIMD coders a lot in writing efficient code.
I guess we should move this to interfacekit@xxxxxxxxxxxxx?
Let's continue here, maybe others can help too.
Adi.