On 2004-08-08 19:15:44 [+0200], Alexander G. M. Smith wrote:
> Talking generally, in the future it won't be as impressive. CPU speeds
> go up faster than memory speeds.

They've been doing that since the days of the Amiga 1000. That hasn't
hindered SIMD processing from being one of the great successes in CPU
design.

> So if all your data (or batches of it)
> can fit in the L1 cache on the CPU (a few kilobytes) and needs intensive
> processing then it's great to use MMX. Otherwise you aren't saving quite
> as much time.

I've just written a benchmark to test that assertion. I chose B_OP_ADD
applied to two B_RGB32 bitmaps; that's one of the simplest graphics
operations you can perform. Results in megapixels/sec., values rounded,
as usual on my XP2100+@1733MHz, DDR266:

                                    800x600    100x100
                                    (1200KB)   (39KB)
  C integer                             72        111
  C integer, 4x loop unroll             73        137
  MMX, no loop unroll                   70         87
  3DNow!, 4x loop unroll,              185       2614
    PREFETCH/W
  SSE, 4x loop unroll,                 185       2157
    PREFETCHT0

Even for non-cacheable data and simple operations, SIMD processing (and
the use of data prefetch instructions) can give decisive advantages. The
speed advantage of SIMD merely moves from impressive to ridiculous when
working on cached data. Oh, and keep in mind that my machine has pretty
slow RAM; modern Athlons/P4s have 50% more RAM bandwidth, which will make
the SIMD advantage even more pronounced on non-cached data.

Of course these results apply only to bitmap processing, but I think it
should be obvious that SIMD has rather serious potential. I've put the
program with source up at <http://www.elenthara.de/BeOS/B_OP_ADD_Test.zip>,
if anybody wants to look at it (it's just a quick hack, don't expect
comments; if you have questions, contact me). I'd love benchmark results
from a P4, as I'm very curious how much it differs between SIMD and
integer code.
The program should auto-detect the supported SIMD sets and run only the
appropriate routines, but the CPU ID routine has never been tested on
PII/III/4s, so it might crash. Improved C routines would also be welcome;
I don't think they can be improved much, but maybe somebody knows a trick
or two.

> Then there's the fact that newer memory systems are good
> for sequential access, but horrible for reading data from random
> addresses. Again that affects how you write your code.

Obviously. If you're dealing with random data accesses, SIMD will be
useless, as you'll usually have dozens of cycles between the RAM accesses
to do work; C will be more than sufficient, you could use BASIC. But when
working on big chunks of contiguous data, SIMD will rule supreme.

> Whole books and
> courses are available on odd optimization tricks to work around those
> bottlenecks and other quirks.

Yeah, but this isn't really the topic. We're talking about useful
applications for SIMD code. And some of the instructions included in
3DNow! and SSE can obviously help a lot with bypassing RAM limitations.
How much this gains in real-world code is another matter; but that can
only be tested by implementing some test code and benching it.

> But one key thing is to measure your
> results with an accurate timer, otherwise it's just wishful thinking.

Of course. I use system_time() with processing runs of several seconds to
get my readings; while still not perfect, it delivers good estimates.
Currently I don't have the time to dig into using CPU hardware clocks and
performance monitoring, though I'd love to.

Bye, Chris