[openbeos] Re: app_server: MMX/SSE help wanted

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: openbeos@xxxxxxxxxxxxx
  • Date: Mon, 09 Aug 2004 23:03:49 +0200

On 2004-08-08 19:15:44 [+0200], Alexander G. M. Smith wrote:

> Talking generally, in the future it won't be as impressive.  CPU speeds 
> go up faster than memory speeds. 

They've been doing that since the days of the Amiga 1000. That hasn't 
kept SIMD processing from becoming one of the great successes in CPU 
design.

> So if all your data (or batches of it) 
> can fit in the L1 cache on the CPU (a few kilobytes) and needs intensive 
> processing then it's great to use MMX.  Otherwise you aren't saving quite 
> as much time.

I've just written a benchmark to test your assertion. I've chosen 
B_OP_ADD applied to two B_RGB32 bitmaps, which is one of the simplest 
graphics operations you can perform (a sketch of the plain C inner loop 
follows the table). Results are in megapixels/sec., values rounded, as 
usual on my XP2100+ @ 1733 MHz with DDR266:

                  800x600       100x100
                 (1200KB)        (39KB)
C integer           72            111

C integer w/        73            137
4x loop unroll

MMX,                70             87
no loop unroll

3DNow!             185           2614
4x loop unroll,
PREFETCH/W

SSE                185           2157
4x loop unroll,
PREFETCHT0
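
For reference, here is a minimal sketch of what the plain C integer 
routine boils down to. It's my own illustration, not the code from the 
archive; it assumes B_OP_ADD saturates each channel at 255, that both 
bitmaps share the same bytes_per_row, and that the name add_rgb32 is 
made up:

    /* Sketch of a plain C B_OP_ADD on two B_RGB32 bitmaps: add every
     * byte of src into dst (including the unused alpha/padding byte),
     * clamped at 255. Illustration only, not the benchmark's code. */
    #include <stdint.h>

    static void
    add_rgb32(uint8_t *dst, const uint8_t *src,
              int width, int height, int bytes_per_row)
    {
        for (int y = 0; y < height; y++) {
            uint8_t *d = dst + y * bytes_per_row;
            const uint8_t *s = src + y * bytes_per_row;
            for (int x = 0; x < width * 4; x++) {
                int sum = d[x] + s[x];          /* per-byte add ...   */
                d[x] = sum > 255 ? 255 : sum;   /* ... clamped to 255 */
            }
        }
    }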

Even for non-cacheable data and simple operations, SIMD processing (combined 
with the data prefetch instructions) can give a decisive advantage. The 
speed advantage of SIMD merely moves from impressive to ridiculous when 
working on cached data. Oh, and keep in mind that my machine has pretty 
slow RAM; modern Athlons/P4s have about 50% more RAM bandwidth, which will 
make the SIMD advantage even more pronounced on non-cached data.
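
To give an idea of the kind of loop the SSE column refers to, here is a 
rough sketch: MMX saturated byte adds plus PREFETCHT0, unrolled 4x. This 
is my own illustration, not the routine from the ZIP; the function name, 
prefetch distance and unroll factor are guesses, and it assumes a 
compiler with <mmintrin.h>/<xmmintrin.h> support:

    /* Sketch: B_OP_ADD inner loop with MMX paddusb (8 bytes = two
     * B_RGB32 pixels per register), 4x unrolled, PREFETCHT0 pulling
     * data in ahead of use. Requires MMX + SSE, a contiguous buffer
     * (no row padding) and a pixel count that is a multiple of 8. */
    #include <mmintrin.h>    /* __m64, _mm_adds_pu8, _mm_empty */
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    static void
    add_rgb32_mmx_prefetch(__m64 *dst, const __m64 *src, long pixels)
    {
        long n = pixels / 2;             /* two 32-bit pixels per __m64 */
        for (long i = 0; i < n; i += 4) {
            /* prefetch a cache line ahead (distance is a guess) */
            _mm_prefetch((const char *)(src + i) + 64, _MM_HINT_T0);
            _mm_prefetch((const char *)(dst + i) + 64, _MM_HINT_T0);

            dst[i + 0] = _mm_adds_pu8(dst[i + 0], src[i + 0]);
            dst[i + 1] = _mm_adds_pu8(dst[i + 1], src[i + 1]);
            dst[i + 2] = _mm_adds_pu8(dst[i + 2], src[i + 2]);
            dst[i + 3] = _mm_adds_pu8(dst[i + 3], src[i + 3]);
        }
        _mm_empty();                     /* leave MMX state (EMMS) */
    }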

Of course these results apply only to bitmap processing, but I think it 
should be obvious that SIMD has rather serious potential.

I've put the program with source up at 
<http://www.elenthara.de/BeOS/B_OP_ADD_Test.zip> if anybody wants to look 
at it (it's just a quick hack, don't expect comments; if you have 
questions, contact me). I'd love benchmark results from a P4, as I'm very 
curious how big the gap between SIMD and integer code is there. The program 
should auto-detect the supported SIMD instruction sets and run only the 
appropriate routines, but the CPUID routine has never been tested on 
PII/III/4s, so it might crash.
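
For the curious, the detection boils down to a couple of CPUID queries. 
A minimal sketch using GCC's <cpuid.h> helper; the feature-bit positions 
are the documented ones, but this is my own illustration and not the 
code in the ZIP:

    /* SIMD feature detection sketch (not the code from the ZIP).
     * CPUID leaf 1, EDX: bit 23 = MMX, bit 25 = SSE, bit 26 = SSE2.
     * Extended leaf 0x80000001, EDX: bit 31 = 3DNow!, bit 30 =
     * extended 3DNow! (AMD only). Assumes GCC/Clang with <cpuid.h>. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("MMX:  %d\n", (edx >> 23) & 1);
            printf("SSE:  %d\n", (edx >> 25) & 1);
            printf("SSE2: %d\n", (edx >> 26) & 1);
        }
        if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
            printf("3DNow!:     %d\n", (edx >> 31) & 1);
            printf("3DNow! ext: %d\n", (edx >> 30) & 1);
        }
        return 0;
    }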

Improved C routines would also be welcome; I don't think the C code can be 
improved much, but maybe somebody knows a trick or two (one candidate is 
sketched below).
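
One such candidate, offered purely as a sketch and untested against the 
benchmark: a SWAR-style saturating byte add that processes a whole 
B_RGB32 pixel per iteration in plain C (note it also saturates the 
unused alpha/padding byte):

    /* SWAR sketch: per-byte unsigned saturating add on a whole 32-bit
     * pixel at once, no SIMD needed. Illustrative and untested. */
    #include <stdint.h>

    static inline uint32_t
    add_sat_u8x4(uint32_t a, uint32_t b)
    {
        uint32_t low  = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);       /* low 7 bits, no cross-byte carry */
        uint32_t wrap = low ^ ((a ^ b) & 0x80808080u);               /* wrapping per-byte sum           */
        uint32_t ovf  = ((a & b) | ((a | b) & ~wrap)) & 0x80808080u; /* carry out of each byte          */
        return wrap | ((ovf >> 7) * 0xffu);                          /* force overflowed bytes to 0xff  */
    }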

>  Then there's the fact that newer memory systems are good 
> for sequential access, but horrible for reading data from random 
> addresses.  Again that affects how you write your code.

Obviously. If you're dealing with random data accesses, SIMD will be 
useless, as you'll usually have dozens of cycles between the RAM accesses 
to do your work; C will be more than sufficient, you could even use BASIC. 
But when working on big chunks of contiguous data, SIMD will rule supreme.

> Whole books and 
> courses are available on odd optimization tricks to work around those 
> bottlenecks and other quirks. 

Yeah, but this isn't really the topic. We're talking about useful 
applications for SIMD code. And some of the instructions included in 3DNow! 
and SSE (such as the prefetches and non-temporal stores) can obviously help 
a lot with working around RAM limitations. How much this gains in real-world 
code is another matter; but that can only be settled by implementing some 
test code and benchmarking it.
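
As one example of what I mean by working around RAM limitations: for 
operations that only write the destination (a plain blit or fill, say, 
rather than B_OP_ADD, which has to read it), SSE's MOVNTQ streaming 
store writes around the cache entirely. A sketch under those 
assumptions, my own illustration and not code from the archive:

    /* Sketch: copying B_RGB32 pixels with non-temporal (streaming)
     * stores so the destination never pollutes the cache. Requires
     * MMX + SSE, contiguous pixels, and an even pixel count. */
    #include <mmintrin.h>    /* __m64, _mm_empty */
    #include <xmmintrin.h>   /* _mm_stream_pi, _mm_sfence */

    static void
    copy_rgb32_nt(__m64 *dst, const __m64 *src, long pixels)
    {
        long n = pixels / 2;                    /* two pixels per __m64 */
        for (long i = 0; i < n; i++)
            _mm_stream_pi(dst + i, src[i]);     /* MOVNTQ: bypass the cache */
        _mm_sfence();                           /* order the streamed writes */
        _mm_empty();                            /* leave MMX state (EMMS) */
    }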

> But one key thing is to measure your 
> results with an accurate timer, otherwise it's just wishful thinking.

Of course. I use system_time() with processing runs of several seconds to 
get my readings; while still not perfect, it delivers good estimates. 
Currently I don't have the time to dig into using CPU hardware clocks and 
performance monitoring, though I'd love to.
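
For anyone who wants to reproduce the numbers, the measurement boils 
down to something like the following; system_time() is the BeOS kernel 
call returning microseconds since boot, while the routine name and the 
three-second run length are my own placeholders:

    /* Timing sketch: call the routine repeatedly for a few seconds and
     * derive megapixels/sec from system_time(). blend_add() stands in
     * for whatever routine is being measured. */
    #include <OS.h>
    #include <stdio.h>

    static void
    bench(void (*blend_add)(void), long pixels_per_call)
    {
        const bigtime_t target = 3000000;    /* run for ~3 seconds */
        bigtime_t start = system_time();
        bigtime_t elapsed = 0;
        long calls = 0;

        while (elapsed < target) {
            blend_add();
            calls++;
            elapsed = system_time() - start;
        }

        /* pixels per microsecond is numerically megapixels per second */
        double mpixels = (double)calls * pixels_per_call / elapsed;
        printf("%.1f MPixels/s\n", mpixels);
    }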

Bye,
Chris
