
[openbeos] Re: app_server: MMX/SSE help wanted
- From: Christian Packmann <Christian.Packmann@xxxxxx>
- To: openbeos@xxxxxxxxxxxxx
- Date: Mon, 09 Aug 2004 23:03:49 +0200
On 2004-08-08 19:15:44 [+0200], Alexander G. M. Smith wrote:
> Talking generally, in the future it won't be as impressive. CPU speeds
> go up faster than memory speeds.
They've been doing that since the days of the Amiga 1000. That hasn't
hindered SIMD processing from being one of the great successes in CPU
design.
> So if all your data (or batches of it)
> can fit in the L1 cache on the CPU (a few kilobytes) and needs intensive
> processing then it's great to use MMX. Otherwise you aren't saving quite
> as much time.
I've just written a benchmark to test your assertion. I chose B_OP_ADD
applied to two B_RGB32 bitmaps, which is one of the simplest graphics
operations you can perform. Results in megapixels/sec, values rounded,
as usual on my XP2100+@1733MHz, DDR266:
                                    800x600    100x100
                                    (1200KB)   (39KB)
C integer                                72        111
C integer w/ 4x loop unroll              73        137
MMX, no loop unroll                      70         87
3DNow!, 4x loop unroll, PREFETCH/W      185       2614
SSE, 4x loop unroll, PREFETCHT0         185       2157
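
For readers unfamiliar with the operation being measured, here is a minimal
sketch of what a plain C integer version might look like. It is not the code
from the benchmark; the function names are made up for illustration, and it
assumes that B_OP_ADD clamps each 8-bit channel at 255 and that all four bytes
of a B_RGB32 pixel are treated alike.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating add of one 8-bit channel: clamps at 255 instead of wrapping. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical helper: dst[i] = saturate(dst[i] + src[i]) for every byte
   of two 32-bit-per-pixel buffers holding pixel_count pixels each. */
static void add_bitmaps_scalar(uint8_t *dst, const uint8_t *src,
                               size_t pixel_count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixel_count * 4; i++)
        dst[i] = add_sat_u8(dst[i], src[i]);
}
```

The compare-and-clamp on every byte is the part that a packed saturating-add
instruction replaces with a single operation on many bytes at once.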
Even for non-cacheable data and simple operations, SIMD processing (and use
of data prefetch instructions) can give more than decisive advantages. The
speed advantage of SIMD merely moves from impressive to ridiculous when
working on cached data. Oh, and keep in mind that my machine has pretty
slow RAM; modern Athlons/P4s have 50% more RAM bandwidth, which will make
the SIMD advantage more pronounced on non-cached data.
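
To illustrate the combination of packed arithmetic and prefetching, here is a
rough sketch using compiler intrinsics. It is an assumption-laden stand-in,
not the benchmark's routine: it uses SSE2 (_mm_adds_epu8 on 128-bit registers)
rather than the MMX/3DNow!/SSE code measured above, the function name is
hypothetical, and the prefetch distance is a guess rather than a tuned value.
When SSE2 is not available at compile time it falls back to scalar code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_prefetch */
#endif

enum { PREFETCH_AHEAD = 256 };   /* bytes ahead; a guess, not tuned */

/* Hypothetical helper: per-byte saturating add of src into dst. */
static void add_bitmaps_simd(uint8_t *dst, const uint8_t *src,
                             size_t byte_count)
{
    size_t i = 0;
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    for (; i + 16 <= byte_count; i += 16) {
        /* Ask for data a few cache lines ahead, staying inside the buffers. */
        if (i + PREFETCH_AHEAD < byte_count) {
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD), _MM_HINT_T0);
            _mm_prefetch((const char *)(dst + i + PREFETCH_AHEAD), _MM_HINT_T0);
        }
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        /* 16 unsigned saturating byte adds in one instruction. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(a, b));
    }
#endif
    /* Scalar tail, or the whole buffer when SSE2 is unavailable. */
    for (; i < byte_count; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Issuing a prefetch on every 16-byte step is more often than necessary; a real
routine would likely unroll the loop and prefetch once per cache line.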
Of course these results apply only to bitmap processing, but I think it
should be obvious that SIMD has rather serious potential.
I've put the program with source up at
<http://www.elenthara.de/BeOS/B_OP_ADD_Test.zip>, if anybody wants to look
at it (it's just a quick hack, don't expect comments; if you have
questions, contact me). I'd love benchmark results from a P4, as I'm very
curious how much SIMD and integer code differ there. The program
should auto-detect the supported SIMD sets, and run only appropriate
routines; but the CPU ID routine has never been tested on PII/III/4s, so it
might crash.
Improved C routines would also be welcome; I don't think they can be improved
much, but maybe somebody knows a trick or two.
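
One candidate trick for plain C is to process all four channels of a pixel in
a single 32-bit register ("SIMD within a register"). The sketch below is only
an illustration with hypothetical function names; it assumes a per-byte
saturating add is the desired result, and whether it actually beats the
byte-wise loop would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating add of four packed 8-bit channels held in one 32-bit word. */
static uint32_t add_sat_u8x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of each byte; carries cannot cross byte borders. */
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    /* Fold in the top bits to get the wrapped per-byte sum. */
    uint32_t sum = low ^ ((a ^ b) & 0x80808080u);
    /* Carry out of bit 7 of each byte marks the channels that overflowed. */
    uint32_t carry = ((a & b) | ((a | b) & ~sum)) & 0x80808080u;
    /* Expand each carry bit to 0xff and force those channels to 255. */
    return sum | ((carry >> 7) * 0xffu);
}

/* Hypothetical helper: apply the packed add to whole pixel buffers. */
static void add_bitmaps_swar(uint32_t *dst, const uint32_t *src,
                             size_t pixel_count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixel_count; i++)
        dst[i] = add_sat_u8x4(dst[i], src[i]);
}
```

For example, adding 0xF0807F01 and 0x20800101 saturates the two upper
channels and leaves the two lower ones as plain sums, giving 0xFFFF8002.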
> Then there's the fact that newer memory systems are good
> for sequential access, but horrible for reading data from random
> addresses. Again that affects how you write your code.
Obviously. If you're dealing with random data accesses SIMD will be
useless, as you'll usually have dozens of cycles between the RAM accesses
to do work; C will be more than sufficient, you could use BASIC. But when
working on big chunks of contiguous data, SIMD will rule supreme.
> Whole books and
> courses are available on odd optimization tricks to work around those
> bottlenecks and other quirks.
Yeah, but this isn't really the topic. We're talking about useful
applications for SIMD code. And some of the instructions included in 3DNow!
and SSE can obviously help a lot in working around RAM limitations. How much this
will gain in real-world code is another matter; but this can only be tested
by implementing some test code, and benching it.
> But one key thing is to measure your
> results with an accurate timer, otherwise it's just wishful thinking.
Of course. I use system_time() with processing runs of several seconds to
get my readings; while still not perfect, it delivers good estimates.
Currently I don't have the time to dig into using CPU hardware clocks and
performance monitoring, though I'd love to.
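
The measurement approach described above can be sketched roughly as follows.
This is an illustration, not the benchmark's code: the function names and the
clock callback are hypothetical. On BeOS the callback could simply wrap
system_time(), which returns microseconds; since pixels per microsecond equals
megapixels per second, the conversion is a single division.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clock callback returning a timestamp in microseconds. */
typedef int64_t (*clock_us_fn)(void);

/* Pixels per microsecond is the same number as megapixels per second. */
static double megapixels_per_sec(uint64_t pixels_per_run, uint64_t runs,
                                 int64_t elapsed_us)
{
    assert(elapsed_us > 0);
    return (double)(pixels_per_run * runs) / (double)elapsed_us;
}

/* Repeat op until at least min_duration_us has passed, then report the
   throughput. Long runs average out timer granularity and scheduling noise. */
static double time_operation(void (*op)(void *), void *ctx,
                             uint64_t pixels_per_run,
                             int64_t min_duration_us, clock_us_fn now_us)
{
    uint64_t runs = 0;
    int64_t start, elapsed;
    assert(min_duration_us > 0);
    start = now_us();
    do {
        op(ctx);
        runs++;
        elapsed = now_us() - start;
    } while (elapsed < min_duration_us);
    return megapixels_per_sec(pixels_per_run, runs, elapsed);
}
```

Reading the clock after every run adds a little overhead of its own; for very
short operations it may be better to batch several runs per clock read.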
Bye,
Chris