[openbeos] Re: app_server: MMX/SSE help wanted

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: openbeos@xxxxxxxxxxxxx
  • Date: Tue, 10 Aug 2004 19:56:59 +0200

On 2004-08-10 00:44:38 [+0200], Adi Oanca wrote:

>     Here are your tests on a P4 2.6GHz HT:
 
> $ ./B_OP_ADD_Test 100 100 1
> Benchmarking C integer
> 101.05 MPixels/second
> 
> Benchmarking SSE, loop unrolling x4, PREFETCHT0
> 1600.00 MPixels/second
> 
> $ ./B_OP_ADD_Test 100 100 2
> Benchmarking C integer
> 109.09 MPixels/second
> 
> Benchmarking SSE, loop unrolling x4, PREFETCHT0
> 1476.92 MPixels/second
> =======================

Thanks for running the tests! One little note: you should run far more
iterations of the benchmarks, so that clock imprecision and random activity
from background programs don't influence the results. If you look at the
figures, you'll see a few discrepancies: in the 100x100 test, both the C
integer and the SSE results differ noticeably between the two runs.
Sorry for not mentioning this in my post; I slapped it together in a
hurry.

Good parameters for your machine should be 800 600 1000 and 100 100 50000;
that's what I used. You could also use 1024x768 for the big images; your
machine is so darn fast that it won't hurt.
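
To put numbers on why the iteration count matters, here is a minimal
sketch of the arithmetic, assuming the three parameters are width, height
and iterations. The function names and the bench() harness are made up for
illustration and are not taken from B_OP_ADD_Test:

```c
#include <assert.h>
#include <time.h>

/* Throughput in MPixels/second for `iterations` passes over a
   width x height image that took `seconds` in total. */
double mpixels_per_second(int width, int height, long iterations,
                          double seconds)
{
    assert(seconds > 0.0);
    return (double)width * height * iterations / seconds / 1e6;
}

/* Run time in seconds implied by a given throughput. */
double seconds_for(int width, int height, long iterations,
                   double mpixels_per_sec)
{
    assert(mpixels_per_sec > 0.0);
    return (double)width * height * iterations / (mpixels_per_sec * 1e6);
}

/* Hypothetical harness: time `iterations` calls of one full-image pass.
   Returns 0.0 if the run was too short for clock() to register. */
double bench(void (*pass)(void *), void *ctx,
             int width, int height, long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        pass(ctx);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return mpixels_per_second(width, height, iterations, seconds);
}
```

At 1600 MPixels/second, seconds_for(100, 100, 1, 1600.0) is only about
6.25 microseconds, so a single timer tick or a context switch can swamp
the measurement. With 50000 iterations the same run takes roughly 0.31
seconds, which is long enough for the noise to average out.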

But as most results seem to be coherent, this already gives us a good 
indication of what's possible.

> MMX performance a bit odd?

No, it's crippled intentionally, to show what happens when you use SIMD
the wrong way. SIMD is no automatic performance miracle; you still have to
take a lot of care in how you design and implement things.
The loop is totally unoptimized: the loop body has 4 MMX instructions doing
the actual work, but the loop control consists of 4 integer instructions.
This negates any possible performance gain from MMX, as the CPU is just as
busy checking for end-of-loop as it is moving data.

The integer loops are faster because they have between a few dozen and
100+ instructions in the loop body; the CPU can 'get up to speed' when
working through continuous code segments before it has to perform loop
control. The P4 seems to be optimized for this, as it gains more from the
unrolled integer loop than the Athlon does.
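
The ratio of work to loop control can be sketched in plain C. This is only
an illustration, not the code from the test program: it assumes B_OP_ADD
is a per-byte saturating add, and the function names are invented. A
modern compiler may well unroll the first loop by itself, so the
difference shows most clearly in hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two channel values, clamping at 255 (assumed B_OP_ADD behaviour). */
static inline uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* One byte per trip: the compare, increment and branch of the loop
   control are paid for every single byte. */
void add_rolled(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = add_sat_u8(dst[i], src[i]);
}

/* Four bytes per trip: the same loop control is spread over four times
   as much work. A scalar loop handles the last n % 4 bytes. */
void add_unrolled4(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = add_sat_u8(dst[i],     src[i]);
        dst[i + 1] = add_sat_u8(dst[i + 1], src[i + 1]);
        dst[i + 2] = add_sat_u8(dst[i + 2], src[i + 2]);
        dst[i + 3] = add_sat_u8(dst[i + 3], src[i + 3]);
    }
    for (; i < n; i++)
        dst[i] = add_sat_u8(dst[i], src[i]);
}
```

Both functions produce identical output; only the amount of loop
bookkeeping per byte differs.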

Both the 3DNow! and SSE loops have 16 instructions in the loop body; this
seems to be a good balance between short code and fast execution, and with
more loop unrolling there should be some further performance gains.
But of course the performance mainly depends on the use of PREFETCHT0,
which tells the CPU to preload memory into cache a few dozen cycles before
the program actually needs it, so the CPU rarely has to wait for data to
process.
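
For readers who prefer C to assembly, a 4x unrolled loop with prefetching
might look roughly like the sketch below. It is not the code from the test
program: it uses SSE2 compiler intrinsics (x86 only), assumes B_OP_ADD is
a per-byte saturating add, and the function name and the PREFETCH_AHEAD
distance are invented values that would need tuning per CPU.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_adds_epu8, ... */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: _mm_prefetch, _MM_HINT_T0 */

#define PREFETCH_AHEAD 256 /* bytes; hypothetical, tune per machine */

/* Saturating add of 16 bytes. Unaligned loads and stores are used so the
   sketch works for any pointer; aligned variants can be faster. */
static inline void add16(uint8_t *dst, const uint8_t *src)
{
    __m128i d = _mm_loadu_si128((const __m128i *)dst);
    __m128i s = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(d, s));
}

void add_sse2_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    /* Main loop: 64 bytes per trip, unrolled 4x. */
    for (; i + 64 <= n; i += 64) {
        /* Ask for data we will need a few trips from now (PREFETCHT0).
           The bounds check only keeps the pointers inside the buffers. */
        if (i + PREFETCH_AHEAD < n) {
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD),
                         _MM_HINT_T0);
            _mm_prefetch((const char *)(dst + i + PREFETCH_AHEAD),
                         _MM_HINT_T0);
        }
        add16(dst + i,      src + i);
        add16(dst + i + 16, src + i + 16);
        add16(dst + i + 32, src + i + 32);
        add16(dst + i + 48, src + i + 48);
    }
    /* Scalar tail for the remaining n % 64 bytes. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The prefetch is only a hint: it never changes the result, so a badly
chosen distance costs speed but not correctness.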

And about the results themselves... my first thought was that I had 
implemented serious bugs, but I'm afraid the program works correctly. :) 
With the P4 and fast memory, speed increases of 4x really seem doable, at
least in some cases. Maybe more: if you recall my comments on alignment
issues, it's /possible/ that the P4 was running into severe performance
problems in these tests.

We'll get back to this later, when I have more routines to do tests with. 
Is there any alpha-drawing code in CVS yet? A SIMD implementation of that 
would be interesting.

Bye,
Chris
