[openbeos] Re: app_server: MMX/SSE help wanted

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: openbeos@xxxxxxxxxxxxx
  • Date: Mon, 09 Aug 2004 22:47:44 +0200

On 2004-08-08 14:19:19 [+0000], Adi Oanca wrote:

>     What about SSE, SSE2, SSE3? what can you tell us?
>     Knowing they use 128bit registers, do they deliver a 4x performance 
> gain over the CPU?

In some cases, probably. I wouldn't count on that being the general case, 
though. But even if it's only 2-3x, that's still a serious speedup which 
can be had for 'free'.

> These have support for floating point instructions isn't it?

All SSE instruction sets and 3DNow! offer this. Probably very useful for 
doing OP_ALPHA stuff, and all other cases where you want to mix MMX and FP 
operations. I can't judge how useful that will turn out; that really 
depends on the particular tasks to be performed.
 
>> I'm not really a SIMD pro, but I'll gladly help with whatever I know. 
>> And I already have a few suggestions about data alignment of bitmaps, 
>> which would help SIMD coders a lot in writing efficient code.
 
> Good, let's hear them. Before that: do you want to write some code 
> for Haiku project?

Absolutely. To do SIMD coding for a good purpose would make me very happy.


About data alignment issues (short version):

Most CPUs prefer to perform reads/writes on natural alignment 
boundaries, i.e. a boundary of 2 bytes for a word (int16), 4 bytes for a 
double word (int32), 8 bytes for a quad word (abstract MMX datatype), etc.
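
To make the notion of a natural alignment boundary concrete, here is a
minimal sketch of an alignment check. The helper name is_aligned is
hypothetical and not part of any BeOS or Haiku API; it only assumes that
the alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of any BeOS/Haiku API): returns true if
// `p` lies on an `alignment`-byte boundary. `alignment` must be a power
// of two, e.g. 2 for int16, 4 for int32, 8 for MMX, 16 for SSE.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // For a power of two, the low bits of the address are the remainder.
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

An address that passes is_aligned(p, 16) also passes for 8, 4 and 2, so
aligning to the widest register size in use covers the narrower cases.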

When doing unaligned accesses, it'll take the CPU some extra cycles to 
perform the read/write operations. As these delays happen very often when 
reading lots of data, this will incur a significant slowdown.
AFAIK the worst case is a P4 doing an access across a 16- or 64-byte 
boundary; Intel's docs state a penalty of up to the pipeline depth. The P4's 
pipeline has a length of 20-30 stages (depending on P4 model), and if you 
run into a 20 or 30 cycle delay... that's very bad.
But even if you're 'only' losing a few cycles during each mem access, that 
can hurt performance quite badly.

The problem with current BeOS is that it doesn't provide any kind of 
control over alignment when allocating bitmaps; they'll only be aligned to 
4-byte boundaries. In order to get optimum memory throughput, I have to 
write special code which reads 32bit values until I reach a well-aligned 
address (8/16 bytes for MMX/SSE), then do full-width accesses until there's 
only a 'remainder' of data left which has to be read in 32bit chunks again. 
The resulting code is messy.
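
The head/body/tail pattern described above can be sketched roughly as
follows. This is only an illustration: the names Split, split_for_width
and invert_pixels are made up for this example, the pixel operation
(inverting the colour channels) is arbitrary, and the full-width step is
written as plain C++ where a real version would use MMX/SSE instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration, not taken from any BeOS or Haiku source.
struct Split {
    std::size_t head;  // pixels before the first well-aligned address
    std::size_t body;  // pixels covered by full-width blocks
    std::size_t tail;  // leftover pixels after the last full block
};

// Split `count` 32-bit pixels starting at `p` into head/body/tail for
// blocks of `width` bytes (8 for MMX, 16 for SSE). Assumes `p` is at
// least 4-byte aligned and `width` is a power of two >= 4.
inline Split split_for_width(const std::uint32_t* p, std::size_t count,
                             std::size_t width)
{
    assert(width >= 4 && (width & (width - 1)) == 0);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    assert(addr % 4 == 0);
    const std::size_t misalign = addr & (width - 1);  // bytes past boundary
    std::size_t head = misalign ? (width - misalign) / 4 : 0;
    if (head > count)
        head = count;
    const std::size_t per_block = width / 4;
    const std::size_t body = ((count - head) / per_block) * per_block;
    return Split{head, body, count - head - body};
}

// Invert the colour channels of 32-bit pixels in place.
inline void invert_pixels(std::uint32_t* p, std::size_t count)
{
    const Split s = split_for_width(p, count, 16);
    std::size_t i = 0;
    for (; i < s.head; ++i)          // 32-bit accesses up to the boundary
        p[i] ^= 0x00FFFFFFu;
    for (const std::size_t end = s.head + s.body; i < end; i += 4) {
        // p + i is 16-byte aligned here; a SIMD version would replace
        // this inner loop with one aligned 128-bit load, XOR and store.
        for (std::size_t k = 0; k < 4; ++k)
            p[i + k] ^= 0x00FFFFFFu;
    }
    for (; i < count; ++i)           // remainder in 32-bit chunks
        p[i] ^= 0x00FFFFFFu;
}
```

With aligned, padded rows the head and tail loops disappear entirely,
which is where most of the messiness comes from.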

So from the SIMD coder's perspective, it would be very good if Haiku 
offered some control over data alignment for bitmap allocations.
This includes not only the base address of the bitmap, but should ideally 
extend to each bitmap row, in cases where each row has to be processed 
separately (e.g. by blur routines:). 

Of course there'd be some 'waste', but I don't think this would matter too 
much on modern systems. Binary compatibility shouldn't be a problem either, 
as the BeBook already says that BytesPerRow() is decisive in determining a 
bitmap's actual size.
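
As a rough sketch of what aligned base addresses plus padded rows could
look like: the class AlignedBitmap below is a hypothetical stand-in and
not the BBitmap API, and its constructor parameters are assumptions made
for this example. It rounds the row length up to the requested alignment
and over-allocates so the base address can be aligned with std::align.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Round `n` up to the next multiple of `alignment` (a power of two).
inline std::size_t round_up(std::size_t n, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical stand-in for an aligned bitmap; this is NOT BBitmap.
class AlignedBitmap {
public:
    AlignedBitmap(std::size_t width, std::size_t height,
                  std::size_t bytesPerPixel, std::size_t alignment)
        : fBytesPerRow(round_up(width * bytesPerPixel, alignment)),
          fHeight(height),
          // Over-allocate by `alignment` bytes so an aligned base fits.
          fStorage(fBytesPerRow * height + alignment),
          fBits(nullptr)
    {
        void* p = fStorage.data();
        std::size_t space = fStorage.size();
        fBits = static_cast<unsigned char*>(
            std::align(alignment, fBytesPerRow * height, p, space));
        assert(fBits != nullptr);
    }

    // fBits points into fStorage, so copying is disabled in this sketch.
    AlignedBitmap(const AlignedBitmap&) = delete;
    AlignedBitmap& operator=(const AlignedBitmap&) = delete;

    std::size_t BytesPerRow() const { return fBytesPerRow; }
    std::size_t BitsLength() const { return fBytesPerRow * fHeight; }
    // Every row starts on an aligned address because both the base
    // address and BytesPerRow() are multiples of the alignment.
    unsigned char* Row(std::size_t y) { return fBits + y * fBytesPerRow; }

private:
    std::size_t fBytesPerRow;
    std::size_t fHeight;
    std::vector<unsigned char> fStorage;
    unsigned char* fBits;
};
```

For example, a 101-pixel-wide 32-bit bitmap with 16-byte alignment gets
416 bytes per row instead of 404, i.e. 12 bytes of 'waste' per row. Code
that already walks the bitmap via BytesPerRow() would not notice the
difference.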

This might be implemented by an additional constructor without too much 
fuss, I'd guess. It would be great if you could implement this, as this 
would make SIMD coding much easier and less bug-prone.

Bye,
Chris
