[openbeos] Re: app_server: MMX/SSE help wanted

  • From: Adi Oanca <e2joseph@xxxxxxxxxx>
  • To: openbeos@xxxxxxxxxxxxx
  • Date: Tue, 10 Aug 2004 19:03:46 +0300

Hi,

Christian Packmann wrote:
I'm not really a SIMD pro, but I'll gladly help with whatever I know. And I already have a few suggestions about data alignment of bitmaps, which would help SIMD coders a lot in writing efficient code.

Good, let's hear them. Before that: do you want to write some code for the Haiku project?

Absolutely. To do SIMD coding for a good purpose would make me very happy.

Good. Then we will go into deeper discussions soon. Thank you.


About data alignment issues (short version):

Most CPUs prefer to perform reads/writes on natural alignment boundaries, i.e. a boundary of 2 bytes for a word (int16), 4 bytes for a double word (int32), 8 bytes for a quad word (abstract MMX datatype), etc.
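
To make "natural alignment" concrete, here is a minimal sketch of an alignment check. The helper names IsAddressAligned and IsPointerAligned are made up for illustration and are not part of any BeOS or Haiku API; the alignment is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if `address` is a multiple of
// `alignment`. `alignment` must be a power of two (2, 4, 8, 16, ...).
inline bool IsAddressAligned(std::uintptr_t address, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // For a power of two, the low bits of an aligned address are all zero.
    return (address & (alignment - 1)) == 0;
}

// Hypothetical convenience wrapper for pointers.
inline bool IsPointerAligned(const void* pointer, std::size_t alignment)
{
    return IsAddressAligned(reinterpret_cast<std::uintptr_t>(pointer),
                            alignment);
}
```

For example, an address ending in 0x...C passes the 4-byte check (fine for int32) but fails the 8-byte check that an MMX quad word would want.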

When doing unaligned accesses, it'll take the CPU some extra cycles to perform the read/write operations. As these delays happen very often when reading lots of data, this will incur a significant slowdown.
AFAIK the worst case is a P4 doing an access across a 16- or 64-byte boundary; Intel's docs state a penalty of up to the pipeline depth. The P4's pipelines have a length of 20-30 stages (depending on P4 model), and if you run into a 20 or 30 cycle delay... that's very bad.
But even if you're 'only' losing a few cycles during each memory access, that can hurt performance quite badly.

OK, we have to avoid that.


The problem with current BeOS is that it doesn't provide any kind of control over alignment when allocating bitmaps; they'll only be aligned to 4-byte boundaries. In order to get optimum memory throughput, I have to write special code which reads 32-bit values until I reach a well-aligned address (8/16 bytes for MMX/SSE), then does full-width accesses until there's only a 'remainder' of data left, which has to be read in 32-bit chunks again. The resulting code is messy.
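
The head/body/tail bookkeeping described above might look roughly like the following sketch. It only computes how many 32-bit pixels fall into each phase; the actual MMX/SSE loop is left out. SplitCounts and SplitForAlignment are hypothetical names, and the start address is assumed to be 4-byte aligned, as stated for BeOS bitmaps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: how a run of 32-bit pixels splits up.
struct SplitCounts {
    std::size_t head;  // pixels read one by one until an aligned address
    std::size_t body;  // pixels covered by full-width (aligned) accesses
    std::size_t tail;  // leftover pixels read one by one at the end
};

// address:   start address of the pixel data (assumed 4-byte aligned)
// count:     number of 32-bit pixels to process
// alignment: 8 for MMX, 16 for SSE (must be a power of two >= 4)
inline SplitCounts SplitForAlignment(std::uintptr_t address,
                                     std::size_t count,
                                     std::size_t alignment)
{
    assert(address % 4 == 0);
    assert(alignment >= 4 && (alignment & (alignment - 1)) == 0);

    // Bytes past the previous aligned boundary.
    std::size_t misalignment
        = static_cast<std::size_t>(address & (alignment - 1));

    // Pixels needed to reach the next aligned boundary.
    std::size_t head = misalignment == 0 ? 0 : (alignment - misalignment) / 4;
    if (head > count)
        head = count;

    std::size_t pixelsPerAccess = alignment / 4;
    std::size_t remaining = count - head;
    std::size_t body = remaining - remaining % pixelsPerAccess;

    return { head, body, remaining - body };
}
```

With an SSE alignment of 16, a run of 10 pixels starting 4 bytes past a 16-byte boundary splits into 3 head pixels, 4 body pixels (one full-width access) and 3 tail pixels. If the start address and the row length were already aligned, head and tail would both be zero and the two scalar loops could be dropped.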

I imagine.


So from a SIMD coder's perspective, it would be very good if Haiku offered some control over data alignment for bitmap allocations.
This includes not only the base address of the bitmap, but should ideally extend to each bitmap row, for cases where each row has to be processed separately (e.g. by blur routines :).


Of course there'd be some 'waste', but I don't think this would matter too much on modern systems. Binary compatibility shouldn't be a problem either, as the BeBook already says that BytesPerRow() is decisive in determining a bitmap's actual size.
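
As a rough illustration of the 'waste' involved, the padded row length can be computed by rounding the raw row size up to the next multiple of the alignment. PaddedBytesPerRow is a hypothetical helper, not an existing BeOS or Haiku function, and the alignment is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rounds the raw row size up to a multiple of
// `alignment` so that every row starts on an aligned address, provided
// the bitmap's base address is aligned as well.
inline std::size_t PaddedBytesPerRow(std::size_t width,
                                     std::size_t bytesPerPixel,
                                     std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    std::size_t rowBytes = width * bytesPerPixel;
    // Standard power-of-two round-up.
    return (rowBytes + alignment - 1) & ~(alignment - 1);
}
```

A 101-pixel-wide 32-bit bitmap has 404 bytes of pixel data per row; padded for 16-byte alignment that becomes 416 bytes, i.e. 12 wasted bytes per row. Code that already steps through rows using BytesPerRow() instead of width * 4 would not notice the difference.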

This might be implemented by an additional constructor without too much fuss, I'd guess. It would be great if you could implement this, as this would make SIMD coding much easier and less bug-prone.
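
One possible way to get an aligned base address, sketched here with plain malloc() and over-allocation. AlignedPixelBuffer is a made-up stand-in for illustration only; it is not BBitmap and says nothing about how app_server would actually allocate bitmap memory. It assumes the row length passed in is already a multiple of the alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical buffer whose base address and every row start are aligned.
class AlignedPixelBuffer {
public:
    // bytesPerRow is assumed to be a multiple of `alignment` already;
    // `alignment` must be a power of two.
    AlignedPixelBuffer(std::size_t height, std::size_t bytesPerRow,
                       std::size_t alignment)
        : fRaw(nullptr), fBits(nullptr), fBytesPerRow(bytesPerRow)
    {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        assert(bytesPerRow % alignment == 0);

        // Over-allocate so an aligned address always fits in the block.
        fRaw = std::malloc(height * bytesPerRow + alignment - 1);
        if (fRaw != nullptr) {
            std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(fRaw);
            std::uintptr_t mask = static_cast<std::uintptr_t>(alignment - 1);
            fBits = reinterpret_cast<unsigned char*>((raw + mask) & ~mask);
        }
    }

    ~AlignedPixelBuffer() { std::free(fRaw); }

    AlignedPixelBuffer(const AlignedPixelBuffer&) = delete;
    AlignedPixelBuffer& operator=(const AlignedPixelBuffer&) = delete;

    bool IsValid() const { return fBits != nullptr; }
    std::size_t BytesPerRow() const { return fBytesPerRow; }

    // Start of row `y`; aligned because both base and stride are aligned.
    unsigned char* Row(std::size_t y) const
    {
        return fBits + y * fBytesPerRow;
    }

private:
    void* fRaw;             // pointer returned by malloc(), used for free()
    unsigned char* fBits;   // aligned start of the pixel data
    std::size_t fBytesPerRow;
};
```

With both the base address and the row length aligned, every row can be processed with full-width accesses from its first byte, and the special-case code for unaligned starts goes away.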

OK, you'll have that. Now, I have to see how. :-) Axel, can you help?


Adi.
