[openbeos] Re: app_server: MMX/SSE help wanted

Marcus Overhagen wrote:
> Christian Packmann <Christian.Packmann@xxxxxx> wrote:

>> need to do that in a separate step with masking (value & 0xff).
> That's wrong. Doing saturation with masking won't work. For example, the 
> value 256 (0x100) would be clipped to 0 this way, which gives the wrong
> result.
> 
> Thus you need to compare with <0 and >255, which creates ugly jump 
> instructions in the generated assembly code, and is slow. MMX doesn't 
> need it, and is faster.

<blush> I can't C anymore... I used color clipping in C before, and never 
used masking. MMX addiction, likely.
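To make the difference concrete, here is a minimal C sketch of the two approaches. The function names add_masked and add_saturated are made up for illustration and are not taken from B_OP_ADD_Test.

```c
#include <stdint.h>

/* Masking keeps only the low 8 bits, so an overflowing sum wraps
 * around instead of saturating. */
static uint8_t add_masked(int a, int b)
{
    return (uint8_t)((a + b) & 0xff);
}

/* Saturating add: clamp the sum into 0..255 with explicit compares.
 * These compares are what usually turn into conditional jumps. */
static uint8_t add_saturated(int a, int b)
{
    int sum = a + b;

    if (sum < 0)
        return 0;
    if (sum > 255)
        return 255;
    return (uint8_t)sum;
}
```

With a = 200 and b = 56 the sum is 256: the masked version wraps to 0, while the saturated version stays at 255, which is what a color add should produce.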

I've corrected the code in B_OP_ADD_Test, and now it's much faster yet; I 
guess because I use empty bitmaps which seem to be all 0s, the Athlon can 
fold all those branches away and not perform any clipping operations at 
all. I'll have to think of a way to prefill sensible data into the bitmaps.

Anyway, there's a (non-portable) solution for integer code: CMOV, which can 
perform conditional moves depending on the current condition flags. It was 
introduced on the PentiumPro to prevent branches for these kinds of simple 
operations.
If gcc could be made to use this for simple clipping tests etc., it would 
give significant improvements to such code. The current gcc doesn't seem to 
support it though; I haven't found any reference in the docs.
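As a rough sketch of what branch-free clipping can look like in portable C (function names are hypothetical): the first variant writes the clamp as two simple selects, which is the shape a compiler could map to CMOV, but whether it actually does depends on the compiler version and target flags and is an assumption here. The second variant avoids conditional moves altogether by building masks from the comparison results.

```c
#include <stdint.h>

/* Clamp written as two selects. A compiler targeting a CPU with CMOV
 * may emit conditional moves here; this is not guaranteed. */
static uint8_t clamp_u8_select(int v)
{
    v = (v < 0) ? 0 : v;
    v = (v > 255) ? 255 : v;
    return (uint8_t)v;
}

/* Clamp using only arithmetic and logic operations.
 * A comparison yields 0 or 1; negating it gives 0 or an all-ones mask. */
static uint8_t clamp_u8_bits(int32_t v)
{
    v &= -(int32_t)(v >= 0);   /* negative values become 0         */
    v |= -(int32_t)(v > 255);  /* values above 255 become all ones */
    return (uint8_t)(v & 0xff);
}
```

Both functions return the same result for any int input; which one is faster would have to be measured on the CPU in question.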

> I had to implement such saturation code when writing a color space 
> conversion from YCbCr420p(lanar) to RGB32 colorspace.
>
> [more on saturation snipped]

Thanks for the examples and explanations. It's funny that a lookup will be 
the fastest way of doing things on a modern CPU - do you test on a P4 or 
Athlon?
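For readers without the snipped part: a clamp by table lookup could look roughly like the sketch below. This is not Marcus's code; the table range (-256..511) and all names are assumptions chosen for illustration, and a real conversion routine would size the table to the range its intermediate values can actually reach.

```c
#include <stdint.h>

/* Assumed range of intermediate values before clamping. */
#define CLAMP_MIN (-256)
#define CLAMP_MAX 511

static uint8_t clamp_table[CLAMP_MAX - CLAMP_MIN + 1];

/* Fill the table once; the branches run here, not in the inner loop. */
static void clamp_table_init(void)
{
    for (int v = CLAMP_MIN; v <= CLAMP_MAX; v++) {
        int c = v < 0 ? 0 : (v > 255 ? 255 : v);
        clamp_table[v - CLAMP_MIN] = (uint8_t)c;
    }
}

/* Clamp by lookup. The caller must keep v within CLAMP_MIN..CLAMP_MAX. */
static uint8_t clamp_lookup(int v)
{
    return clamp_table[v - CLAMP_MIN];
}
```

The table is 768 bytes, so it should stay in the L1 cache during a conversion loop, which would be one plausible reason for the lookup to beat compare-and-branch code.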

If you've gone through so many pains with this, I guess the code is in a 
critical path? Maybe you should try an x86 version using inline CMOVs, this 
might give a boost.
And if performance is really critical I could take a look at these routines 
to see if SIMD versions would be possible. But actual implementation would 
be far away, I've glanced at the code and it sure isn't trivial.

Bye,
Chris
