[openbeos] Re: app_server: MMX/SSE help wanted

  • From: "Marcus Overhagen" <ml@xxxxxxxxxxxx>
  • To: <openbeos@xxxxxxxxxxxxx>
  • Date: Thu, 12 Aug 2004 09:45:24 +0200

Christian Packmann <Christian.Packmann@xxxxxx> wrote:

> > value 256 (0x100) would be clipped to 1 this way, which gives the wrong
> > result.
I would like to correct my previous mistake: 256 obviously gets clipped to 0,
while 257 (0x101) would be clipped to 1.
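
To make the difference concrete, here is a small C sketch contrasting masking
with 0xFF against real clamping. The function names are made up for
illustration and are not taken from the actual code:

```c
#include <stdint.h>

/* Masking keeps only the low 8 bits, so out-of-range values wrap around. */
static inline uint8_t mask_to_byte(int v)
{
    return (uint8_t)(v & 0xFF);
}

/* Clamping (saturation) pins out-of-range values to the nearest limit. */
static inline uint8_t clamp_to_byte(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}
```

With these, mask_to_byte(256) gives 0 and mask_to_byte(257) gives 1, while
clamp_to_byte() returns 255 for both, which is what a color channel needs.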

> <blush> I can't C anymore... I used color clipping in C before, and never
> used masking. MMX addiction, likely.
Oh, that's no problem; perhaps if you write enough MMX code you will be cured.

> Anyway, there's a (non-portable) solution for integer code: CMOV, which can
That's interesting, but I guess all processors that support it also do MMX.
However, it might be interesting to try writing several different routines,
such as MMX and SSE2 versions, and, depending on the processor, using a
function pointer to call the fastest one.
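
A rough sketch of that dispatch idea in C could look like the following.
Everything here is hypothetical: the convert_fn signature, the CPU_HAS_*
flags and the routine names are placeholders, and the "MMX" and "SSE2"
routines simply forward to the C version so the example compiles anywhere.
Real code would fill in the flags from CPUID once at startup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by all variants of a routine. */
typedef void (*convert_fn)(const uint8_t *src, uint8_t *dst, size_t count);

/* Hypothetical feature flags; real code would derive them from CPUID. */
enum { CPU_HAS_MMX = 1 << 0, CPU_HAS_SSE2 = 1 << 1 };

/* Portable fallback; a plain copy stands in for the real work. */
static void convert_c(const uint8_t *src, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}

/* Stand-ins for hand-written SIMD versions (they just call the C code). */
static void convert_mmx(const uint8_t *src, uint8_t *dst, size_t count)
{
    convert_c(src, dst, count);
}

static void convert_sse2(const uint8_t *src, uint8_t *dst, size_t count)
{
    convert_c(src, dst, count);
}

/* Pick the most capable variant that the CPU flags allow. */
static convert_fn select_convert(unsigned cpu_flags)
{
    if (cpu_flags & CPU_HAS_SSE2)
        return convert_sse2;
    if (cpu_flags & CPU_HAS_MMX)
        return convert_mmx;
    return convert_c;
}
```

The selection would be done once, and the returned pointer stored and called
for every frame, so the per-call cost is a single indirect call.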

> Thanks for the examples and explanations. It's funny that a lookup will be
> the fastest way of doing things on a modern CPU - do you test on a P4 or
> Athlon?
The lookup code has the advantage of no jumps and no multiplications in the
inner loop.
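
As an illustration of the "no jumps" part, here is a minimal sketch of a clamp
table in C. It is not the actual table layout from the conversion code; the
names and the assumed input range of -512 to 767 are chosen just for this
example. The range checks happen once when the table is built, so the inner
loop only does an add and a load:

```c
#include <stdint.h>

#define CLAMP_OFFSET 512                       /* assumed room for negative sums */
#define CLAMP_SIZE   (CLAMP_OFFSET + 256 + 512) /* covers -512 .. 767 */

static uint8_t clamp_table[CLAMP_SIZE];

/* Build the table once; all comparisons are paid here, not per pixel. */
static void init_clamp_table(void)
{
    for (int i = 0; i < CLAMP_SIZE; i++) {
        int v = i - CLAMP_OFFSET;
        clamp_table[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Branch-free clamp; only valid for v in [-512, 767]. */
static inline uint8_t clamp_lookup(int v)
{
    return clamp_table[v + CLAMP_OFFSET];
}
```

The multiplications can be removed the same way, by tabulating each
coefficient product for all 256 possible input values.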

> If you've gone through so many pains with this, I guess the code is in a
> critical path? Maybe you should try a x86 version using inline CMOVs, this
Yes, it's used when playing back videos with the new media kit on an RGB32
display. This is done if the YCbCr overlay is already in use by another video
being played at the same time, or is not available at all. It is called for
every frame of the video.

> And if performance is really critical I could take a look at these routines
> to see if SIMD versions would be possible. But actual implementation would
> be far away, I've glanced at the code and it sure isn't trivial.
A SIMD implementation would be nice. The code is easier to follow if you look
at the non-table-based gfx_conv_YCbCr420p_RGB32_c function. To push the
compiler into better optimization, a few variables are reused.
Basically, you have three input pointers, to the Y (pi1), Cb (pi2) and Cr (pi3)
components.
It then gets a little complicated, because even and odd lines are processed at
the same time, as the chroma (Cb, Cr) data is downsampled. But the basic
calculation for each pixel is:

R = 1.164(Y - 16) + 1.596(Cr - 128)
G = 1.164(Y - 16) - 0.813(Cr - 128) - 0.392(Cb - 128)
B = 1.164(Y - 16) + 2.017(Cb - 128)
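
Written out as straightforward C, the formulas and the two-lines-at-once walk
over 4:2:0 data could be sketched as below. This is not the
gfx_conv_YCbCr420p_RGB32_c code itself: the rgb_t struct, the function names,
the use of floating point, and the assumptions of even width and height with
tightly packed planes are all simplifications for illustration.

```c
#include <stdint.h>

/* Hypothetical output type; the real code writes packed RGB32 pixels. */
typedef struct { uint8_t r, g, b; } rgb_t;

/* Clamp to 0..255 and round to nearest. */
static uint8_t clamp_round(double v)
{
    if (v < 0.0)
        return 0;
    if (v > 255.0)
        return 255;
    return (uint8_t)(v + 0.5);
}

/* Per-pixel conversion using the coefficients given above. */
static rgb_t ycbcr_to_rgb(int y, int cb, int cr)
{
    double luma = 1.164 * (y - 16);
    rgb_t out;
    out.r = clamp_round(luma + 1.596 * (cr - 128));
    out.g = clamp_round(luma - 0.813 * (cr - 128) - 0.392 * (cb - 128));
    out.b = clamp_round(luma + 2.017 * (cb - 128));
    return out;
}

/* 4:2:0 planar input: one Cb and one Cr sample per 2x2 block of Y samples.
 * Assumes even width and height and no row padding. */
static void convert_420p(const uint8_t *y, const uint8_t *cb,
                         const uint8_t *cr, rgb_t *out,
                         int width, int height)
{
    for (int row = 0; row < height; row += 2) {
        const uint8_t *y_even = y + row * width;   /* even line */
        const uint8_t *y_odd = y_even + width;     /* odd line */
        const uint8_t *cb_row = cb + (row / 2) * (width / 2);
        const uint8_t *cr_row = cr + (row / 2) * (width / 2);
        rgb_t *out_even = out + row * width;
        rgb_t *out_odd = out_even + width;

        for (int col = 0; col < width; col += 2) {
            int u = cb_row[col / 2];               /* shared by 4 pixels */
            int v = cr_row[col / 2];
            out_even[col] = ycbcr_to_rgb(y_even[col], u, v);
            out_even[col + 1] = ycbcr_to_rgb(y_even[col + 1], u, v);
            out_odd[col] = ycbcr_to_rgb(y_odd[col], u, v);
            out_odd[col + 1] = ycbcr_to_rgb(y_odd[col + 1], u, v);
        }
    }
}
```

Handling the even and odd line together means each chroma pair is loaded once
and used for all four pixels it covers, which is also the natural unit of work
for a SIMD version.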

Best regards,

Marcus Overhagen

