[haiku-development] Optimizing Painter::_DrawBitmapBilinearCopy32 (was: Re: ShowImage patch)

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Wed, 11 Mar 2009 17:59:32 +0100

Stephan Aßmus - 2009-03-09 11:41:
Christian Packmann wrote:

However, I've realized one thing you might try: loop unrolling, that is, processing two pixels in one loop iteration.

Thanks for the hint, I'll try to incorporate this next time. For the moment I won't touch the code - I don't want to step on your toes! :-)

I may test this on my own for comparative benchmarking. If I do this, we can integrate that into your code.
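For reference, the unrolling idea in its simplest form - a hypothetical scalar sketch, not Painter code; the `* 2` stands in for the real per-pixel filtering work:

```python
# Hypothetical sketch of two-pixels-per-iteration loop unrolling.
def process_unrolled(pixels):
    out = [0] * len(pixels)
    i, n = 0, len(pixels)
    while i + 1 < n:
        # unrolled body: two pixels handled per pass
        out[i] = pixels[i] * 2
        out[i + 1] = pixels[i + 1] * 2
        i += 2
    if i < n:
        # scalar tail for an odd pixel count
        out[i] = pixels[i] * 2
    return out
```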

2. How much faster is the processing variant optimizeForLowFilterRatio? [...]

I am sure there was an "ok" speed up which made it worthwhile. The speed up is only there if the simpler cases of the inner loops are frequently met, that's why it checks the scaling factors in the beginning and has two completely separate versions. But the speed up was not /substantial/ or anything like that. So you should be fine just concentrating on the generic version.

Good. Very good, even - this is turning out a bit more complicated than I imagined.

3. Just to make sure I understand this right (I'm 99% sure, but...):
When you perform the weighting with
    (s[0] * wLeft + s[4] * wRight) * wTop
the inner part should stay safely in int16 range (actually, uint8) as wLeft+wRight = 255. The multiplication with wTop takes this to the maximum of uint16 range, i.e. 65536.

The range of the inner part is [0..65025]. Multiplied by wTop, the resulting range is [0..16581375]. That's why it's using uint32 and shifting >> 16. The shift divides by 65536 rather than by 255 * 255 (= 65025), so it's slightly imprecise. If divisions are cheap enough in SSE, you could use those instead.
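Spelled out with concrete numbers (all inputs 0..255, and wLeft + wRight == 255, wTop + wBottom == 255):

```python
# Worked range check for (s[0] * wLeft + s[4] * wRight) * wTop
s_max, w_sum = 255, 255
inner_max = s_max * w_sum        # 65025 - exceeds int16, fits uint16
outer_max = inner_max * w_sum    # 16581375 - needs uint32
exact = outer_max // 65025       # 255, the precise result
shifted = outer_max >> 16        # 253 - >> 16 divides by 65536, not 65025
```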

Divisions are never cheap. Don't use them, ever. They are evil. :-) SSE doesn't support integer division anyway, only computation of reciprocals which you can use with multiplication to achieve the same effect.

But as the divisor is fixed in our case, we can try to find a suitable reciprocal for fixed-point calculations. It so happens we have a perfect match in the 8-bit range, as 65536/65025 * 2^7 == 129.00589.
So we can do
     (component_value * 129) >> 23
to approximate the division by 65025.

Check in Python:
for a in range(4000000, 4400000, 15000):
    print(a // 65025, a >> 16, (a * 129) >> 23)

Seems to work well. You can add this to the integer code to get better precision, if you can live with the three additional MULs. It should not exceed the valid uint32 range: the sum of the wTop and wBottom terms is below 16581375, and multiplied by 129 that is below 2138997375. Otherwise a down-shift of a few bits prior to the multiplication would be required.
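The stated bounds are easy to sanity-check:

```python
# Maximum of the summed wTop/wBottom terms: 255 (pixel value)
# * 255 (wLeft + wRight) * 255 (wTop + wBottom) = 16581375,
# and * 129 that still fits comfortably in 32 bits.
max_sum = 255 * 255 * 255
assert max_sum == 16581375
assert max_sum * 129 == 2138997375
assert max_sum * 129 < 2**31     # fits even a signed 32-bit int
```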

This is really important - if I can do most of the ops within int16 range, I can use 8-way parallelism (8x16 bit in a 128-bit register) instead of 4-way parallelism. [...]

Two options. Either you really want it to stay in uint16 range - then you could simply do another division in between, before the row weights are multiplied. That means twice the divisions (or shifts) in the inner loop, but two times as many pixels per iteration. Or you halve the number of pixels. Whatever is faster.
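The first option could look roughly like this - a hypothetical scalar sketch of renormalizing the horizontal blend back to 8 bits before the row weights are applied, so every intermediate stays within uint16 (the >> 8 stands in for whatever division or shift is chosen):

```python
# Hypothetical uint16-friendly variant: divide in between, before the
# row weights are multiplied. >> 8 approximates / 255 (slightly dark).
def blend16(s0, s4, s8, s12, w_left, w_right, w_top, w_bottom):
    # w_left + w_right == 255, w_top + w_bottom == 255
    top = (s0 * w_left + s4 * w_right) >> 8      # back to 0..254
    bottom = (s8 * w_left + s12 * w_right) >> 8
    # maximum here is 254 * 255 = 64770, still within uint16
    return (top * w_top + bottom * w_bottom) >> 8
```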

Ah well. This is all moot for now; I wrote a pure SSE2 16-bit version, but this was very buggy and needed so many data conversions that I ended up with too many instructions within the loop.

I've scrapped that version and am concentrating on pure 32-bit versions for the moment. However, for SSE < 4.1 I have to use float, as there's no packed 32-bit integer multiplication in lower SSE versions (sometimes I *hate* SSE). So the SSE2 version will actually perform much like the routine I wrote from the ShowImage code. It should still be fast, but the relative speedup won't be as large as with ShowImage, as your code only computes three components per pixel, not four.
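For clarity, the arithmetic the float path computes per component would be something like this (weights normalized to 0..1; a sketch of the math only, not of the SSE2 instruction sequence):

```python
# Hypothetical per-component bilinear blend with normalized float weights.
def blend_float(s0, s4, s8, s12, w_left, w_top):
    w_right, w_bottom = 1.0 - w_left, 1.0 - w_top
    return (s0 * w_left + s4 * w_right) * w_top \
         + (s8 * w_left + s12 * w_right) * w_bottom
```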

I will probably also do a SSSE3 version, as SSSE3 adds the PSHUFB instruction which allows much more efficient data shuffling; this will save a few instructions. As the Atom supports SSSE3, it should profit from this. Won't be much, 5-10%, but on an Atom every cycle counts. And there's no big rewrite needed, just exchange of a few instructions.

I'll also look into the 16-bit approach again, but with an MMX routine. There are two good reasons for that:

1. Many CPUs lack SSE2, including Pentium III, Athlon, AthlonXP and older VIA processors. Offering high-performance code for them would be nice.

2. Some CPUs with SSE2 aren't actually fast in executing SSE2 code. From some tidbits on the Handbrake forums I gather that some CPUs can execute MMX code faster than equivalent SSE2 code. I don't know for what mixture of code and CPUs this applies, but we can only find out with comparative benchmarking. This information will be important to make future decisions about desirable vector codepaths.

An MMX version may suffer from lower precision, but on slower systems this might be an acceptable tradeoff for the speed gains.

[...]
I'm happy that you work on this!

I already noticed your enthusiasm. ;-) It's nice to know one's work is appreciated, quite apart from the fact that I enjoy doing some low-level hacking again.

> I won't run out of app_server code to
> point you to, once you have that filter working... :-D

I was waiting for you to say that! :-P


Ah, but one more thing. I'm currently only writing code for the inner x-processing, and have encapsulated that into a routine; this is the loop

    for (int32 x = xIndexL; x <= xIndexMax; x++) {

at line 2332 in the current version. The (draft of the) C prototype is

extern "C" void biscale_xloop_sse2(uint8_t *src, uint8_t *d,
               FilterInfo *xWeights,
               uint32_t xIndexL, uint32_t xIndexMax,
               uint32_t wTop, uint32_t wBottom, uint32_t srcBPR);

I'd like to keep this in a subroutine at this place, i.e. the y-loop stays as C++ code, only the x-loop is done in assembly. I'd prefer this approach to keep the assembly to a minimum - especially as there will be several variants of the same routine to write and to maintain. This also means that the processing of the last column and last row stays in C++, but these should only be a fraction of the method's total runtime.

I /think/ this limit to the x-loop should be no performance problem in most cases. IIUC the length of the loop will depend on the currently processed clipping rectangle, this should normally be several pixels wide. But if the clipping rectangles get very small horizontally, the vector routines will perform worse than the C++ code due to the overhead of the function call and internal setup. If you think this will be a problem, I'll change the scope of the routine to include the y-loop.
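To make the intended split concrete, here is a hypothetical scalar model of the division of labor - a single-channel stand-in where biscale_xloop plays the role of the assembly routine and the caller keeps the y-loop. The names mirror the prototype above, but the real code processes 32-bit BGRA, not one 8-bit channel:

```python
# Hypothetical scalar model of the x-loop subroutine. x_weights[x] is a
# (source index, wLeft, wRight) tuple; src is a flat buffer of 8-bit
# samples with src_bpr bytes per row.
def biscale_xloop(src, dst, x_weights, x_index_l, x_index_max,
                  w_top, w_bottom, src_bpr):
    for x in range(x_index_l, x_index_max + 1):
        idx, w_left, w_right = x_weights[x]
        top = src[idx] * w_left + src[idx + 1] * w_right
        bottom = (src[idx + src_bpr] * w_left
                  + src[idx + src_bpr + 1] * w_right)
        dst[x] = (top * w_top + bottom * w_bottom) >> 16
```

The C++ y-loop would compute wTop/wBottom per output row and call this once per row, handling the last row and column itself.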

Christian
