[haiku-development] Optimizing Painter::_DrawBitmapBilinearCopy32 (was: Re: ShowImage patch)

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Wed, 11 Mar 2009 17:59:32 +0100

Stephan Aßmus - 2009-03-09 11:41:
Christian Packmann wrote:

However, I've realized one thing you might try: loop unrolling, that is, processing two pixels in one loop iteration.

Thanks for the hint, I'll try to incorporate this next time. For the moment I won't touch the code - I don't want to step on your toes! :-)

I may test this on my own for comparative benchmarking. If I do this, we can integrate that into your code.
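For reference, the unrolling idea in its simplest form - a hypothetical scalar sketch, not Painter code; the `* 2` stands in for the real per-pixel filtering work:

```python
# Hypothetical sketch of two-pixels-per-iteration loop unrolling.
def process_unrolled(pixels):
    out = [0] * len(pixels)
    i, n = 0, len(pixels)
    while i + 1 < n:
        # unrolled body: two pixels handled per pass
        out[i] = pixels[i] * 2
        out[i + 1] = pixels[i + 1] * 2
        i += 2
    if i < n:
        # scalar tail for an odd pixel count
        out[i] = pixels[i] * 2
    return out
```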

2. How much faster is the processing variant optimizeForLowFilterRatio? [...]

I am sure there was an "ok" speed up which made it worthwhile. The speed up is only there if the simpler cases of the inner loops are frequently met, that's why it checks the scaling factors in the beginning and has two completely separate versions. But the speed up was not /substantial/ or anything like that. So you should be fine just concentrating on the generic version.

Good. Very good, even - this is turning out a bit more complicated than I imagined.

3. Just to make sure I understand this right (I'm 99% sure, but...):
When you perform the weighting with
    (s[0] * wLeft + s[4] * wRight) * wTop
the inner part should stay safely in int16 range (actually, uint8) as wLeft+wRight = 255. The multiplication with wTop takes this to the maximum of uint16 range, i.e. 65536.

The range of the inner part is [0..65025]. Multiplied by wTop, the resulting range is [0..16581375]. That's why it's using uint32 and shifting >> 16. The shift divides by 65536 rather than by 255 * 255 (= 65025), so it's slightly imprecise. If divisions are cheap enough in SSE, you could use those instead.
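Spelled out with concrete numbers (all inputs 0..255, and wLeft + wRight == 255, wTop + wBottom == 255):

```python
# Worked range check for (s[0] * wLeft + s[4] * wRight) * wTop
s_max, w_sum = 255, 255
inner_max = s_max * w_sum        # 65025 - exceeds int16, fits uint16
outer_max = inner_max * w_sum    # 16581375 - needs uint32
exact = outer_max // 65025       # 255, the precise result
shifted = outer_max >> 16        # 253 - >> 16 divides by 65536, not 65025
```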

Divisions are never cheap. Don't use them, ever. They are evil. :-) SSE doesn't support integer division anyway, only computation of reciprocals which you can use with multiplication to achieve the same effect.

But as the divisor is fixed in our case, we can try to find a suitable reciprocal for fixed-point calculations. It so happens we have a perfect match in the 8-bit range, as 65536/65025 * 2^7 == 129.00589.
So we can do
     (component_value * 129) >> 23
to approximate the division by 65025.

Check in Python:
for a in range(4000000, 4400000, 15000):
    print(a // 65025, a >> 16, (a * 129) >> 23)

Seems to work well. You can add this to the integer code to get better precision, if you can live with the three additional MULs. It should not exceed the valid uint32 range: the sum of the wTop and wBottom terms is below 16581375, and multiplied by 129 that is below 2138997375. Otherwise a down-shift of a few bits prior to the multiplication would be required.
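The stated bounds are easy to sanity-check:

```python
# Maximum of the summed wTop/wBottom terms: 255 (pixel value)
# * 255 (wLeft + wRight) * 255 (wTop + wBottom) = 16581375,
# and * 129 that still fits comfortably in 32 bits.
max_sum = 255 * 255 * 255
assert max_sum == 16581375
assert max_sum * 129 == 2138997375
assert max_sum * 129 < 2**31     # fits even a signed 32-bit int
```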

This is really important - if I can do most of the ops within int16 range, I can use 8-way parallelism (8x16 bit in a 128-bit register) instead of 4-way parallelism. [...]

Two options. Either you really want it to stay in uint16 range - then you could simply do another division in between, before the row weights are multiplied. That means twice the divisions (or shifts) in the inner loop, but two times as many pixels per iteration. Or you halve the number of pixels. Whatever is faster.
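The first option could look roughly like this - a hypothetical scalar sketch of renormalizing the horizontal blend back to 8 bits before the row weights are applied, so every intermediate stays within uint16 (the >> 8 stands in for whatever division or shift is chosen):

```python
# Hypothetical uint16-friendly variant: divide in between, before the
# row weights are multiplied. >> 8 approximates / 255 (slightly dark).
def blend16(s0, s4, s8, s12, w_left, w_right, w_top, w_bottom):
    # w_left + w_right == 255, w_top + w_bottom == 255
    top = (s0 * w_left + s4 * w_right) >> 8      # back to 0..254
    bottom = (s8 * w_left + s12 * w_right) >> 8
    # maximum here is 254 * 255 = 64770, still within uint16
    return (top * w_top + bottom * w_bottom) >> 8
```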

Ah well. This is all moot for now; I wrote a pure SSE2 16-bit version, but this was very buggy and needed so many data conversions that I ended up with too many instructions within the loop.

I've scrapped that version and am concentrating on pure 32-bit versions for the moment. However, for SSE < 4.1 I have to use float, as there's no packed 32-bit integer multiplication in lower SSE versions (sometimes I *hate* SSE). So the SSE2 version will actually perform much like the routine I wrote from the ShowImage code. It should still be fast, but the relative speedup won't be as large as with ShowImage, as your code only computes three components per pixel, not four.
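For clarity, the arithmetic the float path computes per component would be something like this (weights normalized to 0..1; a sketch of the math only, not of the SSE2 instruction sequence):

```python
# Hypothetical per-component bilinear blend with normalized float weights.
def blend_float(s0, s4, s8, s12, w_left, w_top):
    w_right, w_bottom = 1.0 - w_left, 1.0 - w_top
    return (s0 * w_left + s4 * w_right) * w_top \
         + (s8 * w_left + s12 * w_right) * w_bottom
```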

I will probably also do a SSSE3 version, as SSSE3 adds the PSHUFB instruction which allows much more efficient data shuffling; this will save a few instructions. As the Atom supports SSSE3, it should profit from this. Won't be much, 5-10%, but on an Atom every cycle counts. And there's no big rewrite needed, just exchange of a few instructions.

I'll also look into the 16-bit approach again, but with an MMX routine. There are two good reasons for that:

1. Many CPUs lack SSE2, including Pentium III, Athlon, AthlonXP and older VIA processors. Offering high-performance code for them would be nice.

2. Some CPUs with SSE2 aren't actually fast in executing SSE2 code. From some tidbits on the Handbrake forums I gather that some CPUs can execute MMX code faster than equivalent SSE2 code. I don't know for what mixture of code and CPUs this applies, but we can only find out with comparative benchmarking. This information will be important to make future decisions about desirable vector codepaths.

An MMX version may suffer from lower precision, but on slower systems this might be an acceptable tradeoff for the speed gains.

[...]
I'm happy that you work on this!

I already noticed your enthusiasm. ;-) It's nice to know one's work is appreciated, quite apart from the fact that I enjoy doing some low-level hacking again.

> I won't run out of app_server code to
> point you to, once you have that filter working... :-D

I was waiting for you to say that! :-P


Ah, but one more thing. I'm currently only writing code for the inner x-processing, and have encapsulated that into a routine; this is the loop

    for (int32 x = xIndexL; x <= xIndexMax; x++) {

at line 2332 in the current version. The (draft of the) C prototype is

extern "C" void biscale_xloop_sse2(uint8_t *src, uint8_t *d,
               FilterInfo *xWeights,
               uint32_t xIndexL, uint32_t xIndexMax,
               uint32_t wTop, uint32_t wBottom, uint32_t srcBPR);

I'd like to keep this in a subroutine at this place, i.e. the y-loop stays as C++ code, only the x-loop is done in assembly. I'd prefer this approach to keep the assembly to a minimum - especially as there will be several variants of the same routine to write and to maintain. This also means that the processing of the last column and last row stays in C++, but these should only be a fraction of the method's total runtime.

I /think/ this limit to the x-loop should be no performance problem in most cases. IIUC the length of the loop will depend on the currently processed clipping rectangle, this should normally be several pixels wide. But if the clipping rectangles get very small horizontally, the vector routines will perform worse than the C++ code due to the overhead of the function call and internal setup. If you think this will be a problem, I'll change the scope of the routine to include the y-loop.
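To make the intended split concrete, here is a hypothetical scalar model of the division of labor - a single-channel stand-in where biscale_xloop plays the role of the assembly routine and the caller keeps the y-loop. The names mirror the prototype above, but the real code processes 32-bit BGRA, not one 8-bit channel:

```python
# Hypothetical scalar model of the x-loop subroutine. x_weights[x] is a
# (source index, wLeft, wRight) tuple; src is a flat buffer of 8-bit
# samples with src_bpr bytes per row.
def biscale_xloop(src, dst, x_weights, x_index_l, x_index_max,
                  w_top, w_bottom, src_bpr):
    for x in range(x_index_l, x_index_max + 1):
        idx, w_left, w_right = x_weights[x]
        top = src[idx] * w_left + src[idx + 1] * w_right
        bottom = (src[idx + src_bpr] * w_left
                  + src[idx + src_bpr + 1] * w_right)
        dst[x] = (top * w_top + bottom * w_bottom) >> 16
```

The C++ y-loop would compute wTop/wBottom per output row and call this once per row, handling the last row and column itself.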

Christian
