On 2009-06-16 at 15:09:06 [+0200], André Braga <meianoite@xxxxxxxxx> wrote:
> On 16/06/2009, at 09:14, Christian Packmann
> <Christian.Packmann@xxxxxx> wrote:
> > The SSE2/SSSE3 routines are also improved. Of the unrolled versions
> > only the SSSE3 variant is finished, the MMX and SSE2 variants need more
> > work. I'm sceptical that they will yield much improvement, anyway; the
> > unrolled SSSE3 only gives 14% more performance than the unrolled
> > version, I don't think improvements will be much greater for MMX/SSE2,
> > but maybe some CPUs will perform well on them.
>
> Just for kicks, could you compile a static .o for AMD64 that we could
> then link to produce an executable for a 64-bit OS of choice? I'd like to
> see what GCC4.2+ manage to do to your code with extra registers,
> optimization levels and autovectorization switches.
>
> Also, I see that you have SSSE3 versions for the routines, but why not
> SSE3 "plain" with 33.33% less S? :)
>
> No useful added functionality in those 13 extra instructions compared to
> what you're already doing in SSE2?

Hm, it appears the optimized code is already about twice as fast as the
plain C code for pretty much every architecture that was benchmarked. So
what I would love to see is a patch against app_server which integrates
this code so I can watch movies fullscreen with smooth scaling. If the
code can later be made even faster, nice, but it's darn useful already.
Or would be if there were a patch. ;-)

Best regards,
-Stephan