Attached is a PNG of the spreadsheet in which I have collated the results. Nice to see so many helpful people, and good to see a result slower than my N270 Atom :-)

I would really like to replace the two V1 results with V3 results, so if you have the AMDX2 or X9650, please send me your latest results.

(The last two columns are the speedup over the C code and the speedup from 1 to 2 CPUs for the SSE2 code.)

The minimum speedup from C to SSE2 should be 4, as the SSE2 code processes 4 times the data in parallel. We generally just beat that on the AMD CPUs and easily beat it on the Intel CPUs (AMD's first-generation SSE2 engine was not that good :-( ). Any further speedup comes from more efficient unpacking of the YUV data and packing of the RGB data (that is basically the change from V2 to V3).

As for scaling from 1 processor to 2: there is some benefit, but it quickly drops away because the process is limited by memory bandwidth.

I am going to start looking at implementing the SSE2 code in libORC (http://www.schleef.org/blog/2009/05/31/orc-040/) to see the differences.

The eventual aim is to implement a colour conversion node in the media kit, similar to the audio format conversion we already have built in, so that video codecs, like audio codecs, can work in whichever format suits them best.

--
Cheers
David
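
P.S. For anyone following along without the source to hand, here is a minimal sketch of the kind of scalar C baseline the speedups are measured against. It is not the code that was benchmarked. It assumes packed YUY2 input (Y0 U Y1 V), BGRA32 output and the common BT.601 fixed-point coefficients, and the function names are made up for illustration. Each pixel is unpacked, converted and packed one at a time, which is the work an SSE2 version would do for 4 pixels per step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of one colour channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one YUV sample to one BGRA pixel (assumed BT.601, video range).
 * Division by 256 is used instead of >> 8 so negative intermediates are
 * well defined; they clamp to 0 either way. */
static void yuv_to_bgra(uint8_t y, uint8_t u, uint8_t v, uint8_t *out)
{
    int c = (int)y - 16;
    int d = (int)u - 128;
    int e = (int)v - 128;

    out[0] = clamp_u8((298 * c + 516 * d + 128) / 256);           /* B */
    out[1] = clamp_u8((298 * c - 100 * d - 208 * e + 128) / 256); /* G */
    out[2] = clamp_u8((298 * c + 409 * e + 128) / 256);           /* R */
    out[3] = 255;                                                 /* A */
}

/* Convert one row of packed YUY2 (hypothetical input layout: Y0 U Y1 V)
 * to BGRA32. Two pixels share one U/V pair, so width must be even. */
void yuy2_row_to_bgra(const uint8_t *src, uint8_t *dst, size_t width)
{
    assert(width % 2 == 0);
    for (size_t x = 0; x < width; x += 2) {
        const uint8_t *p = src + x * 2;   /* 4 source bytes per 2 pixels */
        yuv_to_bgra(p[0], p[1], p[3], dst + x * 4);
        yuv_to_bgra(p[2], p[1], p[3], dst + (x + 1) * 4);
    }
}
```

Calling yuy2_row_to_bgra(src, dst, width) once per scanline converts a frame. The loop reads 2 bytes and writes 4 bytes per pixel with very little arithmetic in between, so once the maths is vectorised it is not surprising that memory bandwidth becomes the limit.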