[openbeos] Re: app_server: MMX/SSE help wanted

I think you're being overly harsh on SSE(2). The thing to note here isn't that SSE's smaller register set is the limitation, but that the functional implementation of SSE in the hardware is the problem.

As Christian pointed out, there just isn't enough bandwidth available to desktop parts. Bear in mind that something like a P4 can resolve 2x128b load/store operations per clock. That's 32B/clk, or 89.4GB/s @ 3GHz of load/store bandwidth (counting a GB as 2^30 bytes). Assuming that all your data is inside the L1 or even L2, you can expect 3-4 clocks of latency as a best case, so the bandwidth is now down to 22.35GB/s average load/store.

Now 22.35GB/s peak theoretical bandwidth isn't possible to sustain, given that current memory controllers are still only offering 6.4+GB/s. Integrated memory controllers are about 90-95% efficient, but the P4 is only about 80-85% efficient.

So our processor is only really going to see around 5.8GB/s, and that's assuming that the data is being prefetched nicely.
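
For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. It only replays the figures quoted above; nothing in it is measured. The function names are made up for illustration, and dividing the peak by 4 is my reading of the "3-4 clocks" best case, so treat it as an assumption.

```python
# Back-of-envelope check of the bandwidth figures quoted in this post.
# Every input is a number from the post itself; nothing here is measured.

GIB = 2**30  # the "GB/s" figures above only work out with 2**30-byte units


def peak_load_store_gib(ops_per_clk=2, op_bits=128, clock_hz=3e9):
    """Peak load/store bandwidth in GiB/s for the quoted P4 figures."""
    bytes_per_clk = ops_per_clk * op_bits // 8  # 2 x 128 bits = 32 bytes
    return bytes_per_clk * clock_hz / GIB


def sustained_memory_gb(peak_gb=6.4, efficiency=0.90):
    """Memory bandwidth left after controller efficiency is applied."""
    return peak_gb * efficiency


if __name__ == "__main__":
    peak = peak_load_store_gib()
    # Assumption: a 4-clock latency simply divides the peak rate by 4.
    with_latency = peak / 4
    print(f"peak load/store:      {peak:.1f} GiB/s")
    print(f"with 4-clock latency: {with_latency:.2f} GiB/s")
    for eff in (0.80, 0.85, 0.90, 0.95):
        print(f"6.4 GB/s at {eff:.0%}: {sustained_memory_gb(6.4, eff):.2f} GB/s")
```

Run as a script, it prints the peak figure, the latency-limited figure, and the sustained memory bandwidth for each efficiency in the 80-95% range. The 5.8GB/s number corresponds to roughly 90% of 6.4GB/s; at 80-85% the result is closer to 5.1-5.4GB/s.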

If I remember correctly, the P4 has a dual read port FRAT with 256 entries of 128b that runs at full core speed, allowing a theoretical 6 billion read operations per second, again equating to 89.4GB/s. Those 256 registers are mapping onto a 15 deep pipeline, and the theoretical maximum parallelism extractable from the 8 XMM registers is 3-way, so at any given time three instructions can safely be executed. That's 7 sets of two registers in flight at any given time; given that the P4 has a 15 deep FPU, even in a best case scenario it can only just fill all stages of the pipeline.


x87 code, on the other hand, uses a stack, so it's much more difficult to attain parallelism, since nearly all values have to write back before they can begin the next stage.

Adding an AltiVec style separation of SIMD from FP would allow a simpler scheduler to be used for SIMD instructions, but since the FPU's pipeline is rarely saturated in the P4, it doesn't really have much use for an extra pipe.

Separating the execution of SIMD and Scalar FP could really only be justified if there were enough bandwidth to keep the registers full.

Properly optimized SSE(2) code is just as fast as (or faster than) comparable AltiVec.

I can justify my arguments by pointing at the K8's poor performance in SSE2. The K7/K8 family is heavily focused on x87 operations, with a 72 entry re-order buffer specifically for x86 operations. The K8 would benefit from separating the scalar FPU and SIMD execution, because most of the SIMD operations on the XMM registers could be re-ordered with far greater ease if they were independent from the x87 reorder. Separating them would remove two stages that probably aren't necessary for the SIMD operations.

Admittedly, the K8 also suffers because it can only fetch 128b across two cycles, which hurts it significantly on SSE(2) load/store operations.

I've probably made plenty of mistakes, so feel free to pull my idea apart. I'll do my best to cellotape it back together :)

bye

From: Christian Packmann <Christian.Packmann@xxxxxx>
Reply-To: openbeos@xxxxxxxxxxxxx
To: openbeos@xxxxxxxxxxxxx
Subject: [openbeos] Re: app_server: MMX/SSE help wanted
Date: Tue, 10 Aug 2004 22:06:50 +0200

On 2004-08-10 13:19:40 [+0200], Adi Oanca wrote:
> Christian Packmann wrote:

>> They've got a nice introductory article about SIMD architectures, going
>> into some detail on MMX, SSE, 3DNow! and PowerPCs AltiVec; this would
>> be a good read for anybody interested in the possibilities of SIMD:
>> <http://arstechnica.com/cpu/1q00/simd/simd-1.html>
>
>     Oh my god, Intel/AMD's SIMD implementation sucks BIG time!

Hey, x86 sucks big time, what did you expect? ;)

>     Altivec is by far the best implementation and the easiest to work
>     with! Also, it using 32 completely independent registers would make
>     it about 2 times faster than SSE2/3.

For heavy scientific computations, not for desktop use. The main limitation
with SIMD usually is RAM bandwidth, not processing power.

Only the PPC970 is really interesting, as it has a frontside bus comparable
to modern AMD/Intel CPUs. Motorolas 74xx series is crippled by its slow
system bus, only 133MHz SDR IIRC - that's equivalent to a PIII. A 74xx will
spend most of the time waiting for data to arrive - I can have that with my
XP, which is much cheaper.

>     If only SSE4 == Altivec... ehhhh...

Well, x86/64 offers 16 128bit registers. That's enough for most purposes,
even if it isn't quite as nice as AltiVec. But if we should ever get OBOS
on  affordable PPC970 machines, I'd be tempted to recycle my x86 as
Windows-only game console...

Bye,
Chris

