[haiku] Re: Haiku's SMP

  • From: "Cyan" <cyanh256@xxxxxxxxxxxx>
  • To: haiku@xxxxxxxxxxxxx
  • Date: Mon, 17 Nov 2008 22:58:28 GMT

Nick <tonestone57@xxxxxxxxxxx> wrote:

> I used the Chart benchmark because it was quick, simple and CPU 
> intensive..  Only thing is that it supports 1 and 2 Threads meaning
> it's only good for single and dual core benches & comparisons.
> Chart is not good for testing quad core performance.  Require app
> with 4+ CPU intensive threads.  Video encoder?

I don't think any of the currently-available video encoders use
multiple threads yet. One application that comes to mind offhand is
XaoS -- a realtime fractal zooming app. I haven't tested it with
Haiku, but under R5 it uses all four cores to accelerate zooming.

Benchmarking is quite difficult really; there are so many variables.
The old BeOS standby is usually Teapot. When it is set to
multiple-launch mode, several copies can be run at once (the tests
below were conducted in R5):

One teapot, four cores:
http://knothole.no-ip.org/Tea1

Two teapots, four cores:
http://knothole.no-ip.org/Tea2

Three teapots, four cores:
http://knothole.no-ip.org/Tea3
(note the imperfect balancing here -- partly due to R5's scheduler,
possibly also due to teapot positions)

Four teapots, four cores:
http://knothole.no-ip.org/Tea4

Notice how the performance drops as each teapot is added, with a
sharp drop going from two to three teapots. There are two reasons
for the general decline: video bandwidth is very limited (PCIe x1
card) and is shared by all four CPUs, and memory bandwidth is
likewise shared between all four CPUs.

The sudden drop after two teapots is due to the cache architecture
of the Intel Core 2 Quad -- each pair of CPUs shares a common cache,
but the two pairs have independent caches.

You can see the difference cache sharing makes by comparing:

http://knothole.no-ip.org/Teacache1
(CPUs #1 and #2 enabled)

http://knothole.no-ip.org/Teacache2
(CPUs #1 and #3 enabled)


Different applications will exhibit dramatically different behaviour
depending on how they access memory. The only algorithms which
scale completely smoothly with the number of CPUs are those which
fit entirely (with data) into the L1 cache of each core, which is
really quite small on the Intel chips.

So yeah, finding a good benchmark for SMP systems is going to be
very difficult. The simplest option is to launch a bunch of
CPU-intensive apps (preferably ones with low memory requirements and
minimal video output) and measure the amount of slowdown with each
extra application. If SMP is working properly, each instance should
keep better than half its speed for each doubling of the number of
instances (up to the number of CPUs), but how much better depends on
very many factors.
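
For what it's worth, here is a rough sketch of that approach in
Python. It is an illustration only, not a tested Haiku benchmark:
the names (burn, wall_time, relative_speeds, doubling_ok) and the
iteration count are made up for the example, and the timing includes
process start-up overhead. Each worker is a small CPU-bound loop with
a tiny working set and no output, launched as a separate process.

```python
import multiprocessing as mp
import time


def burn(iterations):
    """CPU-bound loop with a tiny working set and no output (hypothetical load)."""
    x = 0
    for i in range(iterations):
        x = (x * 31 + i) % 1000003
    return x


def wall_time(instances, iterations):
    """Start `instances` copies of burn() at once; return total wall-clock seconds."""
    procs = [mp.Process(target=burn, args=(iterations,)) for _ in range(instances)]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start


def relative_speeds(times_by_count):
    """Per-instance speed relative to the single-instance run (1.0 = no slowdown)."""
    base = times_by_count[1]
    return {n: base / t for n, t in sorted(times_by_count.items())}


def doubling_ok(times_by_count):
    """True if per-instance speed stays above half at every doubling measured."""
    counts = sorted(times_by_count)
    return all(
        times_by_count[a] / times_by_count[b] > 0.5
        for a, b in zip(counts, counts[1:])
        if b == 2 * a
    )


if __name__ == "__main__":
    # Instance counts assume a four-CPU machine; adjust to taste.
    results = {n: wall_time(n, 2_000_000) for n in (1, 2, 4)}
    for n, speed in relative_speeds(results).items():
        print(f"{n} instance(s): {speed:.2f}x per-instance speed")
    print("better than half at each doubling:", doubling_ok(results))
```

With no SMP at all, two instances would take twice as long as one, so
the relative speed would land at exactly 0.5. Anything above that
means the extra CPUs are doing some of the work.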
