#14979: CPU scheduler has scalability issues on SMP systems with 10+ CPUs
-----------------------+------------------------------
Reporter: ambroff | Owner: nobody
Type: bug | Status: new
Priority: normal | Milestone: Unscheduled
Component: - General | Version: R1/Development
Keywords: | Blocked By:
Blocking: | Has a Patch: 0
Platform: x86-64 |
-----------------------+------------------------------
tl;dr: I found that the CPU scheduler has scalability issues on an x86_64
system with a large number of CPU cores, and narrowed the problem down to
`arch_smp_send_ici(int32)`.
Detailed summary:
I recently bought a new development workstation with a 12 core (24 thread)
CPU (Ryzen Threadripper 1920X). It has been running Linux for a while and
has been great. I decided recently to try Haiku on this computer to see if
there were any interesting hardware issues to debug. So I grabbed R1 Beta1
and tried booting it up.
It booted, but it took *ages* to get to the LiveCD language prompt,
something like 20 minutes. The system was completely unusable, so
unresponsive that I could barely move the mouse or select the "Boot to
Desktop" button.
I decided to investigate, thinking it might be a userspace performance
issue. I found that I could reproduce this easily in VirtualBox by
configuring a virtual machine with 12 cores (something I regularly do for
Linux and Windows virtual machines on this system, which has always been
fine). I tried booting with different CPU configurations from 1 core to 12
cores, and anecdotally, the system felt slower with more cores.
I found the system profiler by looking through the code and tried
collecting some samples with 2 cores and 12 cores to compare them (which
was painful because the system was so unbearably slow with 12 cores). This
ended up not being very useful, since whatever was causing the system to
feel so slow wasn't being instrumented by the profiler.
After this I found the scheduler profiler, and that's where things started
to click into place.
[[https://github.com/haiku/haiku/blob/3142fb6996948dd5e539ddcb56b0b81fe223cd26/src/system/kernel/scheduler/scheduler.cpp#L316|reschedule(int32)]]
is something like 35x slower with [[https://imgur.com/a/9x1Ccyf|12 cores]]
than it is with [[https://imgur.com/a/RyeIXdi|2 cores]] on average. The
bulk of the cost seems to be in
[[https://github.com/haiku/haiku/blob/3142fb6996948dd5e539ddcb56b0b81fe223cd26/src/system/kernel/scheduler/scheduler.cpp#L96|enqueue(Thread*,
bool)]].
Disabling CPUs at runtime, painful as it is, made the system progressively
more responsive; once I had disabled all but 4 or 6 CPUs, it was much more
usable. I also noticed that when I changed the scheduler mode to powersave,
the system was fairly responsive even with all 12 cores enabled.
That led me to compare the powersave and low_latency modes to understand
why powersave mode was so much more responsive. Since powersave mode tries
to
[[https://github.com/haiku/haiku/blob/master/src/system/kernel/scheduler/power_saving.cpp#L118|avoid
rebalancing]] unless load on the current CPU is high, I assume that
rebalancing itself is where the scaling issues are hidden.
I decided to check whether more CPUs lead to a higher thread migration
rate, i.e. how often rebalance() leads to a thread being scheduled on a
different CPU's run queue. So I added some
[[https://github.com/ambroff/haiku/commit/82af36353d736ecbdd209f7e0e42fe279eaaf48f|quick
hacky stats]] to get some idea. Comparing the average thread migration
count with [[https://imgur.com/a/NBXWIIa|two CPUs]] and
[[https://imgur.com/a/CXVrkcy|twelve CPUs]] didn't show that there is any
real difference in the number of times threads migrate between CPU cores.
Looking at
[[https://github.com/haiku/haiku/blob/3142fb6996948dd5e539ddcb56b0b81fe223cd26/src/system/kernel/scheduler/scheduler.cpp#L96|enqueue(Thread*,
bool)]], one of the only parts that isn't really instrumented is
[[https://github.com/haiku/haiku/blob/3142fb6996948dd5e539ddcb56b0b81fe223cd26/src/system/kernel/scheduler/scheduler.cpp#L136|this]]:
{{{
	int32 heapPriority = do_get_heap_priority(targetCPU);
	if (threadPriority > heapPriority
		|| (threadPriority == heapPriority && rescheduleNeeded)) {
		if (targetCPU->ID() == smp_get_current_cpu())
			gCPU[targetCPU->ID()].invoke_scheduler = true;
		else {
			smp_send_ici(targetCPU->ID(), SMP_MSG_RESCHEDULE,
				0, 0, 0,
				NULL, SMP_MSG_FLAG_ASYNC);
		}
	}
}}}
So if the rebalance() strategy has decided to enqueue this thread in
another CPU's run queue, we write a message into that CPU's SMP message
queue and send an ICI (inter-CPU interrupt) via the APIC.
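The enqueue side of that handoff is a lock-free prepend onto the target CPU's message list, done with a compare-and-set loop. Purely as an illustration (the `Message` type, `gQueueHead`, and `enqueue_message` are made-up stand-ins, not Haiku's actual `smp.cpp` code), the pattern looks like:

```cpp
#include <atomic>

// Hypothetical stand-in for a per-CPU SMP message list entry.
struct Message {
	int type;
	Message* next;
};

std::atomic<Message*> gQueueHead{nullptr};

// Prepend a message to the list with a CAS loop, the way the sender pushes
// onto the target CPU's queue. Returns the number of CAS attempts needed,
// which is what the hacky stats below count.
int enqueue_message(Message* msg)
{
	int attempts = 0;
	Message* head = gQueueHead.load(std::memory_order_relaxed);
	do {
		attempts++;
		msg->next = head;
		// On failure, head is reloaded with the current value and we retry.
	} while (!gQueueHead.compare_exchange_strong(head, msg,
			std::memory_order_release, std::memory_order_relaxed));
	return attempts;
}
```

Under contention, multiple senders racing to prepend would drive the attempt count above 1, which is why counting CAS retries seemed like a reasonable first thing to measure.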
After
[[https://github.com/ambroff/haiku/commit/1f90412f3aec8c021fce6ae9eb5b5b8a47421424|instrumenting
this code]] it definitely seems that smp_send_ici() is the
[[https://imgur.com/a/L90TEQz|hottest function]].
I added some additional hacky stats to see which part of smp_send_ici() is
slow. I suspected that we might have contention in the compare-and-set that
adds the SMP message to the target CPU's queue, or that waiting for the
interrupt was slow.
After adding a
[[https://github.com/ambroff/haiku/commit/e3bc5ebe401b369759fcd4361c8a35b34a40eb79|quick
and dirty hack]] to get an understanding of how many CAS attempts are
required, and how long
[[https://github.com/haiku/haiku/blob/master/src/system/kernel/arch/x86/arch_smp.cpp#L207|arch_smp_send_ici(int32)]]
took, I got some [[https://imgur.com/a/oicmZpY|interesting results]].
* CAS attempts: avg=1, max=1
* Send ICI Latency: avg=3611233ns, max=10557740ns
An average of ~3.6ms and a max of ~10.5ms! So it doesn't seem like there is
any contention for the SMP message queue, but waiting for the interrupt
takes forever.
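The measurement itself was nothing fancy: conceptually it just wraps the call in a timestamp pair and tracks a running average and maximum. Sketched here with std::chrono instead of the kernel's system_time(), and with made-up names (`LatencyStats`, `timed`):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Running latency stats, analogous to the hacky counters added around
// arch_smp_send_ici().
struct LatencyStats {
	int64_t count = 0;
	int64_t totalNs = 0;
	int64_t maxNs = 0;

	void record(int64_t ns)
	{
		count++;
		totalNs += ns;
		maxNs = std::max(maxNs, ns);
	}

	int64_t averageNs() const
	{
		return count == 0 ? 0 : totalNs / count;
	}
};

// Wrap any callable and record how long it took.
template<typename F>
void timed(LatencyStats& stats, F&& sendIci)
{
	auto start = std::chrono::steady_clock::now();
	sendIci();
	auto end = std::chrono::steady_clock::now();
	stats.record(std::chrono::duration_cast<std::chrono::nanoseconds>(
		end - start).count());
}
```

With something like this around both the CAS loop and the wait for the interrupt to be acknowledged, it was easy to see which of the two dominates.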
Note that this performance issue is not NUMA-related: VirtualBox reports a
single NUMA node to the guest OS, and I've been running other operating
systems with 12 cores on this machine for several months with no issue.
I'm unsure whether this is a problem on other platforms, since the only
machines I have with this many CPUs are x86_64.
--
Ticket URL: <https://dev.haiku-os.org/ticket/14979>
Haiku <https://dev.haiku-os.org>
The Haiku operating system.