[haiku-development] Re: I/O Scheduler experiment

  • From: Axel Dörfler <axeld@xxxxxxxxxxxxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Fri, 4 Oct 2019 10:43:07 +0200

On 25/09/2019 at 09:46, Kyle Ambroff-Kao wrote:

* IOSchedulerSimple tries to throttle itself by not submitting more than 4MB
   of I/O operations at a time, but this is a made-up number and the device may
   have much more bandwidth than that.

As waddlesplash mentioned, this also prevents one thread from hogging the disk. However, 4 MB really does not seem appropriate for SSDs or even HDs anymore.
It would be nice if these numbers were not fixed but adapted themselves to the hardware over time -- they could start with a much higher value, anyway.

Even 100 MB could be only 1/30th of the bandwidth of the device, so it's hard to make any fixed guesses here. Furthermore, the limit should always be high enough that a single thread can saturate the bandwidth if there is no contention -- the I/O scheduler could even detect whether there is any need to throttle at all.
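
Just to illustrate the idea (all names here are made up, this is not the actual IOScheduler interface): the limit could simply follow the bandwidth the device actually delivered recently, e.g.

    #include <algorithm>
    #include <sys/types.h>

    #include <OS.h>    // bigtime_t, microseconds

    // Sketch only: adjust the in-flight byte limit based on what the device
    // actually completed during the last measurement window.
    static off_t
    adapt_pending_limit(off_t currentLimit, off_t bytesCompleted,
        bigtime_t windowUsecs)
    {
        if (windowUsecs <= 0)
            return currentLimit;

        // bandwidth the device managed in the last window, in bytes/s
        off_t measured = bytesCompleted * 1000000LL / windowUsecs;

        // keep roughly 100 ms worth of that bandwidth in flight, but never
        // fall below the current 4 MB value, used as a floor here
        off_t target = std::max<off_t>(measured / 10, 4 * 1024 * 1024);

        // smooth the adjustment so one slow window doesn't collapse the limit
        return (currentLimit * 3 + target) / 4;
    }

Starting high and letting the measurements pull the value down (or up) would cover both the idle single-thread case and the contended one.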

* There is actually an opportunity for IOSchedulerSimple to merge adjacent
   IOOperations after sorting, which would be very useful for a device that has
   very few DMA buffers allocated. While not merging the operations could mean
   that overlapping operations fetch unnecessarily, they should be fetching from
   the cache on the device so v0v.

Not sure I understand what you mean by that.
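If the idea is to coalesce operations whose ranges turn out to be contiguous once sorted by offset, roughly like this (with a made-up operation type, not the real IOOperation, which also carries the DMA buffers that a real merge would have to combine):

    #include <algorithm>
    #include <vector>

    #include <sys/types.h>    // off_t

    struct operation {
        off_t   offset;
        size_t  length;
    };

    // Assumes 'ops' is already sorted by offset; touching or overlapping
    // ranges are folded into a single, larger operation.
    static void
    merge_adjacent(std::vector<operation>& ops)
    {
        std::vector<operation> merged;
        for (const operation& op : ops) {
            if (!merged.empty() && merged.back().offset
                    + (off_t)merged.back().length >= op.offset) {
                // extend the previous operation instead of adding a new one
                off_t end = std::max(merged.back().offset
                    + (off_t)merged.back().length,
                    op.offset + (off_t)op.length);
                merged.back().length = (size_t)(end - merged.back().offset);
            } else
                merged.push_back(op);
        }
        ops.swap(merged);
    }

Is that what you had in mind? The device's DMA restrictions would of course still limit how large such a merged operation may get.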

* IOSchedulerSimple makes up a 512KiB limit for each thread per round, so if
   one thread submits several megabytes of adjacent requests while the rest of
   the system is idle, those operations will be submitted in, for the worst
   case, 512KiB chunks.

That does sound like an easy fix :-)
Also, that number should be just as flexible as the one mentioned above.

* All of the work IOSchedulerSimple is doing is probably pretty meaningless on a
   flash device that doesn't have a spinning platter.

That depends. Our AHCI implementation is currently suboptimal, as it only uses a single request at a time -- it doesn't use the request queue provided by the device (I don't remember if that queue is really a queue or can be worked on in parallel -- probably sequentially, so changing that won't really change the situation). So latency, and with it fairness, is definitely an issue.

* Many call sites of IOScheduler::ScheduleRequest(IORequest*) just block on the
   request anyway, so the extra latency incurred by queueing and context
   switching between threads isn't worth it.

That really depends on the point above: if the request is small, you are probably right, at least when the device is an SSD. One could use a size threshold for those, or disable any fairness slow-downs once the device is fast enough.
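
As a sketch (these names are invented, they are not existing scheduler hooks), the decision could be as simple as:

    #include <cstddef>

    // Made-up threshold and helper: let small requests from callers that
    // block on completion anyway skip the queue and the extra context switch.
    static const size_t kBypassThreshold = 64 * 1024;    // made-up value

    static bool
    should_bypass_scheduler(size_t requestLength, bool callerBlocks,
        bool deviceIsFast)
    {
        // only worth it when the caller waits for the result anyway and the
        // device is fast enough that fairness throttling buys us nothing
        return callerBlocks && deviceIsFast
            && requestLength <= kBypassThreshold;
    }

The call site would then hand such a request to the driver directly instead of going through ScheduleRequest().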

* The single thread in IOSchedulerSimple may not be able to saturate most modern
   block devices even without the self throttling.

And that should definitely be fixed.

* The block device drivers seem to choose the block size of 512, and
   IOSchedulerSimple chooses 512 if the driver doesn't provide a non-zero value.
   This doesn't seem optimal to me, especially since it doesn't align with the
   block size of the filesystem, but I haven't done anything to experiment with
   this.

It should always work on the block size of the underlying device.
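
I.e. something along these lines (sketch only; the 512 fallback mirrors what IOSchedulerSimple does today when the driver reports nothing):

    #include <stdint.h>

    // Cut an operation on device block boundaries instead of assuming 512.
    static void
    align_to_device_blocks(uint64_t offset, uint64_t length,
        uint32_t deviceBlockSize, uint64_t& alignedOffset,
        uint64_t& alignedLength)
    {
        uint32_t blockSize = deviceBlockSize != 0 ? deviceBlockSize : 512;
        alignedOffset = offset - offset % blockSize;
        uint64_t end = offset + length;
        uint64_t alignedEnd = (end + blockSize - 1) / blockSize * blockSize;
        alignedLength = alignedEnd - alignedOffset;
    }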

* It seems that for some block device drivers, such as nvme_disk,
   IOSchedulerSimple is not used. Which seems appropriate.

That depends :-)
If the device is able to process the requests in parallel, then yes. Otherwise, the already mentioned issues come into play.

* Since low latency is likely a goal, it would probably make sense to prioritize
   reads ahead of writes (with some limitations).

Write latency could be equally important. I think this should depend entirely on the I/O priority (which IIRC we derive from the thread priority, which sounds logical to me).
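
A sketch of what that would mean for ordering (names invented for illustration; the priority value would come from the request, inherited from the submitting thread):

    #include <stdint.h>

    struct queued_request {
        int32_t priority;   // higher value = more urgent
        bool    isWrite;
        int64_t sequence;   // submission order, to keep FIFO among equals
    };

    // Returns true if 'a' should be serviced before 'b': priority decides
    // first, reads get only a mild preference among equal priorities.
    static bool
    schedule_before(const queued_request& a, const queued_request& b)
    {
        if (a.priority != b.priority)
            return a.priority > b.priority;
        if (a.isWrite != b.isWrite)
            return !a.isWrite;
        return a.sequence < b.sequence;
    }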

I found your larger Google Docs document too late, and I don't have time to read through it now, so please ignore any questions you already cleared up there.

1. Did you test on real hardware?
2. Did you disable kernel debug mode on Haiku?

Both should always be answered with yes when you do any performance optimizations (of course, you can test in a VM, too, but you cannot always trust those numbers, and the way I/O is handled might differ substantially).

3. If you are interested in disk bandwidth and disk latency, you should measure exactly that, and take as many components as possible (like the file system or the VM) out of the equation. This would give you a much clearer idea of the impact of your changes.
4. You can of course still test with different software and file systems to measure how much of your improvements ends up being visible in different situations.
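
For point 3, even something as small as this (the device path and sizes are just placeholders) already gives you the raw sequential read bandwidth without the file system in the way:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #include <OS.h>    // system_time()

    int
    main(void)
    {
        // placeholder path -- point this at the raw device you want to test
        int fd = open("/dev/disk/scsi/0/0/0/raw", O_RDONLY);
        if (fd < 0)
            return 1;

        const size_t kChunk = 1024 * 1024;
        const int kChunks = 1024;    // read 1 GiB in total
        char* buffer = (char*)malloc(kChunk);

        off_t total = 0;
        bigtime_t start = system_time();
        for (int i = 0; i < kChunks; i++) {
            ssize_t bytes = read(fd, buffer, kChunk);
            if (bytes <= 0)
                break;
            total += bytes;
        }
        bigtime_t elapsed = system_time() - start;
        if (elapsed == 0)
            elapsed = 1;

        printf("%.1f MB/s\n",
            total / (elapsed / 1000000.0) / (1024 * 1024));

        free(buffer);
        close(fd);
        return 0;
    }

The same loop with pread() at random block-aligned offsets gives you a first latency number, again without BFS or the VM involved.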

In any case, optimizations like those are highly appreciated, thanks for working on it! :-)

Bye,
   Axel.
