[haiku-development] Re: I/O Scheduler experiment
- From: Axel Dörfler <axeld@xxxxxxxxxxxxxxxx>
- To: haiku-development@xxxxxxxxxxxxx
- Date: Fri, 4 Oct 2019 10:43:07 +0200
On 25/09/2019 at 09:46, Kyle Ambroff-Kao wrote:
* IOSchedulerSimple tries to throttle itself by not submitting more than 4MB
of I/O operations at a time, but this is a made up number and the device may
have much more bandwidth than that.
As waddlesplash mentioned, this also prevents one thread from hogging
the disk. However, 4 MB really does not seem to be appropriate for SSDs
or even HDs anymore.
It would be nice if these numbers were not fixed but adapted to the
hardware over time -- they could start with a much higher value, anyway.
Even 100 MB might be only 1/30th of the device's bandwidth, so it's hard
to make any fixed guesses here. Furthermore, the limit should always be
high enough that a single thread can saturate the bandwidth when there
is no contention -- so the I/O scheduler could even detect whether
throttling is needed at all.
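A minimal sketch of such an adaptive budget (all names and constants here are made up for illustration, not from the Haiku tree): start with a generous per-round budget and adjust it from the bandwidth the device actually delivered in the previous round, with a floor at the current 4 MB value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: adapt the per-round I/O budget to the measured
// device bandwidth instead of a fixed 4 MB.
class AdaptiveBudget {
public:
    explicit AdaptiveBudget(uint64_t initial = 100ull * 1024 * 1024)
        : fBudget(initial) {}

    // Called once per scheduling round with the bytes the device
    // actually completed and the elapsed time of the round.
    void Update(uint64_t bytesCompleted, uint64_t elapsedUsecs)
    {
        if (elapsedUsecs == 0)
            return;
        // Measured bandwidth in bytes/second for this round.
        uint64_t bandwidth = bytesCompleted * 1000000 / elapsedUsecs;
        // Aim at a fraction of the measured bandwidth per round, but
        // never collapse below the old fixed limit.
        uint64_t newBudget = bandwidth / 10;
        fBudget = newBudget < kMinBudget ? kMinBudget : newBudget;
    }

    uint64_t Budget() const { return fBudget; }

private:
    static constexpr uint64_t kMinBudget = 4ull * 1024 * 1024;
    uint64_t fBudget;
};
```

The 1/10 fraction is arbitrary; the point is only that the budget follows the hardware rather than a compile-time guess.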
* There is actually an opportunity for IOSchedulerSimple to merge adjacent
IOOperations after sorting, which would be very useful for a device that
has very few DMA buffers allocated. While not merging the operations
could mean that overlapping operations fetch unnecessarily, they should
be fetching from the cache on the device so v0v.
Not sure I understand what you mean by that.
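For reference, the merge step described in the bullet above could look roughly like this (purely illustrative; `Op` is a stand-in, not Haiku's IOOperation class): after sorting by offset, adjacent or overlapping ranges are coalesced into one larger operation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for an I/O operation covering [offset, offset + length).
struct Op {
    uint64_t offset;
    uint64_t length;
};

// Sort by offset, then coalesce adjacent or overlapping ranges.
static std::vector<Op> MergeAdjacent(std::vector<Op> ops)
{
    std::sort(ops.begin(), ops.end(),
        [](const Op& a, const Op& b) { return a.offset < b.offset; });

    std::vector<Op> merged;
    for (const Op& op : ops) {
        if (!merged.empty()
            && op.offset <= merged.back().offset + merged.back().length) {
            // Extend the previous operation to cover this one.
            uint64_t end = std::max(
                merged.back().offset + merged.back().length,
                op.offset + op.length);
            merged.back().length = end - merged.back().offset;
        } else
            merged.push_back(op);
    }
    return merged;
}
```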
* IOSchedulerSimple makes up a 512KiB limit for each thread per round, so if
one thread submits several megabytes of adjacent requests while the rest
of the system is idle, those operations will be submitted in 512KiB
chunks in the worst case.
That does sound like an easy fix :-)
Also, that number should have similar flexibility as mentioned above.
* All of the work IOSchedulerSimple is doing is probably pretty meaningless on a
flash device that doesn't have a spinning platter.
That depends. Our AHCI implementation is currently suboptimal, as it
only issues a single request at a time -- it doesn't use the request
queue provided by the device (I don't remember whether that queue can be
worked on in parallel or only sequentially -- probably sequentially, in
which case using it wouldn't change the situation much). So latency, and
with it fairness, is definitely an issue.
* Many call sites of IOScheduler::ScheduleRequest(IORequest*) just block on the
request anyway, so the extra latency incurred by queueing and context
switching between threads isn't worth it.
That really depends on the point above: if the request is small, you are
probably right, at least when the device is an SSD. One could use a
threshold for those, or disable any fairness slow-downs once the device
is very fast.
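The threshold idea could be as simple as the following sketch (the constants and the function are invented for illustration): submit a request directly, bypassing the scheduler queue, when it is small or when the device is fast enough that fairness throttling costs more than it saves.

```cpp
#include <cassert>
#include <cstdint>

// Made-up thresholds, for illustration only.
static const uint64_t kFastDeviceBandwidth = 200ull * 1024 * 1024; // B/s
static const uint64_t kSmallRequestSize = 64 * 1024;               // bytes

// Decide whether to skip the scheduler queue and submit directly.
bool ShouldBypassScheduler(uint64_t requestSize, uint64_t deviceBandwidth)
{
    if (deviceBandwidth >= kFastDeviceBandwidth)
        return true;  // fast device: fairness slow-downs disabled
    // Small request: queueing/context-switch latency isn't worth it.
    return requestSize <= kSmallRequestSize;
}
```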
* The single thread in IOSchedulerSimple may not be able to saturate most modern
block devices even without the self throttling.
And that should definitely be fixed.
* The block device drivers seem to choose the block size of 512, and
IOSchedulerSimple chooses 512 if the driver doesn't provide a non-zero
value. This doesn't seem optimal to me, especially since it doesn't
align with the block size of the filesystem, but I haven't done anything
to experiment with this.
It should always work on the block size of the underlying device.
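Concretely, that means aligning transfers to whatever block size the device reports rather than assuming 512 bytes; hypothetical helpers for that might look like:

```cpp
#include <cassert>
#include <cstdint>

// Round an offset down to the device's block size.
uint64_t AlignDown(uint64_t offset, uint64_t blockSize)
{
    return offset - offset % blockSize;
}

// Round a transfer size up to a whole number of device blocks.
uint64_t AlignUp(uint64_t size, uint64_t blockSize)
{
    return (size + blockSize - 1) / blockSize * blockSize;
}
```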
* It seems that for some block device drivers, such as nvme_disk,
IOSchedulerSimple is not used. Which seems appropriate.
That depends :-)
If the device is able to process the requests in parallel, then yes.
Otherwise, the already mentioned issues come into play.
* Since low latency is likely a goal, it would probably make sense to prioritize
reads ahead of writes (with some limitations).
Write latency could be equally important. I think this should depend
entirely on the I/O priority (which IIRC we derive from the thread
priority, which sounds logical to me).
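In that scheme, ordering would come from the request's I/O priority, with the read/write distinction at most breaking ties; a toy comparator (names invented, not Haiku's actual scheduling code) could look like:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a queued request with a priority derived from the
// submitting thread's priority.
struct Request {
    int32_t priority;
    bool isWrite;
};

// Higher I/O priority first; read/write type only breaks ties.
bool HigherPriorityFirst(const Request& a, const Request& b)
{
    if (a.priority != b.priority)
        return a.priority > b.priority;
    return !a.isWrite && b.isWrite;
}
```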
I found your larger Google Docs document too late, and I don't have time
to read through it now, so please ignore any questions you already
cleared up there.
1. Did you test on real hardware?
2. Did you disable kernel debug mode on Haiku?
Both should always be answered with yes when you do any performance
optimizations (of course, you can test in a VM, too, but you cannot
always trust those numbers, and the I/O access path might differ
substantially).
3. If you are interested in disk bandwidth and disk latency, you should
measure exactly that, and eliminate as many components as possible (like
the file system or VM) from the equation. This would give you a much
clearer idea of the impact of your changes.
4. You can of course still test with different software and file systems
to measure how much of your improvements end up being visible in
different situations.
In any case, optimizations like those are highly appreciated, thanks for
working on it! :-)
Bye,
Axel.