#14560: monitoring of maui builders
-------------------------+----------------------------
Reporter: korli | Owner: haiku-web
Type: bug | Status: new
Priority: normal | Milestone: Unscheduled
Component: Sys-Admin | Version: R1/Development
Resolution: | Keywords:
Blocked By: | Blocking:
Has a Patch: 0 | Platform: All
-------------------------+----------------------------
Comment (by mmlr):
A watchdog wouldn't really help in most cases. These are the things that
happen from time to time:
* KDLs: Right now they don't trigger an automatic reboot, as I usually get
around to checking them and it's nicer to be able to debug. When no one is
able to spend the time, we can set bluescreen false in the kernel settings
to remove the need for manual intervention. A watchdog would trigger in
these cases and would do pretty much the same thing.
* Stuck downloads: These seem to happen less often lately. Generally they
are hard to handle automatically, as the download sizes of the ports as
well as the speed of their source servers vary wildly, which makes a
simple timeout impractical. Instead, download progress would need to be
measured, and a download that is actually stuck should be restarted,
eventually using up all retries if it keeps getting stuck. This would mean
either in-sourcing the download process into HaikuPorter or checking
whether the currently used wget can be configured to do the same and
adding the appropriate parameters.
* Stuck package activation: There is a race condition somewhere in package
activation that occasionally leads to build packages not getting
activated. Unfortunately I haven't been able to investigate this further.
A timeout on the build package activation would be relatively simple
though.
* Stuck build process: A stuck build process is probably the most
difficult case to handle automatically, as there is no real universal way
to check for a progressing build. A simple timeout is again a rather poor
fit considering how long some of our larger packages tend to take to
build. A timeout for phases without any log output might work, but it'd
have to be rather long to not produce false positives, which would be
especially frustrating on such long running builds.
* Stuck virtio block: I've checked what's going on in the case that
prompted this ticket and it looks like the virtio block driver is stuck
and stopped processing requests. The system itself is still responsive,
but all disk IO to the build volume is blocked. The boot volume seems
fine, the logs are clean. It was not possible to attach the debugger to
the running git process so I entered the kernel debugger, which revealed
that the git thread is waiting for the virtio block driver to finish an IO
request.
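
The progress-based checks described above for downloads and log output
can be sketched roughly as follows. This is only an illustration of the
idea, not HaikuPorter code: the names StallDetector and run_with_retries
are hypothetical, and the progress value is assumed to be something cheap
to poll, such as the size of a partially downloaded file or of the build
log.

```python
import time


class StallDetector:
    """Reports a stall when a polled progress value stops changing.

    Hypothetical helper: unlike a fixed overall timeout, the limit only
    applies to the time since the last observed progress, so slow but
    advancing downloads or builds are left alone.
    """

    def __init__(self, stall_timeout, clock=time.monotonic):
        self.stall_timeout = stall_timeout  # seconds without progress
        self.clock = clock                  # injectable for testing
        self.last_value = None
        self.last_change = clock()

    def update(self, value):
        """Record the current progress value; return True if stalled."""
        now = self.clock()
        if value != self.last_value:
            # Progress was made, so restart the stall timer.
            self.last_value = value
            self.last_change = now
        return now - self.last_change >= self.stall_timeout


def run_with_retries(attempt, retries):
    """Call attempt() until it returns True or all retries are used up."""
    for _ in range(retries):
        if attempt():
            return True
    return False
```

A caller would poll update() with, for example, os.path.getsize() of the
file being written, kill the worker process once it returns True, and let
run_with_retries() start the next attempt. For builds the same detector
could watch the log size, with a much longer stall_timeout to avoid the
false positives mentioned above.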
The only case where a watchdog or a network based poll would trigger is a
KDL, and the automatic reboot it would presumably perform can instead be
done via the kernel settings. The other cases are more difficult, and so
far handling them case by case has worked out mostly OK. It may however
make sense to widen the group of people who can access the VMs via libvirt
so that more people could handle these cases when I am unavailable for
some time.
--
Ticket URL: <https://dev.haiku-os.org/ticket/14560#comment:2>
Haiku <https://dev.haiku-os.org>
The Haiku operating system.