Good afternoon,
A quick status update. We have the "basic" services running in a VM at Scaleway
at the moment while we catch our breath.
I've submitted a plan to the Inc. to "fix" the issues we've experienced this
year.
One large issue we have had (not just this year) has been hosting "Everything"
on a single big server. This has given us a big price break, but historically
it has caused problems:
* Nobody wants to upgrade (for obvious reasons after this year :-) )
* Maintenance means outages... no way around it.
* Access to the servers at Hetzner to troubleshoot is limited.
This has been a bad combination which has impacted us over the long haul. Now
that things are more portable, I think it's time we start working smarter.
Outages of our package repositories are going to rapidly become more of an
issue going forward. We *really* need a reliable solution to make sure anyone
using Haiku has the best experience possible (especially as R1 approaches).
The solution I'm proposing is as follows:
2 x bare metal storage nodes running Ceph (each with a single 1 TiB disk)
2 x bare metal compute nodes
The compute nodes will be active + hot standby. They will be configured
identically, but at any given time:
* one will run all of our infrastructure
* one will run builders in QEMU, with their storage on our Ceph cluster over RBD.
With this configuration, we can move a lot closer to 100% uptime during
maintenance, and have a rollback plan:
* Upgrade hot-standby server + other maintenance
* Test hot-standby server to ensure it works as expected.
* Shut down the builders on it
* Swap active server to hot-standby via a single DNS CNAME change
* Test functionality.
* Problems? Change back to previously active node.
* Success? Apply updates to previously active node and start builders on it.
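To make the ordering of the steps above concrete, here is a rough Python sketch of the swap-and-rollback logic. It is only an illustration, not our actual tooling: the Cluster class, the node names, and the upgrade/healthy callbacks are all hypothetical stand-ins, and the DNS CNAME change is modeled as simply swapping two fields. What happens when the pre-swap test fails, and leaving the builders stopped after a rollback, are my assumptions and not part of the plan itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cluster:
    """Hypothetical model of the two compute nodes (names are illustrative)."""
    active: str                 # node the service CNAME currently points at
    standby: str                # hot-standby node
    builders_on: Optional[str]  # node running the builders, None if stopped


def swap_with_rollback(cluster: Cluster,
                       upgrade: Callable[[str], None],
                       healthy: Callable[[str], bool]) -> bool:
    """Run the maintenance steps once; return True if the swap stuck."""
    old_active, candidate = cluster.active, cluster.standby

    # Upgrade the hot-standby and do any other maintenance on it.
    upgrade(candidate)

    # Test the hot-standby before touching anything user-facing.
    # Assumption: on failure we stop here and leave everything as it was.
    if not healthy(candidate):
        return False

    # Shut down the builders on the hot-standby.
    cluster.builders_on = None

    # The "single DNS CNAME change", modeled as swapping the two roles.
    cluster.active, cluster.standby = candidate, old_active

    # Test functionality after the swap.
    if not healthy(candidate):
        # Problems: change back to the previously active node.
        # Assumption: builders stay stopped until someone investigates.
        cluster.active, cluster.standby = old_active, candidate
        return False

    # Success: update the previously active node and start builders on it.
    upgrade(old_active)
    cluster.builders_on = old_active
    return True
```

In this model, a successful run leaves the two nodes with their roles exchanged and the builders on the freshly updated former active node, so the next maintenance window is the same procedure in the other direction.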
This isn't a "Kubernetes" level of uptime, but it gives us options for
maintenance which don't involve a 100% outage, without all the stress of
full-blown Kubernetes. This also gives us a potential upgrade path to
Kubernetes in the future, since we'll already have the shared storage.
-- Alex