[haiku-sysadmin] Infrastructure updates and next steps

From: "Alexander von Gluck IV" <kallisti5@xxxxxxxxxxx>
To: haiku-development@xxxxxxxxxxxxx, haiku-sysadmin@xxxxxxxxxxxxx
Date: Wed, 07 Nov 2018 19:33:54 +0000

Good afternoon,

A quick status update. We have the "basic" services running in a VM at Scaleway
at the moment while we catch our breath.

I've submitted a plan to the Inc. to "fix" the issues we've experienced this
year.

One large issue we have had (not just this year) has been hosting "Everything"
on a single big server. This has given us a big price break, but historically
we have had issues:

* Nobody wants to upgrade (for obvious reasons after this year :-) )
* Maintenance means outages... no way around it.
* Access to the servers at Hetzner to troubleshoot is limited.

This has been a bad combination which has impacted us over the long haul. Now
that things are more portable, I think it's time we start working smarter.

Outages of our package repositories are going to rapidly become more of an
issue going forward. We *really* need a reliable solution to make sure anyone
using Haiku has the best experience possible (especially as R1 comes up).

The solution I'm proposing is as follows:

2 x bare metal storage nodes running Ceph (each with a single 1TiB disk)
2 x bare metal compute nodes

The compute nodes will be active + hot standby. They will be configured
identically, but at any given time:
* one will run all of our infrastructure
* one will run builders in qemu, stored on our Ceph cluster over rbd.

With this configuration, we can move a lot closer to 100% uptime during
maintenance, and have a rollback plan:
* Upgrade hot-standby server + other maintenance
* Test hot-standby server to ensure it is working as expected.
* Shut down builders on it
* Swap active server to hot-standby via single DNS CNAME change
* Test functionality.
* Problems? Change back to previously active node.
* Success? Apply updates to previously active node and start builders on it.

This isn't a "Kubernetes" level of uptime, but it gives us options for
maintenance which don't involve a 100% outage, without all the stress of
full-blown Kubernetes. This also gives us potential upgrade paths to
Kubernetes in the future (since we'll already have the shared storage).
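
For illustration only, the maintenance swap above can be modelled as a small
Python sketch. Nothing here is real tooling: the Node and Cluster classes, the
maintenance() function and the healthy() callback are hypothetical names, and
swap() merely stands in for the single DNS CNAME change.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    upgraded: bool = False
    builders_running: bool = False


@dataclass
class Cluster:
    active: Node
    standby: Node
    log: List[str] = field(default_factory=list)

    def swap(self) -> None:
        # Stand-in for the single DNS CNAME change (no real DNS call here).
        self.active, self.standby = self.standby, self.active
        self.log.append(f"CNAME -> {self.active.name}")


def maintenance(cluster: Cluster, healthy: Callable[[Node], bool]) -> bool:
    """Run one maintenance cycle; return True if the swap stuck."""
    standby = cluster.standby
    standby.upgraded = True                  # upgrade hot standby + other maintenance
    if not healthy(standby):                 # test before touching live traffic
        cluster.log.append("standby failed pre-swap test; aborting")
        return False
    standby.builders_running = False         # shut down builders on it
    cluster.swap()                           # swap active server to hot standby
    if not healthy(cluster.active):          # test functionality
        cluster.swap()                       # problems? change back
        cluster.log.append("rolled back")
        return False
    # Success: update the previously active node and start builders on it.
    cluster.standby.upgraded = True
    cluster.standby.builders_running = True
    return True


if __name__ == "__main__":
    cluster = Cluster(active=Node("node-a"),
                      standby=Node("node-b", builders_running=True))
    ok = maintenance(cluster, healthy=lambda node: True)
    print(ok, cluster.active.name, cluster.log)
```

The point of the sketch is that the rollback path is just a second swap, so a
failed upgrade only costs the time it takes to flip the CNAME back.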

-- Alex
