[contestms-dev] Fwd: Better Public Cloud support for CMS

  • From: Motiejus Jakštys <desired.mta@xxxxxxxxx>
  • To: contestms@xxxxxxxxxxxxx, contestms-dev@xxxxxxxxxxxxx
  • Date: Thu, 1 Oct 2015 09:05:37 +0100

+contestms-dev@

Please reply to this email (not the first one).

---------- Forwarded message ----------
Hi all,

Since last year, we in Lithuania have been running CMS on a public cloud
for everything except the finals. This year we want to include the
on-site finals as well. There are a few inconveniences that, if solved,
would make our lives easier. First, here is how we use CMS.

* We bootstrap PostgreSQL on a hosted database (AWS RDS). A few hours
before the contest we scale the database up to something expensive and
powerful; after the contest ends we downgrade it back to the minimal
size (see the sketch right after this list). This way the database
endpoint is always static and taken care of.
* Workers and CWSs are separate clusters spread across different
availability zones, also spawned a few hours before the contest. We have
two age groups (two contest IDs), which requires two sets of CWSs.
Workers go into a common pool spanning the AZs.
* The rest of the services (admin, queue, grading) always runs on a
single hand-configured server of medium performance, which we call
"management".

Reliability-wise, we are quite happy. The only SPOF is the "management"
server, but, since we can tolerate a few minutes of downtime while it
re-bootstraps, that is kind of OK. The most heavily loaded machines are
the CWSs. During the contest we watch their CPU usage and add extra
hosts when needed. Here is a 15x over-provisioned cluster from last
year, the first time we ran the contest on AWS [1].

As you can guess, the most annoying inconvenience is that the set of
servers changes during the contest. Their IPs are hard-coded into
cms.conf, so adding a new machine means changing cms.conf and restarting
the whole cluster. We have some primitive reconfiguration automation
now: servers periodically download cms.conf from a central location and,
if changes are detected, restart their services locally (a rough sketch
of such a loop is below). That still requires hand-modifying cms.conf
very carefully, and it requires knowing the server IPs. Ideally, we
would not have to touch cms.conf by hand at all when servers are added
or removed.
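
To make this concrete, the automation is essentially a poll-compare-restart
loop. A minimal sketch, assuming the central copy is served over HTTP and
the local CMS services can be bounced with a single command (the URL, path
and restart command are placeholders, not our real setup):

    import time
    import subprocess
    import requests

    CONF_URL = "http://config.example.org/cms.conf"  # hypothetical central copy
    CONF_PATH = "/usr/local/etc/cms.conf"

    def poll_once():
        new_conf = requests.get(CONF_URL).text
        try:
            with open(CONF_PATH) as f:
                old_conf = f.read()
        except IOError:
            old_conf = None
        if new_conf != old_conf:
            with open(CONF_PATH, "w") as f:
                f.write(new_conf)
            # Restart whatever runs the local CMS services; "cms" is a
            # placeholder unit name, not something CMS ships with.
            subprocess.call(["systemctl", "restart", "cms"])

    while True:
        poll_once()
        time.sleep(60)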

Ideally, the servers would discover each other automatically: you start
a server (or a Docker container), it knows its role, and it adds itself
to the cluster without any further hand-holding.
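
As an illustration of what I mean, with something like Consul a freshly
booted machine could announce its role to the local agent with one HTTP
call. A rough sketch (the service naming convention and the port are made
up for the example):

    import socket
    import requests

    def register(role, port):
        # Register this host's CMS role with the local Consul agent
        # (standard agent API endpoint; "cms-<role>" is our own convention).
        requests.put(
            "http://127.0.0.1:8500/v1/agent/service/register",
            json={
                "Name": "cms-" + role,  # e.g. "cms-Worker"
                "Address": socket.gethostbyname(socket.gethostname()),
                "Port": port,
            },
        )

    # On a worker instance, at boot:
    #   register("Worker", 26000)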

The solution I have in mind is to set up a separate etcd/consul cluster
to deal with this: it could generate cms.conf and restart the servers
when needed. Before I do this, I have a few questions:

1. Does this mechanism sound reasonable? Or, given some engineering
time, is there a better way to deal with this problem?
2. Is there anybody else interested in dynamic CMS clusters? Having
someone else to talk to would already be a benefit. I am thinking of
writing a design document. Besides the CMS core developers, are there
any users that would like to contribute/participate?
3. Maybe someone wants to volunteer an implementation? It is a cool
project, and I wouldn't be surprised if someone did. My time is very
limited, otherwise I would do it myself, but I can happily consult,
review and provide feedback. If no-one wants to do it, I'll look for an
interested high-school student to take it on as a small final-year
project (or something like that).
4. Is this something upstream is interested in merging? It could be
anything from a completely separate project that just generates cms.conf
and generically restarts cmsResourceService, to a built-in service in
the CMS code-base (which would mean a dependency on one of
consul/etcd/zookeeper/etc.). A very rough sketch of the cms.conf
generation side is included after the last question. What do the core
developers think?

... not fully related ...
5. Is it possible to have cmsChecker, cmsEvaluationService and
cmsScoringService on >1 node at a time?
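
Regarding the "separate project" end of question 4, the generation side
could be as small as a script that asks the discovery backend who is alive
and rewrites the core_services section of cms.conf. A very rough sketch
against Consul's catalog API (the service naming, the template handling and
the role list are hand-waved assumptions):

    import json
    import requests

    CONSUL = "http://127.0.0.1:8500"
    ROLES = ["Worker", "ContestWebServer", "EvaluationService",
             "ScoringService"]

    def discovered_core_services():
        core = {}
        for role in ROLES:
            # /v1/catalog/service/<name> lists every registered instance.
            instances = requests.get(
                CONSUL + "/v1/catalog/service/cms-" + role).json()
            core[role] = [[i["ServiceAddress"] or i["Address"],
                           i["ServicePort"]] for i in instances]
        return core

    # Merge the discovered hosts into the static parts of cms.conf.
    with open("cms.conf.template") as f:  # hypothetical template
        conf = json.load(f)
    conf["core_services"].update(discovered_core_services())
    with open("cms.conf", "w") as f:
        json.dump(conf, f, indent=4)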

Thanks,
Motiejus Jakštys

[1]:
https://scontent-ams3-1.xx.fbcdn.net/hphotos-xtp1/t31.0-8/10712382_1582190968693324_5182991449106511357_o.png


--
Motiejus Jakštys
