[contestms-dev] What's up with cms-dev - biweekly edition

  • From: Stefano Maggiolo <s.maggiolo@xxxxxxxxx>
  • To: contestms-dev <contestms-dev@xxxxxxxxxxxxx>
  • Date: Thu, 15 Oct 2015 00:48:58 +0100

Hello!

One piece of (very reasonable) feedback I received was that our
development is not transparent enough. One problem is that development
depends heavily on the availability of time, and also on what feels
most gratifying at the moment - again, this is a volunteer project, so
I believe that's fair; but it definitely doesn't allow us to do much
planning. I'd like to improve on that as well, but let's see.

For the moment, I'd like to give a periodic account of what I've been
working on over the last few weeks, more or less, along with some
outlook on what I'd like to touch next. Please follow up in this
thread with your own account, if you want!

So, on my side, my time lately has gone mostly into reviewing several
PRs, and into organizing and solving some of the problems that arose
at the IOI, mostly ES's poor performance.

*** Post IOI changes and reviewing - CALL TO ACTION!

I still have a few changes to merge after the IOI, mostly because we
are lacking expert reviewers - if you have experience with the CMS
code, please help us by looking through the recent open PRs and
leaving a comment about problems, or just to say it looks good.

For example:
https://github.com/cms-dev/cms/pull/483
https://github.com/cms-dev/cms/pull/452
https://github.com/cms-dev/cms/pull/447

*** ES poor performance

If many Workers are present, ES starts using 100% CPU, spent mostly
in SQLAlchemy code. Note that since the time is spent in actual Python
code (Postgres is much faster than Alchemy), the greenlets don't help
us go beyond 100% CPU. This is not so much Alchemy's fault as it is
the fault of the way we're using it.
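
To see why the greenlets don't buy us anything here, consider this
minimal sketch (not CMS code): a pure-Python loop holds the GIL and
never yields to the gevent hub, so CPU-bound greenlets simply run one
after another on a single core.

    # Minimal sketch (not CMS code): CPU-bound greenlets run
    # sequentially, so the process never goes past one core.
    import time
    import gevent

    def busy(n):
        # pure-Python loop: never yields to the hub
        total = 0
        for i in range(n):
            total += i
        return total

    start = time.time()
    jobs = [gevent.spawn(busy, 10**7) for _ in range(4)]
    gevent.joinall(jobs)
    # the elapsed time is roughly 4x that of a single busy(10**7) call
    print("elapsed: %.2fs" % (time.time() - start))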

There are two paths here. We can decide to mitigate the problem, or to
solve it entirely.

1. To mitigate it, we just need to make ES more efficient. Possible
solutions:

a. do not talk to Alchemy every time we get a response from a Worker,
but only once in a while (for example, once we have all the results
for a submission) - see the sketch after this list;

b. remove all the checks for "operation already executed" - we will
get the occasional ignorable integrity error, which we already get
anyway, since the check happens on insertion, not on extraction; we
might also offload the check to the Workers, if we want to;

c. a suggestion we hear from time to time is to use long-standing
sessions;

d. ?! please add your suggestions!
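
To make a. and b. a bit more concrete, here is a rough sketch of what
the buffering could look like. The names (Session, EvaluationResult
and so on) are hypothetical placeholders, not the actual ES code:

    # Rough sketch of options a. and b., with hypothetical names -
    # this is not the actual ES code.
    from collections import defaultdict
    from sqlalchemy.exc import IntegrityError

    # results buffered in memory, keyed by submission id (option a.)
    pending_results = defaultdict(dict)

    def on_worker_response(submission_id, testcase, outcome, expected):
        # no Alchemy work here, just accumulate in memory
        pending_results[submission_id][testcase] = outcome
        if len(pending_results[submission_id]) == expected:
            write_results(submission_id, pending_results.pop(submission_id))

    def write_results(submission_id, results):
        # one session and one commit per submission, instead of one
        # per Worker response
        session = Session()
        try:
            for testcase, outcome in results.items():
                session.add(EvaluationResult(submission_id=submission_id,
                                             testcase=testcase,
                                             outcome=outcome))
            session.commit()
        except IntegrityError:
            # option b.: no "already executed" check up front; a
            # duplicate insertion violates the unique constraint and
            # can simply be ignored
            session.rollback()
        finally:
            session.close()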

I wrote a script to test how fast ES is, and tried out these three
possibilities (well, except for the offload to the Workers) - a. and
b. look very promising, whereas c. is almost useless.
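
For reference, the kind of measurement involved is roughly the
following; this is a self-contained toy comparison, not the actual
test script:

    # Toy benchmark (not the actual test script): one commit per
    # Worker response versus one commit per batch.
    import time
    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class Result(Base):
        __tablename__ = "results"
        id = Column(Integer, primary_key=True)
        outcome = Column(String)

    engine = create_engine("sqlite://")  # stand-in for Postgres
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)
    N = 2000

    # one commit per "response" (roughly the current behavior)
    session = Session()
    start = time.time()
    for _ in range(N):
        session.add(Result(outcome="ok"))
        session.commit()
    session.close()
    print("per-response commits: %.2fs" % (time.time() - start))

    # one commit for the whole batch (option a.)
    session = Session()
    start = time.time()
    for _ in range(N):
        session.add(Result(outcome="ok"))
    session.commit()
    session.close()
    print("one batched commit:   %.2fs" % (time.time() - start))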

2. Really scalable solutions.

a. offloading to the Workers the part where we write the result to
the DB would probably make ES scale much further (maybe not
infinitely, but close); on the other hand we would have some tricky
synchronization issues, and the submission-handling logic would be
spread between ES and the Workers;

b. sharding ES! a relatively easy way would be to let each ES
instance handle a subset of the submissions and a subset of the
Workers (see the sketch after this list); a more elegant one would
put another component in the middle, negotiating access to the
Workers.
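
As a back-of-the-envelope illustration of the "easy" sharding, each ES
instance could be started with a shard number and simply ignore the
submissions and Workers that belong to other shards. The helpers below
are hypothetical; nothing like this exists in CMS today.

    # Hypothetical sketch of static sharding - not existing CMS code.
    NUM_SHARDS = 4

    def shard_for_submission(submission_id):
        # each ES instance only evaluates the submissions of its shard
        return submission_id % NUM_SHARDS

    def workers_for_shard(shard, worker_names):
        # partition the Workers round-robin, so every ES instance
        # gets a disjoint pool
        return [name for i, name in enumerate(sorted(worker_names))
                if i % NUM_SHARDS == shard]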

Honestly, I believe we can postpone these solutions, and that the
mitigations in 1. will be enough. Hopefully, having a tool to test and
keep track of ES's speed will let us make sure of that.

*** Next steps

I plan to continue investigating ES a bit more, and then to propose
patches that mitigate the problem and tools to keep track of
regressions.

Another possible future direction is to redesign AWS's UI a little
bit while waiting for the real refactoring William is working on.
Even if this means some duplication of work, I'm not sure how soon we
might have the new AWS, and the current UI is barely usable.

Cheers,
Stefano
