[contestms] Re: Handling large number of total users/submissions

  • From: Artem Iglikov <artem.iglikov@xxxxxxxxx>
  • To: contestms@xxxxxxxxxxxxx
  • Date: Tue, 29 Apr 2014 23:26:26 +0600

Hello again.

Just to clarify, which of the issues are we talking about?

If it is the one that I mentioned several weeks ago, I have no access to
the hardware used on that system, and I cannot reproduce the issue on new
hardware. Because of this, I think it is something hardware-related
(slowness, bugginess, some relation to magnetic storms...).

If we are talking about the issues I mentioned in this thread, then, as I
pointed out, running out of connections seems expected to me: by default
PostgreSQL allows 100 connections, and I had several instances of CWS and
many users.

I have just run a quick stress test with default settings and couldn't run
out of connections. Probably I have to fill the database a bit more. Anyway,
I'll certainly do a lot of stress testing during the next several days,
and if I'm able to repeat the result, I'll send you the logs.

Also, I would like to note that issue #254 reproduces easily. I'm not sure
about the memory usage - I need to wait until the evaluation finishes (right
now ES is using 752 MB with 600 submissions evaluated and 24000 submissions
still being evaluated).

Right now the situation with database connections is as shown in the
attachment. The numbers do not change if I stop the stress testing (though
perhaps I haven't waited long enough after stopping). If you think there is
something unusual, I can send you the logs of CWS, but could you give me
your public GPG key for them?



On Tue, Apr 29, 2014 at 6:52 PM, Artem Iglikov <artem.iglikov@xxxxxxxxx> wrote:

> I was quite busy these past days, so I haven't done any additional tests,
> but I'll try to reproduce the issue today with your patch applied. Thanks.
>  On Apr 29, 2014 6:42 PM, "Luca Wehrstedt" <luca.wehrstedt@xxxxxxxxx>
> wrote:
>
>> I'd like to fix the database connections issue but neither I nor Giovanni
>> have been able to reproduce it. We need to diagnose it on your system, I'm
>> sorry.
>>
>> Could you please apply the attached patch again, manually start CWS
>> from the shell, redirecting its standard output & error to a file,
>> reproduce the issue and send us that file? We need the stdout+stderr, as
>> it's the only place where a detailed access log is available (each
>> request, with URL and other info). Thanks for your help!
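Concretely, the manual start with redirection would look something like this (the shard number and flags mirror the invocations visible in the ps output further down; the log filename is arbitrary):

```
# Start one CWS shard by hand, capturing stdout+stderr (which carry the
# detailed access log) to a file to send back.
cmsContestWebServer 0 -c 1 > cws-0.log 2>&1
```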
>>
>> Luca
>>
>> PS: the extreme memory use is also unexpected; as it may be related to
>> the connection issue I'll tackle it after we've solved this.
>>
>>
>> On Wed, Apr 23, 2014 at 9:36 AM, Artem Iglikov
>> <artem.iglikov@xxxxxxxxx> wrote:
>>
>>> Thank you guys for analysing the situation.
>>>
>>> Just a small correction: my estimate of 20000 submissions is the
>>> overall number made during all virtual contests, which will be
>>> distributed more or less evenly over 1 or even 2 days.
>>>
>>> And stress testing from master doesn't work again :-) The last commit
>>> broke something in storing logs, I suppose.
>>>
>>> I did a stress test with 400 actors (4 instances of StressTest.py
>>> with 100 actors each), 12 workers, about 20000 submissions in total and
>>> default database settings without PgBouncer, and everything seemed fine
>>> except for these:
>>> - I got the exception from #254 a few times
>>> - I ran out of DB connections (which was expected)
>>> - possibly because of the DB connection problems, two workers died. I
>>> saw that they were in the "compiling" state for about 10 minutes (though
>>> in their consoles I saw that the job was already done) and after that
>>> they became disabled. I guess they were not able to deliver the result
>>> back to ES, and because of this ES kicked them (maybe I'm wrong)
>>> - the AWS overview page is not designed to handle a very large queue:
>>> each refresh forces ES to use 100% of the CPU. This obviously shouldn't
>>> happen during a real contest if everything goes as expected, but I'm
>>> going to fix it anyway; partially done here:
>>> https://github.com/artikz/cms/commit/50a1c3235a374bf5695178058c27c4e798c1f096
>>>
>>> Then I repeated the stress test, this time with the database's
>>> max_connections limit raised to 200 (I also had to increase SHMMAX).
>>> The only problems left were #254 and the AWS overview page.
>>>
>>> All these problems seem minor to me except one: I couldn't bring a
>>> "dead" worker back to life. Restarting the worker doesn't help; it seems
>>> only restarting ES (or another core service) works. Is there a reason for
>>> this? Shouldn't a worker's "dead" state be cleared when it reconnects?
>>>
>>>
>>>
>>> On Tue, Apr 22, 2014 at 10:26 PM, Giovanni Mascellani <
>>> mascellani@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> Il 22/04/2014 18:23, Luca Chiodini ha scritto:
>>>> >> For instance, a few weeks ago Luca Chiodini complained on this
>>>> >> mailing list that StressTest had a problem, but I didn't have time to
>>>> >> check it out and most probably I won't in the near future.
>>>> >
>>>> > I did, but Artem has already fixed it with #265 [0]
>>>> > and now StressTest works fine.
>>>>
>>>> Yes, sorry, I remembered about that just after having sent my reply.
>>>>
>>>> Gio.
>>>> --
>>>> Giovanni Mascellani <giovanni.mascellani@xxxxxx>
>>>> PhD Student - Scuola Normale Superiore, Pisa, Italy
>>>>
>>>> http://poisson.phc.unipi.it/~mascellani
>>>>
>>>>
>>>
>>>
>>> --
>>> Artem Iglikov
>>>
>>
>>


-- 
Artem Iglikov
# netstat -nap | grep 5432 | grep EST | grep python | awk '{print $7}' | sort | uniq -c
      1 2534/python2
     10 2556/python2
     10 2570/python2
     10 2583/python2
      7 2606/python2
      9 2616/python2
      2 2633/python2
     20 3324/python2
# ps ax | grep 3324
 3324 pts/8    Sl+    9:14 /usr/bin/python2 /usr/local/bin/cmsEvaluationService
 4369 pts/25   S+     0:00 grep --color=auto 3324
# ps ax | grep 2583
 2583 pts/23   Sl+   11:38 /usr/bin/python2 /usr/local/bin/cmsContestWebServer 3 -c 1
 4371 pts/25   S+     0:00 grep --color=auto 2583
# ps ax | grep 2570
 2570 pts/21   Sl+   11:31 /usr/bin/python2 /usr/local/bin/cmsContestWebServer 2 -c 1
 4373 pts/25   S+     0:00 grep --color=auto 2570
# ps ax | grep 2556
 2556 pts/14   Sl+   11:39 /usr/bin/python2 /usr/local/bin/cmsContestWebServer 1 -c 1
 4375 pts/25   S+     0:00 grep --color=auto 2556
# ps ax | grep 2616
 2616 pts/10   Sl+   11:40 /usr/bin/python2 /usr/local/bin/cmsContestWebServer 0 -c 1
 4427 pts/25   S+     0:00 grep --color=auto 2616
# ps ax | grep 2606
 2606 pts/24   Sl+    2:04 /usr/bin/python2 /usr/local/bin/cmsContestWebServer 4 -c 1
 5590 pts/25   S+     0:00 grep --color=auto 2606
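The netstat/awk pipeline above can also be expressed in Python, which is easier to reuse in a monitoring script (the sample lines below are made up for illustration; feed the function real `netstat -nap` output):

```python
# Count ESTABLISHED PostgreSQL connections per owning process, like
# `netstat -nap | grep 5432 | grep EST | ... | sort | uniq -c` above.
from collections import Counter

def count_pg_connections(netstat_lines, port=5432):
    """Return a Counter mapping 'PID/program' -> number of ESTABLISHED
    connections whose remote endpoint is the given PostgreSQL port."""
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 7:  # skip headers and malformed lines
            continue
        remote, state, prog = fields[4], fields[5], fields[6]
        if state == "ESTABLISHED" and remote.endswith(":%d" % port):
            counts[prog] += 1
    return counts

# Illustrative sample, not real output from the system discussed here.
sample = [
    "tcp 0 0 127.0.0.1:40000 127.0.0.1:5432 ESTABLISHED 2556/python2",
    "tcp 0 0 127.0.0.1:40001 127.0.0.1:5432 ESTABLISHED 2556/python2",
    "tcp 0 0 127.0.0.1:40002 127.0.0.1:5432 ESTABLISHED 3324/python2",
]
print(count_pg_connections(sample))
```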




