[contestms] Re: Handling large number of total users/submissions

  • From: Artem Iglikov <artem.iglikov@xxxxxxxxx>
  • To: contestms@xxxxxxxxxxxxx
  • Date: Fri, 2 May 2014 12:38:55 +0600

Hello again. It seems that a large number of evaluated submissions is a real
problem for ES and, consequently, for AWS. It doesn't seem to affect the
evaluation part of ES, but rather the "presentation" part: to calculate the
number of different submissions, it loads the evaluation results of all of
them from the database.
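
To illustrate the difference, here is a minimal sketch that lets the database
do the counting with an aggregate query instead of loading every evaluation
result into memory. It assumes the counts are per submission status. The table
and column names are hypothetical and do not match the CMS schema, and sqlite3
stands in for PostgreSQL only to keep the example self-contained.

```python
import sqlite3


def count_by_status(conn):
    """Return {status: count}, computed by the database in a single query."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM submission_results GROUP BY status"
    )
    return dict(rows.fetchall())


# Hypothetical table standing in for the real evaluation results.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submission_results (id INTEGER PRIMARY KEY, status TEXT)"
)
conn.executemany(
    "INSERT INTO submission_results (status) VALUES (?)",
    [("evaluated",)] * 5 + [("compiling",)] * 2 + [("evaluating",)] * 3,
)

counts = count_by_status(conn)
print(counts)
```

With this approach only one row per status travels from the database to the
service, so the cost of a page refresh should no longer grow with the memory
needed to hold every result.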

Just a reminder: we have a contest starting on May 3 at 00:00 UTC, and it
will last all day. If any of the core developers could be online, I
would appreciate it.


On Wed, Apr 30, 2014 at 1:52 PM, Artem Iglikov <artem.iglikov@xxxxxxxxx> wrote:

> The evaluation is almost finished. I have about 22000 evaluated
> submissions, ES takes 2.6GB, SS takes 1.7GB of RAM.
>
>
> On Tue, Apr 29, 2014 at 11:26 PM, Artem Iglikov
> <artem.iglikov@xxxxxxxxx> wrote:
>
>> Hello again.
>>
>> Just to clarify, which one of the issues are we talking about?
>>
>> If it is the one that I mentioned several weeks ago, I have no access to
>> the hardware used on that system, and I cannot reproduce the issue on new
>> hardware. Because of this, I think it is something hardware-related
>> (slowness, bugginess, some relation to magnetic storms...).
>>
>> If we are talking about the issues I mentioned in this thread, then, as I
>> pointed out, running out of connections seems expected to me: by default
>> PostgreSQL allows 100 connections, and I had several instances of CWS and
>> many users.
>>
>> I have just run a quick stress test with default settings and couldn't
>> run out of connections. I probably have to fill the database a bit more.
>> Anyway, I will certainly do a lot of stress testing over the next several
>> days, and if I am able to reproduce the result, I'll send you logs.
>>
>> Also, I would like to note that #254 reproduces easily. I'm not sure
>> about the memory usage: I need to wait until the evaluation finishes (right
>> now ES takes 752m with 600 submissions evaluated and 24000 submissions
>> being evaluated).
>>
>> For now, the situation with database connections is as shown in the
>> attachment. The numbers do not change if I stop stress testing (though
>> perhaps I don't wait long enough after stopping). If you think there is
>> something unusual, I can send you the CWS logs, but could you give me your
>> public GPG key for them?
>>
>>
>>
>> On Tue, Apr 29, 2014 at 6:52 PM, Artem Iglikov
>> <artem.iglikov@xxxxxxxxx> wrote:
>>
>>> I was quite busy these days, so I haven't done any additional tests, but
>>> I'll try to reproduce the issue today with your patch applied. Thanks.
>>> On Apr 29, 2014 6:42 PM, "Luca Wehrstedt" <luca.wehrstedt@xxxxxxxxx>
>>> wrote:
>>>
>>>> I'd like to fix the database connections issue but neither I nor
>>>> Giovanni have been able to reproduce it. We need to diagnose it on your
>>>> system, I'm sorry.
>>>>
>>>> Could you please apply again the patch I'm attaching, manually start
>>>> CWS from the shell, redirecting its standard output & error to file,
>>>> reproduce the issue and send us that file? We need the stdout+stderr, as
>>>> it's the only place where a detailed access log is available (each request,
>>>> with URL and other info). Thanks for your help!
>>>>
>>>> Luca
>>>>
>>>> PS: the extreme memory use is also unexpected; as it may be related to
>>>> the connection issue I'll tackle it after we've solved this.
>>>>
>>>>
>>>> On Wed, Apr 23, 2014 at 9:36 AM, Artem Iglikov
>>>> <artem.iglikov@xxxxxxxxx> wrote:
>>>>
>>>>> Thank you guys for analysing the situation.
>>>>>
>>>>> Just a small correction: my estimate of 20000 submits is for the
>>>>> overall number of submits made during all virtual contests, which will be
>>>>> distributed more or less evenly over 1 or even 2 days.
>>>>>
>>>>> And stress testing from master doesn't work again :-) The last commit
>>>>> broke something in storing logs, I suppose.
>>>>>
>>>>> I did a stress test with 400 actors (4 instances of StressTest.py
>>>>> with 100 actors each), 12 workers, about 20000 submits in total and
>>>>> default database settings without PgBouncer, and it seemed fine except
>>>>> for these:
>>>>> - I got the exception from #254 a few times
>>>>> - I ran out of db connections (which was expected)
>>>>> - possibly because of the problems with db connections, two workers died.
>>>>> I saw that they were in the "compiling" state for about 10 minutes (though in
>>>>> their console I saw that the job was already done) and after that they
>>>>> became disabled. I guess they were not able to deliver the result back to
>>>>> ES and because of this ES kicked them (maybe I'm wrong)
>>>>> - the AWS overview page is not designed to handle a very large queue: each
>>>>> refresh forces ES to use 100% of CPU. This obviously shouldn't be the
>>>>> case during a real contest if everything goes as expected, but I'm
>>>>> going to fix this anyway; it is partially done here:
>>>>> https://github.com/artikz/cms/commit/50a1c3235a374bf5695178058c27c4e798c1f096
>>>>>
>>>>> Then I repeated the stress test, this time with the database
>>>>> max_connections limit raised to 200 (I also had to increase SHMMAX). The only
>>>>> problems left were #254 and the AWS overview page.
>>>>>
>>>>> All these problems seem minor to me except one: I couldn't bring a
>>>>> "dead" worker back to life. Restarting the worker doesn't help; it seems only
>>>>> restarting ES (or another core service) works. Is there a reason for this?
>>>>> Shouldn't the "dead" state of a worker be cleared when the worker reconnects?
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 22, 2014 at 10:26 PM, Giovanni Mascellani <
>>>>> mascellani@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> On 22/04/2014 18:23, Luca Chiodini wrote:
>>>>>> >> For instance, a few weeks ago Luca Chiodini complained on this
>>>>>> >> mailing list that StressTest had a problem, but I didn't have time to
>>>>>> >> check it out and most probably I won't in the near future.
>>>>>> >
>>>>>> > I did, but Artem has already fixed it with #265 [0]
>>>>>> > and now StressTest works fine.
>>>>>>
>>>>>> Yes, sorry, I remembered about that just after having sent my reply.
>>>>>>
>>>>>> Gio.
>>>>>> --
>>>>>> Giovanni Mascellani <giovanni.mascellani@xxxxxx>
>>>>>> PhD Student - Scuola Normale Superiore, Pisa, Italy
>>>>>>
>>>>>> http://poisson.phc.unipi.it/~mascellani
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Artem Iglikov
>>>>>
>>>>
>>>>
>>
>>
>> --
>> Artem Iglikov
>>
>
>
>
> --
> Artem Iglikov
>



-- 
Artem Iglikov
