[recoll-user] Re: peaks in IO that freeze the machine

  • From: Ramon Diaz-Uriarte <rdiaz02@xxxxxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Sun, 04 Sep 2011 20:08:03 +0200



On Fri, 2 Sep 2011 16:01:44 +0200,jfd@xxxxxxxxxx wrote:
> Ramon Diaz-Uriarte writes:
>  > Dear All,
>  > 
>  > I've been a very happy user of recoll. Lately, however, I've started to
>  > notice that I periodically get peaks in IO that freeze my systems.
>  > 
>  > 
>  > Details:
>  > 
>  > - I index my home directory (my recoll.conf is at the bottom) using real
>  >   time indexing. (I also monitor /usr/share/doc with another recollindex,
>  >   but that rarely causes trouble).  
>  > 
>  > - My current recoll version is 1.15.9-1 (as packed by Debian). 
>  > 
>  > - I am monitoring IO usage with iotop. 
>  > 
>  > - From time to time (e.g., when I delete/refile messages in my Maildir, or
>  >   when offlineimap deletes messages) I can see recollindex doing lots of
>  >   IO (iotop shows up to 99%). 
>  > 
>  > - This seems to lead to many other processes being left waiting (as shown
>  >   by top).  For instance, while answering email, Emacs freezes and you
>  >   cannot see what your are typing. Or the command line freezes if you are
>  >   doing a simple "ls".
>  > 
>  > - This can last from barely less than a second to a few seconds, but it is
>  >   very intrusive.

> There is currently nothing in recoll which would prevent it from using all
> the IO bandwidth available on the system (it does use setpriority() to
> lower its cpu priority). I'll add an ionice() call as soon as this will be
> available in libc.

> This has actually already been reported, see: 

> https://bitbucket.org/medoc/recoll/issue/46/recoll-needs-ionice-and-nice

> As far as I know, nothing has changed about the ionice availability as a
> system call, so the advice remains the same, use the ionice command to
> start recollindex, for example:

> ionice -c3 recollindex -m

> or (for a higher precedence, and as suggested by the previous reporter):

> ionice -c2 -n7 recollindex -m


Ah, thanks. I should have looked at the Issues.


Interestingly, I have some of the same issues as reported there, such as
the system being extremely slow to open (or close) a simple terminal.


Anyway, I tried with

ionice -c2 -n7 recollindex -m

and things only improved a little bit. I then tried with 

ionice -c3 recollindex -m

and for the last two days things seem mostly OK (though I still get the
occasional slow down). 

Interestingly, in at least two cases, while experimenting with deleting
large directories, recollindex died (without any trace of why in the log)
after being "ioniced" with -c3.




>  > - The log output shows all sorts of messages, many of them "No such file
>  >   or directory", but not all. In particular, it does not seem related (as
>  >   far as I can tell) to processing some very huge text file.

> After looking at the screen copy, I'd guess that these "no such file or
> directory" messages are related to dangling symlinks, and have no
> relationship to the IO problem. To know more about what's happening while
> recollindex is thrashing the system, you could set a logfilename, raise the
> loglevel to 4 and have a look at what recollindex is actually doing.


>  > - This problem happens in several machines, from a tiny Asus netbook to a
>  >   quad-core workstation. All run Debian GNU/Linux and recent stock Debian
>  >   kernels (e.g., 2.6.39).
>  > 
>  > I am attaching a screen capture that shows (left to right, top to bottom):
>  >   * tail -f recolltrace
>  >   * iotop -o
>  >   * iotop -o -a
>  >   * top
>  > 
>  > it shows a case where IO from recollindex seems to do lots of IO and a
>  > large fraction of processes (23%, but in other cases as many as 50%) are
>  > waiting, I think because of IO.
>  > 
>  > 
>  > What am I doing wrong? Is there a way to avoid these problems?

> Thanks for reporting this and for all the detail, I don't think that you
> are doing anything wrong. 

> I'm not too good at interpreting iotop output, but it seems to report 2M/s
> of write output, which seems awfully low to saturate the IO system, except
> if these are small and synchronous writes. It would be interesting to have
> the average size of writes in addition to the aggregate transfer rate.

> I think that another reason why a process might be blocked is when waiting
> for a page-in, which may happen if recollindex is eating a lot of memory.
> It would be interesting to know how big recollindex has become at this
> point. Another user has reported that recollindex becomes very big when
> deleting many documents, but I have not seen this happen myself. 


Some other issues related to memory, etc:

1. In a large workstation with 64 GB of RAM, I was seeing the same
problems, so memory is unlikely to be the problem.


2. In that workstation, with two disks, it seems I eliminated the problem
when I configured recoll to write the DBs to the second hard drive. So the
problem seems to be one of writing of the indices. 

3. (In the laptop) top rarely reports more than 1.5% of the memory being
used (total virtual size around 195000 kb) during "normal" operation.


4. I tried several operations of creating/deleteing large
directories. When indexing the newly added files, recollindex can get up
to a virtual size of around 1000000 kb, which is not too bad (not more
than 5% memory) which seems associated to processing very large text files
and flusing (log shows something like

:4:../rcldb/rcldb.cpp:1156:Db::add: docid 684513 added
[/home/ramon/test-recoll/JorgeMartin/NormaDATA Expresion 5.zip|NormaDATA
Expresion 5/An�lisis_MDA231-shRNALacZ-Doxy vs
MDA231-shRKH2eDZhMxLxwTFgtM1EmLA]

:4:../rcldb/rcldb.cpp:1180:Db::add: text size >= 10 Mb, flushing


or output in the log such as

:4:rclmonrcv.cpp:568:RclIntf::getEvent: unhandled event IN_IGNORED 0x8000
/home/ramon/.gconf/apps/nm-applet/ignore-ca-cer




5. I also see what you mention has been reported before: When deleting
many files, recollindex can really use a lot of memory. For instance, this
is the output from top


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
                        
29832 ramon     39  19 4071m 3.9g 2664 D   25 50.6  33:15.47 recollindex        
                         

and this was in the log:

log file
:4:../rcldb/rcldb.cpp:1392:Db::purgeFile: delete subdoc 684513


Apparently, large memory consumption is related to deletion of directories
that contain many large files rather than large directories with mostly
small files. I am not sure, though; peaks in memory usage are often
brief. 



> Also, depending on tuning. some systems will reclaim pages from
> sleeping processes when doing IO, even when there would be enough available
> memory to avoid it. This would signal itself by a "wakeup" time for
> interactive processes which have been sleeping, after recollindex has been
> active. I mention this for reference only, as you report having trouble
> while actively working.

> Unfortunately, I haven't looked at paging issues since I had to, when
> sunos4 replaced sunos3 in the 90's :), and I declare myself incompetent
> about the details of what to look at.

> I'd be quite interested to learn if the ionice approach improves the issue
> for you, I could then at least document it, until I can use a better
> integrated approach.

> And also the recollindex size.

What other type of info would be useful to report here?


Best, 


R.


> Cheers,

> jf


-- 
Ramon Diaz-Uriarte
Department of Biochemistry
Universidad Autónoma de Madrid
Spain

http://ligarto.org/rdiaz

Temporarily at:
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
 
Phone: +34-91-732-8000 ext. 3019
Fax: +-34-91-224-6972

Other related posts: