[recoll-user] Re: peaks in IO that freeze the machine

  • From: jfd@xxxxxxxxxx
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Fri, 2 Sep 2011 16:01:44 +0200

Ramon Diaz-Uriarte writes:
 > Dear All,
 > 
 > I've been a very happy user of recoll. Lately, however, I've started to
 > notice that I periodically get peaks in IO that freeze my systems.
 > 
 > 
 > Details:
 > 
 > - I index my home directory (my recoll.conf is at the bottom) using real
 >   time indexing. (I also monitor /usr/share/doc with another recollindex,
 >   but that rarely causes trouble).    
 > 
 > - My current recoll version is 1.15.9-1 (as packed by Debian). 
 > 
 > - I am monitoring IO usage with iotop. 
 > 
 > - From time to time (e.g., when I delete/refile messages in my Maildir, or
 >   when offlineimap deletes messages) I can see recollindex doing lots of
 >   IO (iotop shows up to 99%). 
 > 
 > - This seems to lead to many other processes being left waiting (as shown
 >   by top).  For instance, while answering email, Emacs freezes and you
 >   cannot see what your are typing. Or the command line freezes if you are
 >   doing a simple "ls".
 > 
 > - This can last from barely less than a second to a few seconds, but it is
 >   very intrusive.

There is currently nothing in recoll which would prevent it from using all
the IO bandwidth available on the system (it does use setpriority() to
lower its cpu priority). I'll add an ionice() call as soon as this will be
available in libc.

This has actually already been reported, see: 

https://bitbucket.org/medoc/recoll/issue/46/recoll-needs-ionice-and-nice

As far as I know, nothing has changed about the ionice availability as a
system call, so the advice remains the same, use the ionice command to
start recollindex, for example:

ionice -c3 recollindex -m

or (for a higher precedence, and as suggested by the previous reporter):

ionice -c2 -n7 recollindex -m


 > - The log output shows all sorts of messages, many of them "No such file
 >   or directory", but not all. In particular, it does not seem related (as
 >   far as I can tell) to processing some very huge text file.

After looking at the screen copy, I'd guess that these "no such file or
directory" messages are related to dangling symlinks, and have no
relationship to the IO problem. To know more about what's happening while
recollindex is thrashing the system, you could set a logfilename, raise the
loglevel to 4 and have a look at what recollindex is actually doing.

 > - This problem happens in several machines, from a tiny Asus netbook to a
 >   quad-core workstation. All run Debian GNU/Linux and recent stock Debian
 >   kernels (e.g., 2.6.39).
 > 
 > I am attaching a screen capture that shows (left to right, top to bottom):
 >   * tail -f recolltrace
 >   * iotop -o
 >   * iotop -o -a
 >   * top
 > 
 > it shows a case where IO from recollindex seems to do lots of IO and a
 > large fraction of processes (23%, but in other cases as many as 50%) are
 > waiting, I think because of IO.
 > 
 > 
 > What am I doing wrong? Is there a way to avoid these problems?

Thanks for reporting this and for all the detail, I don't think that you
are doing anything wrong. 

I'm not too good at interpreting iotop output, but it seems to report 2M/s
of write output, which seems awfully low to saturate the IO system, except
if these are small and synchronous writes. It would be interesting to have
the average size of writes in addition to the aggregate transfer rate.

I think that another reason why a process might be blocked is when waiting
for a page-in, which may happen if recollindex is eating a lot of memory.
It would be interesting to know how big recollindex has become at this
point. Another user has reported that recollindex becomes very big when
deleting many documents, but I have not seen this happen myself. 

Also, depending on tuning. some systems will reclaim pages from
sleeping processes when doing IO, even when there would be enough available
memory to avoid it. This would signal itself by a "wakeup" time for
interactive processes which have been sleeping, after recollindex has been
active. I mention this for reference only, as you report having trouble
while actively working.

Unfortunately, I haven't looked at paging issues since I had to, when
sunos4 replaced sunos3 in the 90's :), and I declare myself incompetent
about the details of what to look at.

I'd be quite interested to learn if the ionice approach improves the issue
for you, I could then at least document it, until I can use a better
integrated approach.

And also the recollindex size.

Cheers,

jf

Other related posts: