[recoll-user] Re: Memory usage for purging

  • From: Theo Wollenleben <alpha0x89@xxxxxxxx>
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Sun, 4 Sep 2011 13:54:54 +0200

On Monday, 22 August 2011, jfd@xxxxxxxxxx wrote:
> Theo Wollenleben writes:
>  > I moved a directory tree containing most of my indexed files. Updating
>  > the index almost doubled its size. Now there are 28.6 GB of index files
>  > under the directory xapiandb. At the end of the index update, while the
>  > status bar shows "Indexing in progress: Purge", the recoll process
>  > starts consuming all of the available memory until swapping to disk
>  > begins (recoll apparently needs more than 3 GB for purging my index). I
>  > tried to let it finish but eventually killed the recoll process after a
>  > few hours. Is there a way to purge the index without excessive memory
>  > usage?
> 
> It is normal that renaming the main directory would double the index size
> as the renamed files will be indexed as new before the purge phase will
> delete the old data. Recoll has no concept for renaming or moving
> files.

Is it also normal that the index is still twice as large, even after the 
purging has finished successfully?

> But I've really got no idea of why the purge phase is using a lot of
> memory. It is normally a simple loop to delete the documents that don't
> exist any more, just a repeated Xapian "delete" call.

I observed that the memory usage of the recoll process increased by a few
hundred kilobytes on average for every deleted document (that is, for every
"Db::purge: deleted document" message), which is about the size of the text to
be indexed per file.

> I'd like to have a better suggestion, but the only idea which comes to
> mind is to just delete the xapiandb directory and reindex. I do realize
> that regenerating a dozen GB of index is no fun, but I just have no other
> idea about what to do.

Since the easiest way is usually not the most fun, I instead hacked the file
rcldb/rcldb.cpp so that recoll deletes only a limited number of documents from
the index per run, and then ran the update procedure several times. While doing
so I made another observation. While recoll walks the directory tree I now get
messages "Indexing in progress: (Files [...]/46127) /[...]" on the status bar,
so I suppose there are 46127 documents in the index. This number was greater
before and decreased with every index update using the hacked rcldb.cpp. But
once the purge reaches document #38999, it always stops with the message

:5:../rcldb/rcldb.cpp:1350:Db::purge: document #38999 not found

So I'm stuck at 46127 documents (even when using the original rcldb.cpp),
though I have fewer than 30000 files to be indexed.
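
For reference, the idea behind the workaround can be sketched roughly as
follows. This is not the actual change to rcldb/rcldb.cpp, and it does not use
the Recoll or Xapian API: FakeIndex, purge_some, max_deletes and batch_size are
hypothetical names made up for illustration. The sketch also adds a commit
after every batch of deletions, which goes beyond the hack described above (the
hack only capped the number of deletions per run), on the unconfirmed
assumption that memory grows with the number of uncommitted deletions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a writable index (NOT the Recoll/Xapian API).
struct FakeIndex {
    std::vector<bool> present;     // present[i]: document i is still indexed
    std::size_t pending = 0;       // deletions buffered since the last commit
    std::size_t peak_pending = 0;  // high-water mark of buffered deletions

    explicit FakeIndex(std::size_t ndocs) : present(ndocs, true) {}

    void delete_document(std::size_t id) {
        present[id] = false;
        ++pending;
        if (pending > peak_pending) peak_pending = pending;
    }

    // Assumption: committing releases whatever was buffered for the deletions.
    void commit() { pending = 0; }
};

// Delete indexed documents whose file no longer exists, but at most
// max_deletes per call, committing after every batch_size deletions.
// file_exists must have at least as many entries as idx.present.
// Returns the number of documents deleted by this call.
std::size_t purge_some(FakeIndex& idx, const std::vector<bool>& file_exists,
                       std::size_t max_deletes, std::size_t batch_size) {
    std::size_t deleted = 0;
    for (std::size_t id = 0;
         id < idx.present.size() && deleted < max_deletes; ++id) {
        if (!idx.present[id] || file_exists[id]) continue;  // nothing to purge
        idx.delete_document(id);
        ++deleted;
        if (idx.pending >= batch_size) idx.commit();  // bound buffered work
    }
    idx.commit();  // flush the last, possibly partial, batch
    return deleted;
}
```

Calling purge_some repeatedly until it returns 0 corresponds to running the
index update several times; in this sketch the number of buffered deletions
never exceeds batch_size.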
