[recoll-user] Re: Step-by-step walkthrough of interfacing Recoll with Searx.

From: Jean-Francois Dockes <jfd@xxxxxxxxxx>
To: recoll-user@xxxxxxxxxxxxx
Date: Sun, 08 May 2022 15:38:42 +0200

Hi,

Thanks for this interesting document!

The huge temporary files generated by the OCR are probably uncompressed PPM
images. pdftoppm was used by the original version of the code, because the
alternative tool, pdftocairo, which produces much smaller TIFF, was not
always available.

Recent versions of the handler use pdftocairo if it is available, so you should
check that the command is installed.
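
A quick way to check which of the two commands is present on a machine is sketched below. This is only an illustration of the preference order described above, not the handler's actual code; the function name `pick_pdf_renderer` is made up for the example.

```python
import shutil


def pick_pdf_renderer(candidates=("pdftocairo", "pdftoppm")):
    """Return the first candidate command found on PATH, or None.

    The default order (pdftocairo first, pdftoppm as fallback) mirrors the
    preference described in the message; it is an assumption about ordering,
    not a copy of the Recoll handler's logic.
    """
    for name in candidates:
        # shutil.which returns the full path of the executable, or None.
        if shutil.which(name) is not None:
            return name
    return None
```

Calling `pick_pdf_renderer()` and getting anything other than `"pdftocairo"` suggests that the command is missing from the PATH seen by the caller.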

In addition, the handler sometimes did not delete the temporary files after
errors. This is supposed to have been fixed since December 2021.
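
For illustration only, the general pattern that guarantees cleanup after an error looks like the sketch below. It is not the actual fix made in the handler; `ocr_with_cleanup` and its `render_and_ocr` callback are hypothetical names.

```python
import tempfile


def ocr_with_cleanup(render_and_ocr):
    """Call render_and_ocr(tmpdir) and always remove tmpdir afterwards.

    render_and_ocr is a hypothetical callable standing in for the page
    rendering and OCR steps. TemporaryDirectory deletes the directory and
    its contents when the 'with' block exits, including when the callable
    raises an exception.
    """
    with tempfile.TemporaryDirectory(prefix="ocr-sketch-") as tmpdir:
        return render_and_ocr(tmpdir)
```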

As you noticed, creating big indexes takes time. At some point, it's
definitely a good idea to split them up. The main reason from my point of
view is that we occasionally see indexes becoming corrupted, and needing a
rebuild.

I'm really not sure if it's best to serialize building multiple indexes, or
to use a certain amount of parallelism. This is probably highly dependent
on the hardware configuration.

The reason why trying parallelism is worth it despite recollindex itself being
multithreaded is that the actual Xapian index updating is single-threaded, and
takes a significant portion of the indexing time, so that updating multiple
indexes may lead to better CPU utilisation. This supposes that you have
sufficient hardware to avoid other issues (spinning disk thrashing, bad cache
utilisation if not enough RAM is available...).
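
A minimal sketch for comparing the serial and parallel approaches follows. It assumes one Recoll configuration directory per index and runs `recollindex -c <confdir>` for each; the function names and the `runner` parameter are made up for the example, and the right value of `max_workers` has to be found by experiment on the actual hardware.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_recollindex(confdir):
    """Run one indexer for the given configuration directory.

    Returns the exit status of the recollindex process.
    """
    return subprocess.run(["recollindex", "-c", confdir]).returncode


def build_indexes(confdirs, max_workers=1, runner=run_recollindex):
    """Run `runner` once per configuration directory.

    max_workers=1 serializes the builds; larger values let several
    indexers run at the same time. Returns {confdir: exit status}.
    """
    confdirs = list(confdirs)
    # Threads are enough here: each one only waits for an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(runner, confdirs))
    return dict(zip(confdirs, statuses))
```

For example, `build_indexes(["/path/to/confA", "/path/to/confB"], max_workers=2)` (placeholder paths) would update two indexes concurrently, while leaving `max_workers` at 1 gives the serialized behaviour for comparison.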

Some issues may only manifest themselves when the indexes are actually big,
so experimenting is long and difficult. The results are also mostly not
reproducible on different hardware configurations, so it's difficult to
accumulate experience on the subject.

Cheers,

jf

The Doctor writes:

Hello, everyone.

A few months back I did a writeup of how to use Recoll to index a fairly
large volume of data (on the order of multiple terabytes) and interface it
with Searx (https://github.com/searx/searx) for ease of use (as well as API
access). I don't remember if I posted it here or not, so if I did, I
apologize in advance. And if not, I hope that this is an essay that can
help new Recoll users.

https://drwho.virtadpt.net/archive/2022-02-08/using-recoll-to-index-my-hoard/

The Doctor [412/724/301/703/415/510]
WWW: https://drwho.virtadpt.net/
The old world is dying, and the new world struggles to be born. Now is the
time of monsters.
