[recoll-user] Re: Fwd: Exclude all file but pdf?

  • From: jfd@xxxxxxxxxx
  • To: recoll-user@xxxxxxxxxxxxx
  • Date: Fri, 16 Sep 2011 17:16:19 +0200

Krisoijn Chan writes:
 > After playing with recoll and doing some research on Google, I found
 > http://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files
 > 
 > A simple script based on the link above is all I need.

Just for the record, there are several ways to index only pdf files with
Recoll, one of which would be to use the "indexedmimetypes" configuration
variable which exists just for this purpose. In ~/.recoll/recoll.conf:

indexedmimetypes = application/pdf
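
A minimal sketch of how that setting could be added from the shell, in case
it helps. The function name restrict_to_pdf is made up for this example, and
it assumes a plain recoll.conf without per-directory sections at the end (a
line appended after such a section would only apply to that directory):

```shell
# Hypothetical helper: append "indexedmimetypes = application/pdf" to the
# given Recoll configuration file, unless an indexedmimetypes line is
# already present (in which case the file is left untouched).
restrict_to_pdf() {
    conf=$1
    mkdir -p "$(dirname "$conf")"
    if grep -q '^[[:space:]]*indexedmimetypes[[:space:]]*=' "$conf" 2>/dev/null; then
        echo "indexedmimetypes already set in $conf, leaving it alone" >&2
        return 0
    fi
    printf 'indexedmimetypes = application/pdf\n' >> "$conf"
}

# Usage (commented out so that pasting the block has no side effects):
#   restrict_to_pdf "$HOME/.recoll/recoll.conf"
#   recollindex -z    # -z resets the index before reindexing
```

Resetting the index is probably the easiest way to get rid of documents of
other types that were indexed before the change.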

Another approach which would work but doesn't make much sense would be:
 find "$topdir" -name '*.pdf' | recollindex -i
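
Spelled out a bit more, as a sketch only: list_pdfs is a name invented for
this example, and the list is line-based, so file names containing newlines
would not survive the pipe.

```shell
# Hypothetical helper: print every regular file below the given directory
# whose name ends in .pdf or .PDF, one path per line.
list_pdfs() {
    find "$1" -type f \( -name '*.pdf' -o -name '*.PDF' \)
}

# Usage (commented out; needs recollindex and a configured index):
#   list_pdfs "$topdir" | recollindex -i
```

With -i, recollindex indexes only the files it is handed, so the selection
has to be repeated on every run; this is why the configuration variable is
the more sensible route.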

Using recoll would probably be simpler and offer richer functionality than the
"simple" script / grep combination.

Cheers,
jf


 > #######################################################################
 > # requires pdftotext
 > 
 > search_dir=$HOME/tmp/pdf
 > cache_dir=$HOME/tmp/cache/pdf2text
 > 
 > mkdir -pv "$cache_dir"
 > 
 > find "$search_dir" -type f -name '*.pdf' | while IFS= read -r file; do
 >   md5sum=$(md5sum "$file" | cut -d' ' -f1)
 >   file_sed=$(basename "$file" | sed -e 's/[^a-zA-Z0-9]/-/g')
 >   cache_file="$cache_dir/$file_sed-$md5sum"
 > 
 >   # run pdftotext only if the cache file does not already exist
 >   [ -e "$cache_file" ] || pdftotext "$file" "$cache_file"
 >   grep --color=always "$1" "$cache_file"
 > done
 > ########################################################################
 > 
 > marked it as solved then, thanks!
 > 
 > - kris

