[haiku-gsoc] Re: Full Text Search and Indexing -- Looking for opinions/comments

  • From: "François Revol" <revol@xxxxxxx>
  • To: haiku-gsoc@xxxxxxxxxxxxx
  • Date: Mon, 25 May 2009 20:44:26 +0200 CEST

> >> The indexing daemon will have a set of plugins that will convert
> >> data from different file formats (PDF, ODF, DOC etc.) to a format
> >> compatible with the indexing library.

Did you check the patch I posted some time ago about BaseTranslator?
Actually, I noticed a missing break somewhere, so beware.

> >
> > As I'm sure you are aware, this is just like Spotlight (and of course
> > is a fine idea.) In the article I linked above it mentioned that
> > Spotlight (and apparently Google too) only care about the first 100k
> > bytes of each file. So that would probably be a good limit to impose
> > on our system. No reason to completely process multi-megabyte PDFs,
> > etc.
> >
>
> I don't think that's a good idea, since many PDFs today are
> multi-megabyte files (because of embedded graphics, fonts and so on),
> and I still want to be able to find a word if it is in the last
> chapter of such a PDF.

Well, you'd only index the text extracted from it anyway, not the
embedded graphics and fonts...
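
To make that concrete, here is a rough Python sketch of applying a size
cap to the extracted text rather than to the raw file. Every name in it
(INDEX_TEXT_LIMIT, extract_text_fake_pdf, text_to_index) and the
"<text>...</text>" toy format are made up for illustration. None of it
comes from Haiku, BaseTranslator or the proposal, and the 100k figure is
only the number quoted above.

```python
import re

# Assumed cap, counted in bytes of *extracted* text, not raw file bytes.
INDEX_TEXT_LIMIT = 100 * 1024


def extract_text_plain(raw: bytes) -> str:
    """Stand-in for a plain-text plugin: the whole file is text."""
    return raw.decode("utf-8", errors="ignore")


def extract_text_fake_pdf(raw: bytes) -> str:
    """Toy stand-in for a PDF plugin (hypothetical format).

    Keeps only what sits between <text> and </text> markers and drops
    everything else, the way a real plugin would skip fonts and images.
    """
    chunks = re.findall(rb"<text>(.*?)</text>", raw, re.S)
    return b"\n".join(chunks).decode("utf-8", errors="ignore")


def text_to_index(raw: bytes, extract=extract_text_plain,
                  limit: int = INDEX_TEXT_LIMIT) -> str:
    """Extract the text first, then cap the extracted text."""
    text = extract(raw)
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # Cut on a byte budget; a partial trailing character is dropped.
    return encoded[:limit].decode("utf-8", errors="ignore")
```

With this ordering, a file made of a couple of megabytes of binary
payload followed by a short last chapter still gets that chapter
indexed, because the cap never sees the binary part.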

François.
