On Sat, May 23, 2009 at 9:33 AM, Ankur Sethi <get.me.ankur@xxxxxxxxx> wrote:
>
> I have been looking around for already available information retrieval
> libraries. The two major projects I found are CLucene and Xapian.
> CLucene is the more popular of the two, but I think building it will
> require GCC4. Xapian is under the GPL, so I don't know if that will be
> acceptable. The last option is, of course, writing one from scratch,
> which may not be a good idea given the project timeline.

In general I do not recommend rewriting something when there are already
good options. As Axel and Ingo have said, needing GCC4 should not be an
issue, so CLucene is probably a good choice. I know the Java version and
the Ruby clone (Ferret) are pretty popular.

> 2. The Indexing Daemon: Will keep the database in sync as files change
> on disk. It's starting to dawn on me that this might be an area that
> would require a lot of thought.

I have probably said this before, but I'll say it again: PLEASE be sure
that this daemon (or server, in the BeOS/Haiku vocabulary) does not
constantly grind the hard drive or otherwise impede the user experience.
It probably goes without saying, but my experience with indexing tools
on Windows and Linux has not been good. I hate hearing my hard drive
going crazy and feeling the system turn sluggish from all the IO. Mac OS
X Spotlight seems to do it right (probably no surprise, since the
creator of BFS, Dominic Giampaolo, worked on Spotlight:
http://daringfireball.net/2004/07/spotlight_on_spotlight). Haiku may
have some advantage in this area because of what BFS provides, as well
as the kernel's node monitoring services.

The trick will be finding the right balance between keeping the index
updated all the time (which will slow down the overall throughput and
responsiveness of the system) and doing periodic updates. It would
probably be a good idea to make this tunable in some way, so you (and
others) can test and find the sweet spot.
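To make the "tunable" idea a bit more concrete, here is a rough sketch of the kind of batching policy I have in mind. BatchQueue and its parameters are just names I made up for illustration; the real daemon would feed it from the node monitor and flush it from a low-priority thread:

```cpp
// Sketch of a tunable batching policy for the indexing daemon.
// Hypothetical names throughout -- this is an illustration, not
// the daemon's actual design.
#include <chrono>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

class BatchQueue {
public:
    BatchQueue(std::chrono::milliseconds window, std::size_t maxPending)
        : fWindow(window), fMaxPending(maxPending) {}

    // Called for every node-monitor notification; duplicate paths are
    // coalesced so a file touched many times is only indexed once.
    void Push(const std::string& path,
              std::chrono::steady_clock::time_point now) {
        if (fPending.empty())
            fFirstEvent = now;
        fPending.insert(path);
    }

    // True when the batch should be handed to the indexer: either the
    // tunable time window has elapsed or too much work has piled up.
    bool ShouldFlush(std::chrono::steady_clock::time_point now) const {
        if (fPending.empty())
            return false;
        return fPending.size() >= fMaxPending
            || now - fFirstEvent >= fWindow;
    }

    std::vector<std::string> Flush() {
        std::vector<std::string> batch(fPending.begin(), fPending.end());
        fPending.clear();
        return batch;
    }

private:
    std::chrono::milliseconds fWindow;  // tunable: latency vs. throughput
    std::size_t fMaxPending;            // tunable: cap on queued work
    std::set<std::string> fPending;
    std::chrono::steady_clock::time_point fFirstEvent;
};
```

The two knobs (the flush window and the pending cap) would be the things to expose for tuning, so people can experiment with the trade-off between up-to-date results and disk grinding.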
It might also be smart to eventually tie into the future power
management framework and perform batch updates when the user is
definitely idle (though it must be able to stop immediately when the
user returns). Giving your updating thread(s) a very low priority may be
fine too.

Another thing that would be cool is to have the database already set up
for a default Haiku install. That way the default applications,
documentation, etc. would be immediately indexed and searchable out of
the box. Combine this with the right tuning settings described above,
and the user may NEVER have to endure a long period of hard drive
scanning.

> The indexing daemon will have a set of plugins that will convert data
> from different file formats (PDF, ODF, DOC etc.) to a format
> compatible with the indexing library.

As I'm sure you are aware, this is just like Spotlight (and of course it
is a fine idea). The article I linked above mentions that Spotlight (and
apparently Google too) only cares about the first 100 KB of each file,
so that would probably be a good limit to impose on our system. There is
no reason to completely process multi-megabyte PDFs, etc.

> 3. A Set of Commandline Tools: Refreshing the database, querying the
> database from the commandline, forcing updates etc.
>
> 4. The GUI Front End: Full text indexing functionality will be
> integrated into the already existing search UI. I was also thinking of
> a very minimal UI that could be quickly brought up/dismissed with a
> simple keystroke.

Sounds good.

Regards,
Ryan
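P.S. To illustrate the 100 KB cap I mentioned: a plugin could truncate what it hands to the indexing library with something like the following. ReadIndexablePrefix is just a name I made up; the real code would live in the plugin layer and pass the buffer to a format-specific text extractor:

```cpp
// Sketch of the "index only the first 100 KB of each file" rule.
// ReadIndexablePrefix is a hypothetical helper, not existing code.
#include <fstream>
#include <string>

static const std::size_t kMaxIndexedBytes = 100 * 1024;

std::string ReadIndexablePrefix(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string buffer(kMaxIndexedBytes, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
    // Shrink to what was actually read, so small files are not padded.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;  // at most 100 KB, however large the file is
}
```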