François Revol wrote:
>> Ankur Sethi wrote:
>>
>>> (1) Create an initial index of all the data on the disk. This takes
>>> a *very* long time and consumes a large amount of CPU. Sometimes,
>>> seemingly at random, Spotlight will decide to build the entire
>>> index from scratch, and then there is very little you can do about it.
>>
>> This is one place where the BFS indexing shines. A userspace process
>> that tries to keep an index of data on the disk needs to reindex from
>> scratch every time it starts up if it wants to be absolutely sure its
>> database is up-to-date. The reason is that it cannot know how much
>> changed on disk while the process was not running. BFS does not have
>> this problem because, obviously, no attributes can ever be written
>> without BFS noticing.
>
> The problem is BFS cannot index string attributes larger than 255
> bytes; so this cannot be used for full content indexing.

Full text indexing does not work by storing the full content directly
in the index. Each term must be indexed independently, otherwise the
lookup won't benefit much from the index. Of course, repeated terms
are only added once, and words that are too common or too short are
ignored. This means that the size limit is not the big problem. What
is needed is a mechanism for storing multiple strings in each
attribute.

> Now, I don't see full content indexing as really mandatory. I believe
> well weighted keyword extraction should be enough (there is already a
> META:keyw attribute defined somewhere, for People files IIRC).

Keyword and label-type attributes also need multiple strings. Simply
setting META:keyw to "christmas holiday france" is not good enough,
because the index then holds one entry for the whole string rather
than one per keyword. Queries for a single keyword then need to use
wildcards, which will give terrible performance. Generally, when
searching an index, a wildcard at the end of the pattern is ok, but a
wildcard at the start means that you have to sequentially scan the
whole index.
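
To make the two points above a bit more concrete, here is a rough Python sketch. It is only an illustration and not how BFS stores its indices: the sorted list stands in for a real B+tree, and the names TermIndex, extract_terms, STOP_WORDS and MIN_LENGTH (and their values) are invented for the example. It extracts each term once, skips short and common words, and shows why a trailing wildcard can use the sort order while a leading wildcard cannot.

```python
import bisect
import re

# Hypothetical stop list and minimum length; a real indexer would tune both.
STOP_WORDS = {"the", "and", "for", "that", "with", "this"}
MIN_LENGTH = 3


def extract_terms(text):
    """Return the distinct, lower-cased terms worth indexing."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) >= MIN_LENGTH and w not in STOP_WORDS}


class TermIndex:
    """Sorted list of (term, path) pairs, standing in for a B+tree index."""

    def __init__(self):
        self.entries = []

    def add(self, path, text):
        # One index entry per distinct term, not one for the whole text.
        for term in extract_terms(text):
            bisect.insort(self.entries, (term, path))

    def starts_with(self, prefix):
        """Trailing wildcard ("holi*"): binary search, then a short walk."""
        i = bisect.bisect_left(self.entries, (prefix,))
        found = set()
        while i < len(self.entries) and self.entries[i][0].startswith(prefix):
            found.add(self.entries[i][1])
            i += 1
        return found

    def ends_with(self, suffix):
        """Leading wildcard ("*mas"): every entry has to be examined."""
        return {path for term, path in self.entries if term.endswith(suffix)}


index = TermIndex()
index.add("/photos/xmas.jpg", "Christmas holiday in France, Christmas 2008")
index.add("/docs/trip.txt", "Holiday plans for the Alps")
print(sorted(index.starts_with("holi")))
print(sorted(index.ends_with("mas")))
```

Both lookups return the right files here; the difference is that starts_with() only touches the entries around the match, while ends_with() has to walk the whole list no matter how few entries match.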

> A "spotlight" like app would then just start several queries at once
> and merge relevant results.

Yes, for each extractor plugin there would be a search plugin. For
instance, the one for mp3 files knows that the attributes for artist,
album, title, year, etc. should be included in the query, and that
year is a number.
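
As a very rough sketch of what such a search plugin could look like (Python, purely for illustration): the class name, the attribute names and the query-string format below are all assumptions made up for the example, only loosely modelled on file system query strings, and run_query is a placeholder for whatever actually executes a query and returns paths.

```python
def escape(term):
    """Escape characters that would end the quoted string in the query."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


class Mp3SearchPlugin:
    """Hypothetical search-side counterpart of an mp3 extractor plugin."""

    # Attribute names are assumptions for illustration only.
    STRING_ATTRS = ("Audio:Artist", "Audio:Album", "Audio:Title")
    NUMBER_ATTRS = ("Audio:Year",)

    def build_query(self, term):
        # Trailing wildcard only, so the index can still be used.
        clauses = ['(%s=="%s*")' % (a, escape(term)) for a in self.STRING_ATTRS]
        # The plugin knows the year is numeric, so it only asks for it
        # when the search term is actually a number.
        if term.isascii() and term.isdigit():
            clauses += ["(%s==%d)" % (a, int(term)) for a in self.NUMBER_ATTRS]
        return "||".join(clauses)


def search(term, plugins, run_query):
    """Run each plugin's query and merge the paths, dropping duplicates."""
    merged = []
    seen = set()
    for plugin in plugins:
        for path in run_query(plugin.build_query(term)):
            if path not in seen:
                seen.add(path)
                merged.append(path)
    return merged


print(Mp3SearchPlugin().build_query("1999"))
```

The front end would call search() with one plugin per file type. Ranking the merged results is left out here; this version just keeps them in the order the plugins returned them.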