Axel Dörfler wrote on Sat, 03 May 2008 13:13:26 +0200 CEST:
> "Alexander G. M. Smith" <agmsmith@xxxxxxxxxx> wrote:
> > The applications (or a background thread running a Translator)
> > would have to be adapted to write a standard "META:keywords"
> > attribute with all the keywords for a modified document.
> >
> > The file system would need only a few changes. When adding an
> > attribute to the index, each word in the attribute is added
> > separately to the index, with a back pointer to the file as usual.
> > Thus the file is in the index multiple times. Deleting the
> > keywords is similar to adding. Searching (and live queries) is a
> > bit different, since you may get the same file multiple times. The
> > simple solution is to have user level code filter out the
> > duplicates.
>
> But that would result in a pretty bad solution: first of all what
> keywords? And second of all, the BFS query mechanism hasn't been
> written for arbitrary matches, it has been written for direct lookup.
> Using pattern matching (which Tracker does by default) makes searching
> anything really slow. Having many keywords put into those indices would
> slow it down considerably more.
> And it would still not even work like what a user expects (similarity
> search, ignoring diacritics, etc.).

I can answer most of those objections.

The application decides on the keywords. When it saves a file, it
also updates the META:keywords attribute (or META:keyw or whatever it
will be). It can also simplify the search by converting the keywords
to lower case without diacritical marks before saving them in the
attribute.

Searching is faster with every word in the index. If I look for
"META:keywords=happy*" then the index can quickly jump to the range of
entries starting with "happy" (it iterates from "happy" up to, but not
including, "happz"). It doesn't iterate through all files!

François suggested using the whole keyword string as the entry in the
index, as it is now.
To find happy things you'd search for the keyword anywhere in the
middle of the string, using "META:keywords=*happy*". This would indeed
have to iterate through all of the index, which would be slow. That's
the advantage of having each keyword added separately to the index.

I only see a couple of disadvantages to my idea. One is the need to
filter out multiple returns of the same file in a query. The other is
that there are three copies of each word: one in the original document,
one in the attribute and one in the index. A separate index server
(like Mac OS X Spotlight) reduces that to two copies of each word.

In any case, having a META:keywords attribute would be useful by itself
even if it is badly indexed. It's much faster to search through strings
in an index than to open all the files on the drive and look at their
attributes.

- Alex

P.S. I'd like to search my old e-mails for keywords, and grep is just
too annoyingly slow!
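
For anyone who wants to play with the idea, here is a rough sketch in
Python of the per-word index described above. It is only a toy model
under my own assumptions, not BFS code or the BFS query API: the
KeywordIndex class, its sorted list of (keyword, file name) pairs and
the normalize() helper are all made up for illustration. It shows the
three pieces: lower-casing and stripping diacritics before indexing,
jumping straight to the "happy" range for a prefix query, and filtering
out duplicate files at user level.

```python
import bisect
import unicodedata


def normalize(word):
    """Lower-case a keyword and strip its diacritical marks."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class KeywordIndex:
    """Hypothetical stand-in for a file system attribute index."""

    def __init__(self):
        # Sorted list of (keyword, filename) pairs; a real index would be
        # a B+tree, but a sorted list has the same ordering property.
        self.entries = []

    def add(self, filename, keywords):
        # One entry per word, each with a back pointer to the file, so the
        # same file appears in the index several times.
        for word in keywords.split():
            entry = (normalize(word), filename)
            pos = bisect.bisect_left(self.entries, entry)
            if pos == len(self.entries) or self.entries[pos] != entry:
                self.entries.insert(pos, entry)

    def prefix_query(self, prefix):
        """Return files having a keyword that starts with prefix."""
        prefix = normalize(prefix)
        # Jump directly to the first entry >= prefix instead of scanning.
        start = bisect.bisect_left(self.entries, (prefix,))
        files = []
        for word, filename in self.entries[start:]:
            if not word.startswith(prefix):
                break  # past the end of the prefix range
            if filename not in files:  # user-level duplicate filter
                files.append(filename)
        return files


index = KeywordIndex()
index.add("mail1", "Happy birthday Café")
index.add("mail2", "happiness happy happily")
index.add("mail3", "sad news")
print(index.prefix_query("happy"))
```

A "*happy*" style search has no such shortcut in this model either: it
would have to look at every entry in the list, which is the slow case
mentioned above.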