Hi,

Excerpts from TNHarris's message of 2010-11-16 02:03:44 +0100:
> On 11/15/2010 05:48 AM, Michael Hamann wrote:
> >
> > Additionally we want to have a quick way to search for certain criteria
> > in meta files so that e.g. tagging could be implemented easily without
> > additional structures. We had a look at the current DokuWiki indexer and
> > have some quite concrete ideas how it can be used for metadata.
>
> I'd be interested in seeing these "concrete ideas".
>
> I'm working at making improvements to the indexer[1]. The indexer I
> think should be a class with proper separation of concerns. And there
> would be the option of completely replacing it. (With Xapian, for example).

The problem with Xapian is that, as far as I know, there is no pure PHP
version of Xapian, which means DokuWiki would no longer run on ordinary
web hosting if we used it, so it doesn't seem to be an option. The only
indexer I know of that is both powerful and available for PHP is Lucene.
We would need to run some tests in order to see whether its memory
consumption is acceptable.

The problem with refactoring the indexer is that there are plugins like
docsearch (http://www.dokuwiki.org/plugin:docsearch) that use the indexer
API. I'm not sure whether there are other such plugins; perhaps we could
check that and see which parts are affected (and perhaps also simply
update those plugins).

I've also looked at the indexer and found some places where performance
can be improved. My changes can be found at
https://github.com/michitux/dokuwiki; from what I've seen so far we've
changed different parts, so our changes probably won't conflict. I've
written a very simple benchmark that indexes a directory of pages
recursively; it can be found at
https://github.com/michitux/dokuwiki-benchmark-indexer. I've used a dump
of dokuwiki.org for testing, and in various tests I've seen that my
changes make the indexer around twice as fast as the current one.
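For illustration, a benchmark along those lines can be quite small. This is only a sketch, not the code in the repository above: the $indexPage callback stands in for the actual indexer call, and the .txt filter reflects DokuWiki's page file extension.

```php
<?php
// Sketch of a recursive indexing benchmark (illustrative only).
// Walks $dir recursively, runs $indexPage on every page file, and
// reports how many pages were indexed and how long it took.
function benchmark_index($dir, $indexPage) {
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    $count = 0;
    $start = microtime(true);
    foreach ($it as $file) {
        // DokuWiki stores pages as .txt files; skip everything else
        if ($file->getExtension() !== 'txt') continue;
        call_user_func($indexPage, $file->getPathname());
        $count++;
    }
    return array('pages' => $count, 'seconds' => microtime(true) - $start);
}
```

Plugging in the real indexing function for $indexPage and averaging over a few runs would then give comparable numbers for different sets of changes.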
I haven't pushed them to the main repository yet, as I want to run some
further tests with different subsets of the changes applied and with
different settings, in order to see which parts really improve
performance and which don't. Memory usage remained almost the same,
although I could imagine that one of my changes increases it, and I've
just seen that you once made exactly the opposite change in order to
save memory. Did you run any tests showing that it really increases
memory usage significantly? If you could have a look at these changes,
and perhaps also run some tests, that would be really helpful. I also
don't know why _freadline() was introduced (instead of fgets()), or
whether there were reasons other than old PHP versions that didn't read
a full line when the length parameter was omitted.

> Right now I'm just looking at the tokenizer. Chinese and Japanese texts
> need a morphological analyzer to find word breaks. I'm putting in an
> option to use an external tokenizer such as MeCab[2]. I don't think it
> should be a plugin. The function is too low-level and having to maintain
> an action hook would make it difficult to change the indexer. And using
> a different external tokenizer is just a matter of changing the exec
> call. Although there is one PHP extension that I know of.

That might make sense, but as the tokenizer isn't called that often
anymore with your changes, a hook that gets the text and the stopwords
might be an option, too.

> I have a question for Kazutaka Miyasaka, who rewrote the fulltext term
> parser.
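To make the hook idea concrete, here is a minimal, self-contained sketch. This is not actual DokuWiki code: the event name INDEXER_TEXT_TOKENIZE, the data layout, and the helper names are made up for illustration (in DokuWiki itself this would go through Doku_Event / trigger_event, as hinted in the comment).

```php
<?php
// Fallback tokenizer: plain whitespace split with stopword removal.
// Real CJK handling would need a morphological analyzer; this only
// illustrates where a hook could take over.
function idx_tokenize_default($text, $stopwords) {
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($words, $stopwords));
}

function idx_tokenize($text, $stopwords) {
    $data = array(
        'text'      => $text,
        'stopwords' => $stopwords,
        'tokens'    => null, // a handler may fill this in
    );
    // In DokuWiki this would roughly be:
    //   $evt = new Doku_Event('INDEXER_TEXT_TOKENIZE', $data);
    //   if ($evt->advise_before()) { /* run the default below */ }
    //   $evt->advise_after();
    // Here we call the default directly so the sketch stays runnable.
    if ($data['tokens'] === null) {
        $data['tokens'] = idx_tokenize_default($data['text'], $data['stopwords']);
    }
    return $data['tokens'];
}
```

A plugin wrapping an external tokenizer such as MeCab could then register a BEFORE handler, fill in $data['tokens'], and prevent the default, without the exec call being hard-coded in the indexer.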
>
>     function ft_termParser($term, &$stopwords, $consider_asian = true,
>                            $phrase_mode = false) {
>         $parsed = '';
>         if ($consider_asian) {
>             // successive asian characters need to be searched as a phrase
>             $words = preg_split('/('.IDX_ASIAN.'+)/u', $term, -1,
>                                 PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
>             foreach ($words as $word) {
>                 if (preg_match('/'.IDX_ASIAN.'/u', $word))
>                     $phrase_mode = true;
>                 $parsed .= ft_termParser($word, $stopwords, false, $phrase_mode);
>             }
>         } else {
>             //... etc
>
> In the foreach loop, $phrase_mode is changed if there is an Asian
> character in the string, then ft_termParser is called recursively. But I
> think it's supposed to only be changing the value for the call, and not
> for successive iterations. Shouldn't it instead be... ?
>
>     $parsed .= ft_termParser($word, $stopwords, false,
>                              $phrase_mode || preg_match('/'.IDX_ASIAN.'/u', $word));

Yes, I suspect that too, but I'm not even sure we need the
"$phrase_mode ||" part at all.

> But I'm not sure if I'm understanding the function correctly.
> Additionally, some unit tests for the fulltext engine will help with
> regressions as the indexer changes.

Yes, that would really be helpful. The problem is that we would need
some (or even a lot of) test data in order to run such tests.
Unfortunately the content on dokuwiki.org is still under a noncommercial
license and therefore can't be included in unit tests. If that changes
(and I hope it will sooner rather than later), we could select part of
it for tests, including content in different languages, which could also
measure indexer performance and verify that the indexer always produces
the same output (unless it was changed intentionally).

So back to the topic: I hoped my email would make clear how we want to
use the indexer, but apparently it wasn't that clear, so I'll go into
more detail about what we discussed.
Basically, we don't really want to use all of the indexer functions (or
change them to be more generic); what we actually want to reuse is the
file format. For every meta property that shall be indexed we probably
need a flag indicating whether it is a single value, the keys of an
array, or the values of an array. These properties shall be collected in
an array that is passed through an event, in order to allow plugins to
add their own indexes.

For each field to be indexed, a file is created with all values, just as
for words, and then another index file like the "i"-files we currently
have is created, where for every value the number of every page that has
that value is recorded. Additionally, a reverse index is created that
contains all values for every page. As counts probably won't matter,
just using linenumber:linenumber:linenumber:... should be enough. With
these index files it should be possible to search in meta values the
same way it is currently done for the text.

I hope it's now a bit clearer what the idea of the new metadata system
is.

Michael

PS: I think we should work together on improving the indexer and make
sure we don't create conflicting changes. Perhaps we could work in the
same repository; if you want, I can give you commit access to mine.
-- 
DokuWiki mailing list - more info at http://www.dokuwiki.org/mailinglist