[dokuwiki] Re: New meta data system

  • From: Michael Hamann <michael@xxxxxxxxxxxxxxxx>
  • To: dokuwiki <dokuwiki@xxxxxxxxxxxxx>
  • Date: Tue, 16 Nov 2010 11:33:17 +0100

Hi,

Excerpts from TNHarris's message of 2010-11-16 02:03:44 +0100:
> On 11/15/2010 05:48 AM, Michael Hamann wrote:
> >
> > Additionally we want to have a quick way to search for certain criteria
> > in meta files so e.g. tagging could be implemented easily without
> > additional structures. We had a look at the current DokuWiki indexer and
> > have some quite concrete ideas about how it can be used for metadata.
> 
> I'd be interested in seeing these "concrete ideas".
> 
> I'm working on making improvements to the indexer[1]. The indexer, I 
> think, should be a class with proper separation of concerns. And there 
> would be the option of completely replacing it. (With Xapian, for example).

The problem with Xapian is that, afaik, there is no pure PHP version of
Xapian, which means DokuWiki wouldn't run on normal webspace anymore if
we were using Xapian, so this doesn't seem to be an option. The only
indexer I know of that's powerful and available for PHP is Lucene. We
would need to do some tests in order to see if the memory consumption is
acceptable.

The problem with refactoring the indexer is that there are plugins like
docsearch (http://www.dokuwiki.org/plugin:docsearch) that are using the
indexer API. I'm not sure if there are other plugins; perhaps we could
check that and see which parts are affected (and perhaps also just
change these plugins).

I've also looked at the indexer and found some points where the
performance can be improved. My changes can be found at
https://github.com/michitux/dokuwiki; from what I've seen so far we've
changed different parts, so our changes probably won't conflict. I've
written a very simple benchmark that indexes a directory of pages
recursively; it can be found at
https://github.com/michitux/dokuwiki-benchmark-indexer. I've used a dump
of dokuwiki.org for testing, and in various tests I've seen that my
changes make the indexer around twice as fast as the current one. I
haven't pushed them to the main repository yet as I want to do some
further tests with different parts of the changes applied and different
settings, in order to see which parts really improve the performance and
which don't. The memory usage remained almost the same, although I could
imagine that one of my changes increases the memory usage, and I've just
seen that you once made exactly the opposite change in order to save
memory. Did you run tests showing that it really increases the memory
usage significantly? If you could have a look at these changes and
perhaps also run some tests, that would be really helpful. I also don't
know why _freadline() was introduced (instead of fgets()) and whether
there were reasons other than old PHP versions that didn't read a full
line when the length parameter was omitted.
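
For reference, this is roughly what I understand _freadline() to do; a
sketch from memory, not the exact code from the indexer:

    // Read one complete line, looping fgets() until a newline is seen,
    // which only matters where fgets() doesn't return the whole line.
    function _freadline($fh) {
        if (feof($fh)) return false;
        $ln = '';
        while (($buf = fgets($fh, 4096)) !== false) {
            $ln .= $buf;
            if (substr($buf, -1) == "\n") return $ln;
        }
        return ($ln == '') ? false : $ln;
    }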

> Right now I'm just looking at the tokenizer. Chinese and Japanese texts 
> need a morphological analyzer to find word breaks. I'm putting in an 
> option to use an external tokenizer such as MeCab[2]. I don't think it 
> should be a plugin. The function is too low-level and having to maintain 
> an action hook would make it difficult to change the indexer. And using 
> a different external tokenizer is just a matter of changing the exec 
> call. Although there is one PHP extension that I know of.

That might make sense, but as the tokenizer isn't called that often
anymore with your changes, a hook that gets the text and the stopwords
might be an option, too.
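
Something like the following is what I have in mind; the event name and
the default tokenizer function are made up for this sketch:

    // Hypothetical hook that lets a plugin (e.g. a MeCab wrapper) replace
    // the built-in tokenizer; event name and helper are not existing code.
    function idx_tokenize($text, &$stopwords) {
        $data = array('text' => $text, 'stopwords' => &$stopwords,
                      'tokens' => null);
        $evt = new Doku_Event('INDEXER_TOKENIZE', $data);
        if ($evt->advise_before()) {
            // no plugin handled the event, fall back to the default tokenizer
            $data['tokens'] = idx_tokenize_default($data['text'], $stopwords);
        }
        $evt->advise_after();
        return $data['tokens'];
    }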

> I have a question for Kazutaka Miyasaka, who rewrote the fulltext term 
> parser.
> 
> function ft_termParser($term, &$stopwords,
>                         $consider_asian = true,
>                         $phrase_mode = false) {
>      $parsed = '';
>      if ($consider_asian) {
>          // successive asian characters need to be searched as a phrase
>          $words = preg_split('/('.IDX_ASIAN.'+)/u', $term, -1,
>                        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
>          foreach ($words as $word) {
>              if (preg_match('/'.IDX_ASIAN.'/u', $word))
>                 $phrase_mode = true;
>              $parsed .= ft_termParser($word, $stopwords, false,
>                                       $phrase_mode);
>          }
>      } else {
> //... etc
> 
> In the foreach loop, $phrase_mode is changed if there is an Asian 
> character in the string, then ft_termParser is called recursively. But I 
> think it's supposed to only be changing the value for the call, and not 
> for successive iterations. Shouldn't it instead be... ?
> 
> $parsed .= ft_termParser($word, $stopwords, false,
>                $phrase_mode || preg_match('/'.IDX_ASIAN.'/u', $word));

Yes, I guess so too, but I'm not even sure we need that
"$phrase_mode ||" at all.

> But I'm not sure if I'm understanding the function correctly. 
> Additionally, some unit tests for the fulltext engine will help with 
> regressions as the indexer changes.

Yes, that would really be helpful. The problem is that we would need
some (or even a lot of) test data in order to do these tests.
Unfortunately the content on dokuwiki.org is still under a noncommercial
license and therefore can't be included in unit tests. If that changes
(I hope it will happen sooner rather than later), we could select a part
of it, including content in different languages, for the tests. Such a
test set could also be used to measure the indexer performance and to
verify that the indexer always produces the same output (unless it was
changed intentionally).

So back to the topic: I hoped my email would make it clear how we want
to use the indexer, but it seems it wasn't that clear, so I'll go
further into the details of what we discussed. Basically we don't want
to use all of the indexer functions (or change them to be more generic);
what we really want to reuse is the file format. For every meta property
that shall be indexed we probably need a flag that says whether the
single value, the keys of an array or the values of an array shall be
indexed. These properties shall be collected in an array that is passed
through an event in order to allow plugins to add their own indexes.
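
To make that a bit more concrete, the array could look roughly like this
(the property names, the flags and the event name are only illustrative,
nothing of this exists yet):

    // Illustrative only: which meta properties get indexed and how.
    $indexed_meta = array(
        'title'    => 'value',   // a single value
        'relation' => 'keys',    // index the keys of an array property
        'subject'  => 'values',  // index the values of an array, e.g. tags
    );
    // plugins could add their own entries through an event, e.g.:
    // trigger_event('INDEXER_METADATA_INDEX', $indexed_meta);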

For each of these properties a file is created that contains all values,
just like the word files, and then another index file like the "i"-files
we currently have, where for every value the numbers of all pages that
have that value are noted. Additionally a reverse index is created that
contains, for every page, all its values. As the count probably doesn't
matter, just storing linenumber:linenumber:linenumber:... should be
enough.

With these index files it should be possible to search in meta values
the same way it is currently done for the page text.
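
As an illustration, a lookup against such index files could look roughly
like this; the file names and the exact layout are only assumptions for
this sketch, nothing of it is implemented yet:

    // Sketch of a lookup in such a metadata index; file names and layout
    // are made up for this example.
    function metadata_lookup($key, $value) {
        global $conf;
        $base   = $conf['indexdir'].'/'.$key;
        $values = file($base.'_w.idx', FILE_IGNORE_NEW_LINES);
        $line   = array_search($value, $values);
        if ($line === false) return array();
        // the "i"-like file: line $line lists the page numbers, e.g. "3:17:42"
        $index  = file($base.'_i.idx', FILE_IGNORE_NEW_LINES);
        $pages  = file($conf['indexdir'].'/page.idx', FILE_IGNORE_NEW_LINES);
        $result = array();
        foreach (explode(':', $index[$line]) as $pid) {
            if ($pid !== '') $result[] = $pages[(int)$pid];
        }
        return $result;
    }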

I hope it's now a bit clearer what the idea of the new metadata system
is.

Michael

PS: I think we should work together on improving the indexer and should
make sure we don't create changes that conflict. Perhaps we could work
together in the same repository, if you want I could give you commit
access to mine.
-- 
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
