[dokuwiki] Re: New meta data system

  • From: Michael Hamann <michael@xxxxxxxxxxxxxxxx>
  • To: dokuwiki <dokuwiki@xxxxxxxxxxxxx>
  • Date: Wed, 17 Nov 2010 12:08:07 +0100

Hi,

Excerpts from TNHarris's message of 2010-11-17 02:54:57 +0100:
> On 11/16/2010 05:33 AM, Michael Hamann wrote:
> >
> > I haven't pushed them to the main repository as I want to do some
> > further tests with different parts of the changes applied and
> > different settings in order to see which parts really improve the
> > performance and which don't. The memory usage remained almost the
> > same although I could imagine that one of my changes increases the
> > memory usage and I've just seen you once did the exactly opposite
> > change in order to save memory. Did you do some tests in order to
> > show it really increases the memory usage significantly? If you could
> > have a look at these changes and perhaps also run some tests that
> > would be really helpful. I also don't know why _freadline() has been
> > introduced (instead of fgets()) and if there were other reasons than
> > old PHP versions that didn't read a full line when the length
> > parameter was omitted.
> 
> At the time of b634459117 there were multiple bug reports of PHP running 
> out of memory. I just hacked at everything that looked moderately 
> hoggish until the errors went away. I think most of the reports were 
> coming from shared hosts where the memory limit was fairly small, less 
> than 32M I think in some cases. Even if w*.idx files are of manageable 
> size, the i*.idx files can grow to be quite large proportional to the 
> number of pages.

In my experiments with indexing the pages of dokuwiki.org, the largest
i*.idx file is 604KB, while pageword.idx - which back then was saved
using that idx_saveIndex function - is 3.8MB. But I think we/I should
run some more benchmarks to see whether the performance improvement of
that change is really significant; otherwise I would suggest leaving it
as it is.

> And yes, _freadline was to maintain compatibility with PHP 4.2 which was 
> the baseline back then.

Okay, so it can be safely removed now as fgets() alone is a lot faster.
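
To make this concrete, the replacement would essentially be just this
(untested sketch, $fn being the index file to read):

    // read the index line by line using fgets() directly instead of
    // going through the _freadline() wrapper
    $lines = array();
    $fh = @fopen($fn, 'r');
    if ($fh) {
        while (($line = fgets($fh)) !== false) {
            $lines[] = rtrim($line, "\n");
        }
        fclose($fh);
    }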

> Regarding the regex improvement, you're adding the changed pid*count 
> entry at the beginning of the line, where previously it was being moved 
> to the end. I'm not sure if it makes a difference, but the paranoid side 
> of me would rather retain the old behavior.

As that idx_updateIndexLine function is called quite often, I wanted to
avoid dealing with the "\n" at the end of every line. But there is also
the argument that keeping the old behavior makes it easier to verify
that the new index is really the same as the old one. I'll probably
rewrite that change to keep the old behavior, although I'm pretty sure
it doesn't make any difference.
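
Roughly like this (just an untested sketch, not the actual code):
remove the old pid*count entry wherever it sits and append the updated
one at the end, so the resulting line stays byte-identical to what the
old code produces:

    // $line includes its trailing "\n"; $pid is the numeric page ID,
    // $count the new count (0 removes the entry)
    function updateIndexLine($line, $pid, $count) {
        $line = trim(preg_replace('/(^|:)'.$pid.'\*\d*/', '',
                                  rtrim($line, "\n")), ':');
        if ($count) {
            $line .= (strlen($line) ? ':' : '').$pid.'*'.$count;
        }
        return $line."\n";
    }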

> > The problem with Xapian is that afaik there is no pure PHP version of
> > Xapian which means DokuWiki won't run on normal webspaces anymore
> >
> 
> I'm thinking of the option to allow a plugin to change the engine. The 
> current spaghetti state of inc/indexer.php precludes that. Most of it 
> would have to be rewritten as a class. Some of the old functions can be 
> kept around for compatibility, but not as a crutch.

That sounds like a good solution, as wikis with a lot of pages, which
probably have enough memory etc. anyway, could then use another indexer.
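
Something along these lines is what I would imagine (all names are made
up, this is just to sketch the direction):

    // hypothetical backend interface a search plugin could implement;
    // the current inc/indexer.php code would become the default
    // implementation
    interface DokuWiki_Indexer_Backend {
        function addPage($page, $text, $metadata);
        function deletePage($page);
        function lookup($query);
    }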

> > They shall be collected in an array that is passed through an event
> > in order to allow plugins to add their own indexes.
> 
> Is this redundant? Syntax plugins add custom metadata in the render 
> method. The renderer could have an "indexed" array, like "persistent", 
> that plugins write to in order to be searchable. (Indexable values are 
> not persistent, I think?) Unless your plugin doesn't have a syntax 
> component, but where does the metadata come from then?

The idea behind that is that indexes which are no longer needed, e.g.
because the plugin has been uninstalled, won't be updated anymore. As we
currently don't have any way to remove metadata from uninstalled
plugins, and it will also be difficult to do that as multiple plugins
could share the same metadata, we (Andi and I, iirc) thought that such
an event would be the best solution.
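
A plugin would then announce its keys through an action component,
something like this (the event name is made up, nothing of this is
implemented yet):

    class action_plugin_example extends DokuWiki_Action_Plugin {
        function register(Doku_Event_Handler $controller) {
            $controller->register_hook('INDEXER_METADATA_INDEX', 'BEFORE',
                                       $this, 'addKeys');
        }
        // only keys announced here keep getting indexed; once the
        // plugin is uninstalled the handler is gone and its indexes
        // simply stop being updated
        function addKeys(Doku_Event $event, $param) {
            $event->data[] = 'subject';
        }
    }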

> > For each of these fields to be indexed a file is created with all values
> > like for words and then another index file like the "i"-files we
> > currently have is created where for every value the number of every page
> > that has that value is noted. Additionally a backwards index is created
> > that contains for every page all values. As count probably won't
> > matter, just using linenumber:linenumber:linenumber:... should be enough.
> >
> 
> It should be safe to use the same word IDs (w*.idx).

The problem is that we also need a list of all values of a certain
metadata key in order to be able to generate things like tag clouds.
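
To illustrate with made-up values, for a key 'subject' there could be
three files:

    subject_w.idx  - one value per line, the line number is the value ID:
        line 0: php
        line 1: wiki

    subject_i.idx  - line N lists the pages (as page.idx line numbers)
                     that have value N:
        line 0: 3:7:12
        line 1: 7

    subject_p.idx  - the backwards index, line N lists the value IDs
                     of page N:
        line 3: 0
        line 7: 0:1
        line 12: 0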

>  > For each of these keys an index file shall be created
> 
> Does this mean each metadata key gets its own set of i*.idx files? 
> Another approach I had thought of was combining the page ID with a 
> metadata ID like "2.1" for the first metadata key of the second page. 
> Then this is treated as the current page ID is.

That sounds like an interesting approach. We discussed that issue on
IRC yesterday, too, and we also talked about how metadata should be
integrated into the search. If it were simply included in the main
index, it could be integrated without much effort; another idea would
be to have separate sections for it below the normal search results, as
shown at http://ickewiki.de/_media/dokumentation/tagging_search.png.

> What would the API for metadata searches be like? If I wanted to find 
> pages with a "relation media" key (but not "relation references") which 
> contains "wikimedia.org" how would I do that?

That's a good question and something we haven't really thought about in
that much detail. We thought about something like
searchMeta($key, $value, $ns=null), but of course matching partial URLs
might be useful, too. By the way, as you certainly know, in some cases
the backlinks are currently already added to the index, but until your
changes they were also processed by the tokenizer, which would of
course remove stopwords etc.

We should also think about external indexers that might simply index
the whole page, or even the HTML produced by the renderer, and use
things like stemming, if we want to allow such things (which would of
course be cool). On the other hand, most indexers allow indexing
certain extra fields, so metadata and the main content should be
separated somehow (though in our own indexer we could of course just
add them to the main index).

What's your suggestion for allowing such partial matches? I think your
idea with the metadata ID (or even another index file defining these
IDs by putting one key per line) would be a good solution for search,
but I can't really imagine creating something like a tag cloud
efficiently from such an index.
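
Just to make the direction a bit more concrete, here is a very rough
sketch of such a function based on the file layout proposed above
(nothing of this exists yet; the $partial flag would cover matching
'wikimedia.org' inside a stored value):

    function searchMeta($key, $value, $ns = null, $partial = false) {
        global $conf;
        $base  = $conf['indexdir'].'/'.$key;
        $pages = array();
        if (!@file_exists($base.'_w.idx')) return $pages;
        $values = file($base.'_w.idx');  // all values, one per line
        $index  = file($base.'_i.idx');  // value ID -> page numbers
        $all    = file($conf['indexdir'].'/page.idx');
        foreach ($values as $id => $stored) {
            $stored = rtrim($stored, "\n");
            if ($partial ? strpos($stored, $value) === false
                         : $stored !== $value) continue;
            $line = rtrim($index[$id], "\n");
            if ($line === '') continue;
            foreach (explode(':', $line) as $n) {
                $page = rtrim($all[$n], "\n");
                // optional namespace filter
                if ($ns === null || strpos($page, $ns.':') === 0) {
                    $pages[] = $page;
                }
            }
        }
        return $pages;
    }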

> Why not use my fork? (facetiously) Of course it doesn't matter where it 
> goes. I can merge into your branch if you'd like.

I can also merge the changes into your fork, if you prefer.

> I've pushed the change that allows external tokenizers. It seems to work 
> well enough, although the word lists won't be compatible. But that's why 
> I changed the version check to require a match.

Apart from some broken indentation it looks good, though I haven't
tested it.

Michael
-- 
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
