[dokuwiki] New meta data system

Hi everybody,

during the hackfest at the WikiFest Berlin we discussed the metadata
system at length. As not everybody was there and nothing is really fixed
yet, I thought it might be a good idea to summarize the discussion in
this rather long email so everyone can participate in the still ongoing
discussion and implementation.

From the beginning it was clear that SQLite is not an option, as it is
only available on about 50% of all DokuWiki installations we have data
about.

We looked at several alternatives. An obvious one was Lucene, of which a
pure PHP implementation exists in the Zend Framework. We decided against
it because its documentation states that it consumes a lot of memory,
that an index can't be reconstructed from partially corrupted files, and
that it is generally not recommended as a system for storing data. It
would have allowed very nice features like searching in the produced
HTML code, but that isn't what we were looking for.

We found another pure PHP database, Flat File DataBase (FFDB), at
http://www.sourceforge.net/projects/ffdb-php/, but it wasn't convincing
as it is more or less unmaintained and seemed to perform poorly. We also
found SofaDB, which claims to be like CouchDB, but it is in an early
alpha stage and seems to be more like our current metadata store,
implemented with JSON.

We also had a look at http://mimesis.110mb.com/, another pure PHP
key-value store, but it didn't look that promising either.

Apart from that we agreed that the current concept of storing metadata
in relatively simple files alongside the wiki pages fits the spirit of
DokuWiki quite well and thus should be kept.

We discussed the file format of these meta files, whether it should be
changed to JSON, and possibly whether instructions should be stored as
JSON, too. This would make the files more human-readable, which would
ease debugging and also make it easier to create them from outside
DokuWiki, e.g. in an import script. We agreed, however, that writing
meta files from outside DokuWiki is not a common use case and thus
performance is what really matters. I therefore created some performance
tests using a dump of dokuwiki.org as test data. These tests, including
detailed results for a couple of hundred pages, can be found at
https://github.com/michitux/dokuwiki-test-serialize. In short: for
instructions, JSON encoding seems to be twice as fast as serialize, but
for decoding, unserialize is about 10% faster than JSON decoding. This
is with PHP's native json functions, which live in a module that might
be disabled; DokuWiki therefore already includes a PHP-only JSON
implementation, whose decoding is about 250 times slower than the
native one. For metadata it is different: there, serialize is about 10%
faster than JSON encoding, and unserialize is almost twice as fast as
JSON decoding. In that case the PHP-only JSON decoder is about 500
times slower.
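To illustrate the kind of measurement involved, here is a rough Python
analog of such a micro-benchmark, with pickle standing in for PHP's
serialize and a made-up metadata sample; the actual PHP tests and
numbers are in the repository linked above, and absolute timings will
of course differ:

```python
import json
import pickle
import time

def bench(fn, arg, runs=2000):
    """Time `runs` invocations of fn(arg) and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(arg)
    return time.perf_counter() - start

# A structure roughly shaped like DokuWiki page metadata (made-up sample).
meta = {
    'title': 'wiki:syntax',
    'date': {'created': 1270000000, 'modified': 1280000000},
    'contributor': {'1': 'andi', '2': 'michael'},
    'relation': {'references': {'wiki:dokuwiki': True}},
}

enc_json = bench(json.dumps, meta)
enc_pickle = bench(pickle.dumps, meta)
dec_json = bench(json.loads, json.dumps(meta))
dec_pickle = bench(pickle.loads, pickle.dumps(meta))

print(f"encode  json={enc_json:.4f}s  pickle={enc_pickle:.4f}s")
print(f"decode  json={dec_json:.4f}s  pickle={dec_pickle:.4f}s")
```

The real tests additionally run against realistic instruction arrays,
which is why the instruction and metadata results differ.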

These numbers convinced us to keep the current file format for both
metadata and instructions. Should there be reasons to change the format
later, that would also be easy, as reading and writing are wrapped in a
few helper functions.
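As a sketch of what such a wrapper looks like (Python for illustration,
with pickle standing in for PHP's serialize; the function names are
hypothetical, not DokuWiki's actual API):

```python
import pickle  # standing in for PHP's serialize()/unserialize()

def write_meta(path, data):
    """Serialize metadata to `path`; all callers go through this helper,
    so the on-disk format can be changed in one place."""
    with open(path, 'wb') as fh:
        fh.write(pickle.dumps(data))

def read_meta(path):
    """Read metadata back; return an empty dict for a missing file."""
    try:
        with open(path, 'rb') as fh:
            return pickle.loads(fh.read())
    except FileNotFoundError:
        return {}
```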

We noticed, however, that metadata isn't written where it should be,
which causes a couple of bugs. Currently the metadata is written
whenever it is accessed but doesn't exist, or when the xhtml cache is
not used. That means the metadata is only updated when the page content
is displayed, so it isn't updated when the blogtng plugin is used, and
when a title changes while useheading is activated, the old title is
shown on the first view. Andi and I therefore discussed that updating it
in saveWikiText and in the indexer whenever the page is newer than the
metadata should be enough (and thus the old update call should be
completely replaced by the two new ones). Some plugins that disable the
xhtml cache might rely on the metadata being rendered on every view, but
we hope no such plugin exists. If you know of something that might break
because of this change, please tell us.
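The "page newer than metadata" condition the indexer would check can be
sketched as a simple mtime comparison (Python for illustration; the
file layout and names are hypothetical, DokuWiki's own logic lives in
PHP):

```python
import os

def metadata_outdated(page_file, meta_file):
    """Return True when the page was saved after its metadata was last
    rendered (or no metadata exists yet) -- the condition under which
    the indexer run would re-render the metadata."""
    if not os.path.exists(meta_file):
        return True
    return os.path.getmtime(page_file) > os.path.getmtime(meta_file)
```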

Additionally, we want a quick way to search for certain criteria in
meta files, so that e.g. tagging could be implemented easily without
additional structures. We had a look at the current DokuWiki indexer and
have some quite concrete ideas of how it can be used for metadata. Of
course not all metadata fields shall be indexed. The best way we could
find for selecting these fields is introducing an event that populates a
list: it is filled with some default entries and can be extended by
plugins. For each of these keys an index file shall be created in the
same way as the current word indexes. As for words, there shall also be
a reverse index from pages to the different meta properties, used for
removing old data. The index shall be updated in lib/exe/indexer.php
whenever the metadata file is newer than a special meta-indexed file in
the same directory. When reading such an index, a short check that the
corresponding pages still exist shall be done, and any page that no
longer exists shall be removed from the index. We are not completely
sure what to do with the indexes of uninstalled plugins, but we think
just keeping them, like persistent metadata, will be the best/easiest
solution.
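A rough sketch of the proposed scheme (Python for illustration; all
names are made up, not the eventual DokuWiki API): an event collects
the indexed keys, starting from defaults and letting plugins append,
and reading an index prunes pages that no longer exist:

```python
DEFAULT_INDEXED_KEYS = ['title', 'relation references']  # made-up defaults

def collect_indexed_keys(plugin_handlers):
    """Fire the (hypothetical) event: start from the default entries and
    let every registered plugin handler extend the list."""
    keys = list(DEFAULT_INDEXED_KEYS)
    for handler in plugin_handlers:
        handler(keys)  # a handler may append the keys it wants indexed
    return keys

def read_index(index, page_exists):
    """Read a value -> pages index, dropping pages that no longer exist,
    mirroring the check-on-read cleanup described above."""
    cleaned = {}
    for value, pages in index.items():
        live = [p for p in pages if page_exists(p)]
        if live:
            cleaned[value] = live
    return cleaned
```

A tag plugin, for example, would register a handler that appends its
own key, and would then get a per-value index maintained for it by the
core indexer.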

Do you have additional ideas or comments, or do you know of the fast,
low-memory, pure PHP (ideally document-oriented) database we've missed?
Then please write about it!

And by the way: if you didn't understand anything of what I've written
above (and yet read this far), you should either dig deeper into the
whole metadata and indexer stuff documented at
http://www.dokuwiki.org/devel:metadata and
http://www.dokuwiki.org/devel:fulltextindex, or just ignore it, as
hopefully everything will work so well that you won't notice anything
in the next release except fewer bugs and new features.

Michael
-- 
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
