[dokuwiki] Re: [Search Engin Improvement] Indexing words by weight...

  • From: Michael Hamann <michael@xxxxxxxxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Sat, 4 Feb 2012 19:30:20 +0100

Hi,

On Sat, Dec 10, 2011 at 6:30 PM, Andreas Gohr <andi@xxxxxxxxxxxxxx> wrote:
>> I'm developping a plugin that index word by weight and in a perform search
>> sort page by searched words with the highest weight.
>>
>> So I want to propose to improve the search engine directly in the core of
>> dokuwiki.
>
> In fact I had something like this in mind when I designed the index
> system but somehow never bothered with added weighted scores.
>
> So if this improves the quality of results, sure why not add it to
> core? Please do a github fork and send a pull request with your
> proposed changes.

Here are a few thoughts of mine on the concept in general and on the
pull request at
https://github.com/splitbrain/dokuwiki/pull/70.

I think there is one thing we need to carefully think about: Currently
we are indexing the page source. In order to get the importance of
words we need to index the result of the parser. Like Andi I think
this should be done using a special renderer. This has many advantages
like getting rid of the syntax elements, but it also has
disadvantages:
a) The output of a syntax might be dynamic or depend on external
sources. How do we know what to index and when the index should be
updated?
b) The output of a syntax might depend on the currently logged in user
(e.g. the include plugin includes a page only when the current user
can access it). How can we take ACLs into consideration when using the
index (or should we at all)?

I furthermore propose to add a new renderer for search which allows
plugins to add words to the index with a certain importance.
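
To make the proposal a bit more concrete, here is a minimal sketch of
what such a renderer could look like. It is written in Python only for
brevity (DokuWiki itself is PHP); the class, the method names and the
weight values are all hypothetical and not taken from the pull request.

```python
from collections import defaultdict


class SearchRenderer:
    """Hypothetical renderer that collects weighted words instead of output."""

    # Assumed weights; the thread does not specify any values.
    HEADER_WEIGHT = 5
    TEXT_WEIGHT = 1

    def __init__(self):
        self.weights = defaultdict(int)

    def add_words(self, text, weight):
        """Entry point a syntax plugin could call to index its own words."""
        for word in text.lower().split():
            self.weights[word] += weight

    def header(self, text):
        self.add_words(text, self.HEADER_WEIGHT)

    def cdata(self, text):
        self.add_words(text, self.TEXT_WEIGHT)


def rank_pages(index, query):
    """Return page ids sorted by the summed weight of the query words."""
    words = query.lower().split()
    scores = {
        page: sum(weights.get(w, 0) for w in words)
        for page, weights in index.items()
    }
    hits = [page for page, score in scores.items() if score > 0]
    return sorted(hits, key=lambda page: -scores[page])
```

A page whose heading contains the searched word would then rank above
a page that only mentions it in running text.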

I'm not against these changes; I just think there are a few problems
we should think about and solve.

For the first problem I think we could add a cache handler like we
already have for other output formats so plugins can do whatever they
consider appropriate. The indexer would then check on every request of
a page if its index should be updated.
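
A rough sketch of such a check, again in Python for brevity and with
purely hypothetical names: the index entry of a page counts as stale
when the page or one of its dependencies changed after indexing, or
when a plugin-supplied check says so.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IndexCacheCheck:
    """Hypothetical record of what a page's search index entry depends on."""

    indexed_at: float  # time the page was last indexed
    page_mtime: float  # last modification time of the page source
    # Modification times of other sources, e.g. included pages.
    dependency_mtimes: List[float] = field(default_factory=list)
    # Plugin callbacks; returning False forces a reindex.
    plugin_checks: List[Callable[[], bool]] = field(default_factory=list)

    def is_fresh(self) -> bool:
        if self.page_mtime > self.indexed_at:
            return False
        if any(m > self.indexed_at for m in self.dependency_mtimes):
            return False
        return all(check() for check in self.plugin_checks)
```

The indexer would call is_fresh() on each page request and only
re-render the page for the index when it returns False.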

Ideally it should be possible to attach ACLs to tokens added to the
index. I would also like to have that in the metadata index, because
exactly this problem already exists there. I'm thinking about
something like attaching the id of a page (or rather: its id in the
index) to every entry in the index. Then all entries are fetched
without taking ACLs into consideration, storing all matching tokens.
After that the found entries are post-processed removing all tokens
associated with ids the user can't access, removing all pages that no
longer fall into the selection. I'm not sure how fast this works and
if it's the best solution, so if you have another idea, feel free to
suggest it.
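
The post-processing described above can be sketched as follows. This
is only an illustration in Python with a made-up entry layout (word,
page, source), where "source" is the page a word originally came from;
it also assumes that the page itself must be readable, which is not
spelled out above.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical index entry: (word, page_id, source_id). source_id is the
# page the word originally came from, e.g. an included page.
Entry = Tuple[str, str, str]


def search(
    entries: List[Entry],
    query_words: Set[str],
    can_read: Callable[[str], bool],
) -> List[str]:
    # Step 1: fetch all matching entries without looking at ACLs.
    matches = [e for e in entries if e[0] in query_words]
    # Step 2: drop entries tied to pages the user cannot access.
    visible = [e for e in matches if can_read(e[1]) and can_read(e[2])]
    # Step 3: keep only pages that still match every query word.
    found: Dict[str, Set[str]] = {}
    for word, page, _source in visible:
        found.setdefault(page, set()).add(word)
    return sorted(p for p, words in found.items() if words >= query_words)
```

The cost of steps 2 and 3 grows with the number of matching entries,
which is where the speed concern comes from.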

One might wonder why I care about metadata that is protected by ACLs.
The problem is that e.g. in the include plugin I want to store the
metadata of all pages that are also included in the xhtml for a simple
reason: Syntax plugins in the included pages could rely on them, for
example for cache handling. If this were different I would suggest
skipping this feature and simply not including any data from included
pages that isn't readable by anonymous users. I think this would also
concern plugins like the data plugin when the table that shows a list
of pages with additional data shall be indexed.

Currently plugins can add and remove text from the search index
dynamically, so the indexer needs to re-parse the whole text input to
get the page content. I'm wondering what to do with this. Leaving
backwards compatibility aside, I would propose removing that
functionality and instead passing the whole rendered search index input
to an event. Plugins could then also add their own content. Previously
plugins could remove certain parts of the content. I think this is no
longer necessary: with a special search renderer, syntax plugins can
themselves specify the content that should be added.
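
As a sketch of the event idea (Python for brevity; the event name
SEARCH_INDEX_INPUT and the helper functions are invented for this
example and are not existing DokuWiki API): the indexer hands the
rendered words to all registered handlers, which may add their own.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]
_handlers: Dict[str, List[Handler]] = {}


def register(event: str, handler: Handler) -> None:
    _handlers.setdefault(event, []).append(handler)


def trigger(event: str, data: dict) -> dict:
    # Handlers modify the data in place.
    for handler in _handlers.get(event, []):
        handler(data)
    return data


def index_page(page_id: str, rendered: Dict[str, int]) -> Dict[str, int]:
    """Pass the rendered search input (word -> weight) to the event."""
    data = {"page": page_id, "words": dict(rendered)}
    trigger("SEARCH_INDEX_INPUT", data)  # hypothetical event name
    return data["words"]


def add_homepage_keyword(data: dict) -> None:
    """Example plugin handler that adds an extra word for one page."""
    if data["page"] == "wiki:start":
        data["words"]["homepage"] = data["words"].get("homepage", 0) + 3


register("SEARCH_INDEX_INPUT", add_homepage_keyword)
```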

I think we should also think about a fallback for syntax plugins that
haven't implemented the search renderer mode. Maybe we could
introduce something like a function getSupportedRenderers() that
returns an array with 'xhtml' and 'metadata' as default content, and
only use the syntax plugin's render function when 'search' or 'all' is
returned, otherwise using the original syntax that was in the page? The
idea about 'all' is that it is possible for syntax plugins to simply
call functions of the core renderer instead of directly producing any
output and thus to support all renderers that exist in a DokuWiki
installation.
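
The fallback could look roughly like this. The sketch is in Python
for brevity; get_supported_renderers() mirrors the proposed
getSupportedRenderers(), and the other names are hypothetical.

```python
class SearchCollector:
    """Stand-in for the search renderer; it just collects text."""

    def __init__(self):
        self.chunks = []

    def cdata(self, text):
        self.chunks.append(text)


class SyntaxPlugin:
    """Plugin that has not been updated for the search renderer."""

    def get_supported_renderers(self):
        return ["xhtml", "metadata"]  # proposed default

    def render(self, mode, renderer, data):
        return False


class CoreOnlyPlugin(SyntaxPlugin):
    """Plugin that only calls core renderer functions, so it declares 'all'."""

    def get_supported_renderers(self):
        return ["all"]

    def render(self, mode, renderer, data):
        renderer.cdata(data)
        return True


def index_syntax(plugin, renderer, data, original_syntax):
    supported = plugin.get_supported_renderers()
    if "search" in supported or "all" in supported:
        plugin.render("search", renderer, data)
    else:
        # Fallback: index the raw syntax as it appears in the page.
        renderer.cdata(original_syntax)
```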

Michael
-- 
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
