Hi,

On Sat, Dec 10, 2011 at 6:30 PM, Andreas Gohr <andi@xxxxxxxxxxxxxx> wrote:
>> I'm developping a plugin that index word by weight and in a perform search
>> sort page by searched words with the highest weight.
>>
>> So I want to propose to improve the search engine directly in the core of
>> dokuwiki.
>
> In fact I had something like this in mind when I designed the index
> system but somehow never bothered with added weighted scores.
>
> So if this improves the quality of results, sure why not add it to
> core? Please do a github fork and send a pull request with your
> proposed changes.

Here are a few thoughts of mine on the concept in general and on the pull request at https://github.com/splitbrain/dokuwiki/pull/70.

I think there is one thing we need to think about carefully: currently we index the page source. To determine the importance of words we need to index the result of the parser. Like Andi, I think this should be done with a special renderer. This has many advantages, such as getting rid of the syntax elements, but it also has disadvantages:

a) The output of a syntax might be dynamic or depend on external sources. How do we know what to index and when the index should be updated?

b) The output of a syntax might depend on the currently logged-in user (e.g. the include plugin includes a page only when the current user can access it). How can we take ACLs into consideration when using the index (or should we at all)?

Furthermore, I propose adding a new renderer for search that allows plugins to add words to the index with a certain importance.

I'm not against these changes; I just think there are a few problems we should think about and solve. For the first problem, I think we could add a cache handler like the ones we already have for other output formats, so plugins can do whatever they consider appropriate. The indexer would then check on every page request whether the index for that page should be updated.
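
To make the search renderer idea above a bit more concrete, here is a rough sketch of how collecting weighted words and ranking by them might work. It is written in Python rather than DokuWiki's PHP purely for illustration, and everything in it is an assumption: the class SearchRenderer, its methods add_words(), header() and cdata(), the weight values, and the rank() helper are made-up names, not existing DokuWiki API.

```python
from collections import defaultdict


class SearchRenderer:
    """Hypothetical renderer: collects weighted words instead of emitting markup."""

    # Illustrative weights only; real values would need tuning.
    HEADER_WEIGHT = 5
    TEXT_WEIGHT = 1

    def __init__(self):
        # word -> accumulated weight for the page being rendered
        self.weights = defaultdict(int)

    def add_words(self, text, weight):
        """Entry point a syntax plugin could call to index its own words."""
        for word in text.lower().split():
            self.weights[word] += weight

    def header(self, text):
        # Words in headings count more than words in body text.
        self.add_words(text, self.HEADER_WEIGHT)

    def cdata(self, text):
        self.add_words(text, self.TEXT_WEIGHT)


def rank(index, query):
    """Return page ids sorted by the summed weight of the query words."""
    words = query.lower().split()
    scores = {}
    for page, weights in index.items():
        score = sum(weights.get(word, 0) for word in words)
        if score > 0:
            scores[page] = score
    # Highest score first; ties are broken by page id for a stable order.
    return sorted(scores, key=lambda page: (-scores[page], page))


# Usage: two pages, one of which mentions the query word in a heading.
page_a = SearchRenderer()
page_a.header("Weighted search")
page_a.cdata("plain text about search")

page_b = SearchRenderer()
page_b.cdata("search mentioned once")

index = {"wiki:a": page_a.weights, "wiki:b": page_b.weights}
ranked = rank(index, "search")
```

A plugin that produces its own output would call add_words() with whatever weight it considers appropriate, which is the point of giving search its own renderer mode.
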

Ideally it should be possible to attach ACLs to tokens added to the index. I would also like to have that in the metadata index, because exactly this problem already exists there. I'm thinking about something like attaching the id of a page (or rather: its id in the index) to every entry in the index. All entries would then be fetched without taking ACLs into consideration, storing all matching tokens. After that, the found entries would be post-processed: first all tokens associated with ids the user can't access are removed, then all pages that no longer fall into the selection are removed. I'm not sure how fast this would be or whether it's the best solution, so if you have another idea, feel free to suggest it.

One might wonder why I care about metadata that is protected by ACLs. The problem is that, e.g., in the include plugin I want to store the metadata of all pages that are also included in the xhtml, for a simple reason: syntax plugins in the included pages could rely on it, for example for cache handling. If this were different, I would suggest skipping this feature and simply not including any data from included pages that isn't readable for anonymous users. I think this would also concern plugins like the data plugin when the table that shows a list of pages with additional data is to be indexed.

Currently plugins can add and remove text from the search index dynamically, so the indexer needs to re-parse the whole text input to get the page content. I'm wondering what to do with this. Disregarding backwards compatibility, I would propose removing that functionality and instead passing the whole rendered search index input to an event. Then plugins could also add their own content. Previously plugins could remove certain parts of the content. I think this is no longer necessary: with a special search renderer, syntax plugins can themselves specify the content that should be added.

I think we should also think about a fallback for syntax plugins that haven't implemented the search renderer mode.
Maybe we could introduce something like a function getSupportedRenderers() that returns an array with 'xhtml' and 'metadata' as the default content. We would then only use the syntax plugin's render function when 'search' or 'all' is returned, and otherwise fall back to the original syntax that was in the page. The idea behind 'all' is that syntax plugins can simply call functions of the core renderer instead of directly producing any output, and thus support all renderers that exist in a DokuWiki installation.

Michael
--
DokuWiki mailing list - more info at http://www.dokuwiki.org/mailinglist