[dokuwiki] [Search Engine Improvement] Indexing words by weight...

  • From: Adrien Bettini <abettini@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Fri, 09 Dec 2011 17:34:48 +0100

Hello,

I'm developing a plugin that indexes words by weight and, when a search is performed, sorts the pages so that those where the searched words have the highest weight come first.

The drawbacks of the plugin:
1. It requires adding code to inc/indexer.php (DokuWiki core)
    - I added a hook in the method getPageWords(), just before "return $index;"
    - I use this hook in getPageWords() to get the words indexed for the saved page
2. It duplicates code from indexer.php (with small changes)
    - I duplicated the code to reuse DokuWiki code that works fine
    - the ideas in DokuWiki are preserved; the plugin only adds the improvement
3. It can't work with the command-line indexer
    - the plugin has to know the page ID to use the method updateTuple()
    - where the hook is placed, the page ID is not available, except through the global ID, which doesn't work on the command line

Weight assigned to each kind of markup (can be changed):
'header'       =>    16, //h1: 16; h2: 8; h3: 5; h4: 4; h5: 3
'strong'       =>    6,
'internalink'  =>    5,
'externallink' =>    4,
'acronym'      =>    2,
'underline'    =>    3,
'emphasis'     =>    3,
'deleted'      =>    2,
'monospace'    =>    1,

Text without decoration has no weight of its own, but I add the number of occurrences of the word to its weight.
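
The weighting rule above can be sketched as follows. This is only an illustration in Python, not the plugin's actual PHP code: the function name word_weight and the (markup, level) tuples are assumptions made for the example, while the numbers are copied from the list in this post.

```python
# Hypothetical sketch of the weighting rule; not the plugin's actual code.
# Markup weights copied from the list in this post (keys spelled as posted).
MARKUP_WEIGHTS = {
    "strong": 6,
    "internalink": 5,
    "externallink": 4,
    "acronym": 2,
    "underline": 3,
    "emphasis": 3,
    "deleted": 2,
    "monospace": 1,
}
# Header weight depends on the header level (h1..h5).
HEADER_WEIGHTS = {1: 16, 2: 8, 3: 5, 4: 4, 5: 3}


def word_weight(decorations, occurrences):
    """Return the weight of one word on one page.

    decorations: list of (markup, level) tuples, one per decorated
                 appearance of the word; level is only read for headers.
    occurrences: total number of times the word appears on the page.
    """
    weight = occurrences  # undecorated text only contributes its count
    for markup, level in decorations:
        if markup == "header":
            weight += HEADER_WEIGHTS[level]
        else:
            weight += MARKUP_WEIGHTS[markup]
    return weight
```

For instance, under these assumptions a word that appears twice on a page, once of them inside an h2 header, would get 8 + 2 = 10.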

#### Example:
dokuwiki page content:

- www.{domain}/devel:php :
====== PHP: Hypertext Preprocessor ======
**PHP** is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages.

**PHP** development began in 1994 when the Danish/Greenlandic/Canadian programmer Rasmus Lerdorf initially created a set of Perl scripts he called "Personal Home Page Tools" to maintain his personal homepage.

- www.{domain}/devel:adrien :
====== Adrien ======
My name is adrien and I'm a PHP developper.
I like programming in **PHP**. I use eclipse to program in PHP.
I use different PHP framework like yii and ZendFramework.
My favorite javascript framework is jQuery. I use it for ajax to requete PHP server.


Explanation:
I search for "PHP" in DokuWiki.
Without the improvement:
Page found:                        occurrences:
Adrien...                          5
PHP: Hypertext Preprocessor...     3

With the improvement:
Page found:                        weight:
PHP: Hypertext Preprocessor...     16(h1)+6(strong)+6(strong)+3(occurrence) = 31
Adrien...                          6(strong)+5(occurrence) = 11

####
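
The two rankings in the example above can be reproduced with a short sketch. It is written in Python for illustration only; the dictionary layout and the function names rank_by_occurrences and rank_by_weight are assumptions, and the figures are typed in by hand from the example rather than read from a real index.

```python
# Hypothetical data reproducing the example above; not real index output.
# page title -> (sum of markup weights for "PHP", occurrences of "PHP")
pages = {
    "PHP: Hypertext Preprocessor": (16 + 6 + 6, 3),  # h1 + strong + strong
    "Adrien": (6, 5),                                # strong
}


def rank_by_occurrences(pages):
    """Ordering without the improvement: occurrence count only."""
    return sorted(pages, key=lambda title: pages[title][1], reverse=True)


def rank_by_weight(pages):
    """Ordering with the improvement: markup weight plus occurrence count."""
    return sorted(pages, key=lambda title: sum(pages[title]), reverse=True)


print(rank_by_occurrences(pages))  # "Adrien" comes first (5 vs 3)
print(rank_by_weight(pages))       # the PHP page comes first (31 vs 11)
```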

So I would like to propose improving the search engine directly in the DokuWiki core.

Why should this improvement be in the DokuWiki core?
1. Better search engine - users have a better chance of finding the page they are looking for (pages with the highest weight come first)
2. No duplicated code, unlike the plugin
3. No code has to be added to the DokuWiki core just to make the plugin work
4. Better performance than the plugin

The next upgrade would be to give more weight to pages that are the target of the most internal links from other pages.
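
How such a link bonus would be computed is not decided yet. One possible reading, sketched in Python for illustration only, is to count for each page how many other pages link to it and add a fixed bonus per linking page. The function names, the bonus_per_link parameter and the "start" page are assumptions made for this example.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count, for each page, how many other pages link to it.

    links: dict mapping a page ID to the list of page IDs it links to.
    A page linking to itself is ignored, and several links from the
    same source page to the same target are counted once.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def boosted_weight(base_weight, page_id, counts, bonus_per_link=1):
    # bonus_per_link is an assumed tuning value, not a figure from this post
    return base_weight + bonus_per_link * counts[page_id]


# Hypothetical link graph for the two example pages plus a "start" page.
links = {
    "devel:adrien": ["devel:php", "devel:php"],
    "start": ["devel:php", "devel:adrien"],
    "devel:php": ["devel:php"],
}
counts = inbound_link_counts(links)
print(boosted_weight(31, "devel:php", counts))     # 31 + 2 linking pages
print(boosted_weight(11, "devel:adrien", counts))  # 11 + 1 linking page
```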

Only inc/indexer.php would need to be modified for this improvement.

Adrien Bettini
Societe astrel <http://www.astrel.fr/>
--
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
