[dokuwiki] Re: New meta data system

From: TNHarris <telliamed@xxxxxxxxxxx>
To: dokuwiki@xxxxxxxxxxxxx
Date: Mon, 15 Nov 2010 20:03:44 -0500

On 11/15/2010 05:48 AM, Michael Hamann wrote:


Additionally we want to have a quick way to search for certain criteria
in meta files so e.g. tagging could be implemented easily without
additional structures. We had a look at the current DokuWiki indexer and
have some quite concrete ideas how it can be used for metadata.


I'd be interested in seeing these "concrete ideas".

I'm working on improvements to the indexer[1]. I think the indexer should be a class with proper separation of concerns, and there would be the option of completely replacing it (with Xapian, for example).
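
To illustrate what "a class with proper separation of concerns" could mean here, the following is a minimal sketch only. The names Tokenizer, WhitespaceTokenizer and InMemoryIndexer are hypothetical and are not taken from DokuWiki or from the branch in [1]; the point is just that the tokenizer and the index storage sit behind separate interfaces, so either could be swapped out.

```php
<?php
// Hypothetical interface: anything that turns text into a list of words.
interface Tokenizer {
    /** @return string[] */
    public function tokenize($text);
}

// Simplest possible tokenizer: ASCII-lowercase, split on whitespace.
class WhitespaceTokenizer implements Tokenizer {
    public function tokenize($text) {
        return preg_split('/\s+/u', strtolower(trim($text)), -1,
                          PREG_SPLIT_NO_EMPTY);
    }
}

// Hypothetical indexer that only stores word => page IDs in memory.
// It knows nothing about how words are found; that is the tokenizer's job.
class InMemoryIndexer {
    private $tokenizer;
    private $index = array();

    public function __construct(Tokenizer $tokenizer) {
        $this->tokenizer = $tokenizer;
    }

    // Record every distinct word of $text as occurring on page $id.
    public function addPage($id, $text) {
        foreach (array_unique($this->tokenizer->tokenize($text)) as $word) {
            $this->index[$word][] = $id;
        }
    }

    // Return the IDs of all pages containing $word (empty array if none).
    public function lookup($word) {
        $word = strtolower($word);
        return isset($this->index[$word]) ? $this->index[$word] : array();
    }
}
```

With this shape, replacing the whole backend (for example with something Xapian-based) would mean providing another class with the same addPage/lookup methods, and changing the word-breaking rules would only mean passing a different Tokenizer to the constructor.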

Right now I'm just looking at the tokenizer. Chinese and Japanese texts need a morphological analyzer to find word breaks. I'm putting in an option to use an external tokenizer such as MeCab[2]. I don't think it should be a plugin. The function is too low-level, and having to maintain an action hook would make it difficult to change the indexer. And using a different external tokenizer is just a matter of changing the exec call. Although there is one PHP extension that I know of.
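
A rough sketch of how calling an external tokenizer might look, assuming the tool reads text on stdin and writes space-separated words on stdout (MeCab does this with its "-O wakati" output format). The function name external_tokenize and the whitespace fallback are hypothetical and not part of DokuWiki; the sketch is only suited to short strings, since it writes all input before reading any output.

```php
<?php
/**
 * Hypothetical helper: split $text into words using an external command.
 * Falls back to plain whitespace splitting if the command is missing,
 * exits with an error, or prints nothing.
 */
function external_tokenize($text, $command = 'mecab -O wakati') {
    $spec = array(
        0 => array('pipe', 'r'),  // child's stdin
        1 => array('pipe', 'w'),  // child's stdout
        2 => array('pipe', 'w'),  // child's stderr (read and discarded)
    );
    $output = '';
    $status = -1;
    $proc = @proc_open($command, $spec, $pipes);
    if (is_resource($proc)) {
        @fwrite($pipes[0], $text . "\n");
        fclose($pipes[0]);
        $output = stream_get_contents($pipes[1]);
        stream_get_contents($pipes[2]);
        fclose($pipes[1]);
        fclose($pipes[2]);
        $status = proc_close($proc);
    }
    // Fallback: treat the original text as if it were already segmented.
    if ($status !== 0 || trim($output) === '') {
        $output = $text;
    }
    return preg_split('/\s+/u', trim($output), -1, PREG_SPLIT_NO_EMPTY);
}
```

Switching to a different tokenizer would then only mean passing another command string as the second argument.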

I have a question for Kazutaka Miyasaka, who rewrote the fulltext term parser.


function ft_termParser($term, &$stopwords,
                       $consider_asian = true,
                       $phrase_mode = false) {
    $parsed = '';
    if ($consider_asian) {
        // successive asian characters need to be searched as a phrase
        $words = preg_split('/('.IDX_ASIAN.'+)/u', $term, -1,
                      PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (preg_match('/'.IDX_ASIAN.'/u', $word))
               $phrase_mode = true;
            $parsed .= ft_termParser($word, $stopwords, false,
                                     $phrase_mode);
        }
    } else {
//... etc

In the foreach loop, $phrase_mode is changed if there is an Asian character in the string, then ft_termParser is called recursively. But I think it's supposed to only be changing the value for the call, and not for successive iterations. Shouldn't it instead be... ?


$parsed .= ft_termParser($word, $stopwords, false,
              $phrase_mode || preg_match('/'.IDX_ASIAN.'/u', $word));

But I'm not sure if I'm understanding the function correctly. Additionally, some unit tests for the fulltext engine will help with regressions as the indexer changes.


[1] https://github.com/whoopdedo/dokuwiki/tree/tokenizer-rewrite
[2] http://sourceforge.net/projects/mecab/

--
- tom
telliamed@xxxxxxxxxxxxx
--
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
