[dokuwiki] Re: New meta data system

  • From: TNHarris <telliamed@xxxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Mon, 15 Nov 2010 20:03:44 -0500

On 11/15/2010 05:48 AM, Michael Hamann wrote:

Additionally we want to have a quick way to search for certain criteria
in meta files so e.g. tagging could be implemented easily without
additional structures. We had a look at the current DokuWiki indexer and
have some quite concrete ideas how it can be used for metadata.

I'd be interested in seeing these "concrete ideas".

I'm working on improvements to the indexer[1]. I think the indexer should be a class with proper separation of concerns, which would also leave open the option of replacing it completely (with Xapian, for example).
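To make "separation of concerns" a bit more concrete, here is a rough sketch of one way the pieces could be split. Every name in it (Tokenizer, IndexBackend, WhitespaceTokenizer, MemoryBackend, Indexer) is hypothetical and made up for illustration; none of this is existing DokuWiki code, and the in-memory backend only stands in for whatever storage (or Xapian binding) would really be used.

```php
<?php
// Hypothetical interfaces; nothing here exists in DokuWiki.
interface Tokenizer {
    /** @return string[] words found in $text */
    public function tokenize(string $text): array;
}

interface IndexBackend {
    /** @param string[] $words */
    public function addPage(string $page, array $words): void;
    /** @return string[] ids of pages containing $word */
    public function lookup(string $word): array;
}

// Naive tokenizer: lowercases (ASCII only) and splits on whitespace.
final class WhitespaceTokenizer implements Tokenizer {
    public function tokenize(string $text): array {
        $words = preg_split('/\s+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        return $words === false ? [] : $words;
    }
}

// Toy backend that keeps the word => pages map in memory.
final class MemoryBackend implements IndexBackend {
    private array $index = [];

    public function addPage(string $page, array $words): void {
        foreach ($words as $word) {
            $this->index[$word][$page] = true;
        }
    }

    public function lookup(string $word): array {
        return array_map('strval', array_keys($this->index[$word] ?? []));
    }
}

// The indexer only wires the two parts together, so either can be swapped.
final class Indexer {
    private Tokenizer $tokenizer;
    private IndexBackend $backend;

    public function __construct(Tokenizer $tokenizer, IndexBackend $backend) {
        $this->tokenizer = $tokenizer;
        $this->backend = $backend;
    }

    public function addPage(string $page, string $text): void {
        $this->backend->addPage($page, $this->tokenizer->tokenize($text));
    }

    /** @return string[] pages containing any word of $query */
    public function lookup(string $query): array {
        $pages = [];
        foreach ($this->tokenizer->tokenize($query) as $word) {
            $pages = array_merge($pages, $this->backend->lookup($word));
        }
        return array_values(array_unique($pages));
    }
}

$indexer = new Indexer(new WhitespaceTokenizer(), new MemoryBackend());
$indexer->addPage('wiki:syntax', 'Formatting Syntax');
print_r($indexer->lookup('syntax'));
```

With this split, replacing the whole engine would mean writing another IndexBackend, and a different word-breaking strategy would be another Tokenizer.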

Right now I'm just looking at the tokenizer. Chinese and Japanese texts need a morphological analyzer to find word breaks. I'm putting in an option to use an external tokenizer such as MeCab[2]. I don't think it should be a plugin: the function is too low-level, and having to maintain an action hook would make it difficult to change the indexer. Using a different external tokenizer is then just a matter of changing the exec call, although I do know of one PHP extension.
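As a rough illustration of the external-tokenizer idea, the sketch below pipes text through a command and splits its output on whitespace. The function name external_tokenize is hypothetical, and it uses proc_open rather than a plain exec so the text can be fed on stdin; this is an assumption about how such a call might look, not the code in the branch.

```php
<?php
// Hypothetical helper, not part of DokuWiki: run $command, feed it $text on
// stdin, and treat its whitespace-separated stdout as the token list.
function external_tokenize(string $text, string $command): array {
    $spec = [
        0 => ['pipe', 'r'],  // child reads its stdin from us
        1 => ['pipe', 'w'],  // child writes its stdout to us
    ];
    $proc = proc_open($command, $spec, $pipes);
    if (!is_resource($proc)) {
        throw new RuntimeException("cannot start tokenizer: $command");
    }

    // Fine for short inputs; very large inputs would need interleaved
    // reads and writes to avoid filling the pipe buffers.
    fwrite($pipes[0], $text);
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $status = proc_close($proc);
    if ($status !== 0) {
        throw new RuntimeException("tokenizer exited with status $status");
    }

    $tokens = preg_split('/\s+/u', (string) $output, -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

// With MeCab installed, its wakati output format prints space-separated words:
// $tokens = external_tokenize($text, 'mecab -O wakati');
```

Switching to another analyzer would then only mean passing a different command string, provided it also prints tokens separated by whitespace.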

I have a question for Kazutaka Miyasaka, who rewrote the fulltext term parser.

function ft_termParser($term, &$stopwords,
                       $consider_asian = true,
                       $phrase_mode = false) {
    $parsed = '';
    if ($consider_asian) {
        // successive asian characters need to be searched as a phrase
        $words = preg_split('/('.IDX_ASIAN.'+)/u', $term, -1,
                      PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (preg_match('/'.IDX_ASIAN.'/u', $word))
               $phrase_mode = true;
            $parsed .= ft_termParser($word, $stopwords, false,
                                     $phrase_mode);
        }
    } else {
//... etc

In the foreach loop, $phrase_mode is set to true if the current word contains an Asian character, and then ft_termParser is called recursively. But I think the intent is to change the value only for that call, not for successive iterations. Shouldn't it instead be the following?

$parsed .= ft_termParser($word, $stopwords, false,
              $phrase_mode || preg_match('/'.IDX_ASIAN.'/u', $word));

But I'm not sure I'm understanding the function correctly. Additionally, some unit tests for the fulltext engine would help catch regressions as the indexer changes.

[1] https://github.com/whoopdedo/dokuwiki/tree/tokenizer-rewrite
[2] http://sourceforge.net/projects/mecab/

--
- tom
telliamed@xxxxxxxxxxxxx
--
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist