[yunqa.de] Re: DiSqlite's FTS and Chinese characters

  • From: Edwin Yip <edwin.yip@xxxxxxxxxxxxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Tue, 22 Sep 2009 20:14:59 +0800

Hi Ralf,
Thank you for the detailed comments. Maybe I can just use the LIKE operator
instead when the query string contains non-European characters? For example:

if search_string_is_european_lang then
  SELECT * FROM my_ftstable WHERE my_ftstable MATCH 'myWord'
else
  SELECT * FROM my_ftstable WHERE field1 LIKE '%myChineseWord%'

Although I'm not sure whether LIKE can be used against an FTS3 virtual table.
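
To make the idea above concrete, here is a minimal sketch in Python (used only
for illustration; the thread itself is about Delphi). It picks MATCH or LIKE
depending on whether the search term contains CJK ideographs. The names
contains_cjk and build_search_query are hypothetical, my_ftstable and field1
are taken from the pseudocode above, and the Unicode ranges checked are an
assumption that covers only the basic CJK block and Extension A. As far as I
understand the SQLite FTS3 documentation, a query without MATCH is accepted on
an FTS3 table but falls back to a full table scan, so the LIKE branch may be
slow on large tables.

```python
from typing import Tuple


def contains_cjk(text: str) -> bool:
    """Return True if any character is a CJK ideograph.

    Assumption: only the basic CJK Unified Ideographs block and
    Extension A are checked; other scripts are treated as "European".
    """
    return any(
        "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"
        for ch in text
    )


def build_search_query(term: str) -> Tuple[str, Tuple[str, ...]]:
    """Return (sql, params) for a parameterized search on my_ftstable."""
    if contains_cjk(term):
        # Substring search; LIKE uses % as its wildcard, not *.
        # Note: % or _ inside the term itself is not escaped here.
        return (
            "SELECT * FROM my_ftstable WHERE field1 LIKE ?",
            ("%" + term + "%",),
        )
    # Full-text search through the FTS index.
    return (
        "SELECT * FROM my_ftstable WHERE my_ftstable MATCH ?",
        (term,),
    )
```

The returned pair can be passed to any SQLite binding that supports bound
parameters, for example cursor.execute(sql, params) in Python's sqlite3 module.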

On Tue, Sep 22, 2009 at 7:45 PM, Delphi Inspiration <delphi@xxxxxxxx> wrote:

> At 11:48 22.09.2009, Edwin Yip wrote:
>
> >I'm planning to start using FTS for a new project. However, I found
> that it cannot handle Chinese very well: unlike English, Chinese doesn't
> use spaces to split words, so nothing separates words except punctuation
> such as periods or commas. Any thoughts? Thank you.
>
> DISQLite3 implements the default SQLite3 full text search modules, namely
> FTS1, FTS2 (both deprecated) and FTS3. For word separation, they all rely on
> their built-in tokenizers targeted at European languages which use white
> space to tell words apart.
>
> For non-European languages you can set up custom tokenizers for particular
> languages, both natural and formal.
>
> The DISQLite3_Full_Text_Search demo includes an example tokenizer suitable
> for indexing Delphi / Pascal source code files. It is implemented in
> DISQLite3PascalTokenizer.pas. The code is commented and should hopefully
> serve as a base for other custom tokenizers.
>
> So the technical side is quite simple: Just write a new tokenizer module
> consisting of five callback functions. The practical side, however, is more
> difficult:
>
> "Word segmentation is a non-trivial task, and it is hard to have a "good"
> segmenter. It is almost impossible to segment a sentence perfectly. In fact
> even human has trouble to segment some ambiguous sentences." [1]
>
> I guess it will require intensive linguistic research and understanding of
> the language to build a full-fledged FTS module for Chinese.
>
> Ralf
>
> [1]
> http://projectile.sv.cmu.edu/research/public/tools/segmentation/eval/index.htm
>

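As a rough illustration of what the token-splitting step of a custom tokenizer
might do for text without spaces, here is a sketch in Python of overlapping
character bigrams. This is a substitute technique that sidesteps real word
segmentation; it is not what DISQLite3 or the built-in SQLite tokenizers do,
and it is not the five-callback tokenizer interface Ralf describes. The names
is_cjk and bigram_tokens are hypothetical, and only the basic CJK Unified
Ideographs block is treated as Chinese.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Assumption: only the basic CJK Unified Ideographs block counts."""
    return "\u4e00" <= ch <= "\u9fff"


def bigram_tokens(text: str) -> List[str]:
    """Split text into tokens.

    Runs of CJK characters become overlapping two-character tokens;
    runs of other letters and digits become one lowercased token;
    everything else (spaces, punctuation) separates tokens.
    """
    tokens: List[str] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if is_cjk(ch):
            j = i
            while j < n and is_cjk(text[j]):
                j += 1
            run = text[i:j]
            if len(run) == 1:
                tokens.append(run)  # a lone character is its own token
            else:
                tokens.extend(run[k:k + 2] for k in range(len(run) - 1))
            i = j
        elif ch.isalnum():
            j = i
            while j < n and text[j].isalnum() and not is_cjk(text[j]):
                j += 1
            tokens.append(text[i:j].lower())
            i = j
        else:
            i += 1  # skip separators
    return tokens
```

The same splitting would have to be applied to the query string, so that a
search for a two-character word turns into a lookup of exactly that bigram.
Bigrams tend to produce false matches across word boundaries, which is the
price for not doing linguistic segmentation.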

-- 
Best Regards,
Edwin Yip

Mind Mapping is as Effortless as Typing
http://www.InnovationGear.com
