[yunqa.de] Re: Advice for tokenizer code

  • From: "Jon Burnham" <jba@xxxxxxxxxxxxxx>
  • To: <yunqa@xxxxxxxxxxxxx>
  • Date: Sun, 27 Feb 2011 13:40:00 -0000

Thanks Ralf - this is extremely helpful. More than you had to do :-)

Just one thing: you have not mentioned whether your variety of tools could
be used to tokenize a piece of text into words and phrases efficiently.

- Sorry to push my luck.

JB



-----Original Message-----
From: yunqa-bounce@xxxxxxxxxxxxx [mailto:yunqa-bounce@xxxxxxxxxxxxx] On
Behalf Of Delphi Inspiration
Sent: 27 February 2011 13:28
To: yunqa@xxxxxxxxxxxxx
Subject: [yunqa.de] Re: Advice for tokenizer code

On 27.02.2011 12:24, Jon Burnham wrote:

> I have huge lists of URLs that I need to analyse in various ways. This has
> to be done at very high speed.
> 
> Textanz does word and phrase tokenization and frequency counting with a
> stop list (as well as concordance and dispersion).

This looks to me like just another Concordance or Key Word in Context tool:

  http://en.wikipedia.org/wiki/Concordance_(publishing)
  http://en.wikipedia.org/wiki/Key_Word_in_Context

> I used to use a varied selection of pre-built string libraries (e.g.
> Hyperstring, Faststrings) for D7, I am now on my own doing this with
> Unicode and XE.

In my experience, custom text parsing implementations are generally much
faster and more flexible than using pre-built string libraries. You might
also consider parser generators, even though they tend to handle formal
languages better than natural ones.

This article contains useful information and links:

  http://en.wikipedia.org/wiki/Natural_language_processing
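
To make the custom parsing suggestion above a little more concrete, here is
a minimal sketch of a hand-written tokenizer with a stop list and word and
phrase frequency counting. It is written in Python rather than Delphi purely
for brevity. The names tokenize, count_terms and STOP_WORDS and the stop
list contents are made up for illustration, and treating every run of
letters and digits as one token is only an assumption about what counts as
a word in a URL.

```python
from collections import Counter

# Hypothetical stop list; a real one depends on the data being analysed.
STOP_WORDS = {"http", "https", "www", "com", "the", "and"}


def tokenize(text):
    """Yield lowercased runs of letters/digits found in one left-to-right scan."""
    start = None
    for i, ch in enumerate(text):
        if ch.isalnum():
            if start is None:
                start = i          # a new token begins here
        elif start is not None:
            yield text[start:i].lower()
            start = None           # token ended at a separator
    if start is not None:
        yield text[start:].lower() # token running to the end of the text


def count_terms(lines, stop_words=STOP_WORDS, phrase_len=2):
    """Count single words and phrases of phrase_len consecutive kept words."""
    words = Counter()
    phrases = Counter()
    for line in lines:
        # Stop words are dropped first, so a phrase may bridge a removed word.
        tokens = [t for t in tokenize(line) if t not in stop_words]
        words.update(tokens)
        phrases.update(
            " ".join(tokens[i:i + phrase_len])
            for i in range(len(tokens) - phrase_len + 1)
        )
    return words, phrases
```

Calling count_terms on a list of URLs returns two counters, and
words.most_common(10) would then list the ten most frequent words. The same
single-pass scan translates fairly directly to a Delphi loop over a
UnicodeString.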

> So my question is, before I choose, should I use reg-ex (too slow? too
> inflexible for this work? inputs too difficult) - or should I start from
> scratch with another approach?

I am not sure how much detail you need, but the DISQLite3 Full Text
Search module comes to mind. If you store your text in an FTS4 virtual
table, you can use the snippet() SQL function to retrieve a given
word in context very easily.
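
As a rough illustration of the FTS4 idea, the sketch below uses Python's
built-in sqlite3 module as a stand-in for DISQLite3, on the assumption that
the underlying SQLite library was compiled with FTS4 support. The table
name docs, the column name body and the helper keyword_in_context are made
up for the example. The SQL statements themselves are ordinary SQLite FTS4
syntax.

```python
import sqlite3


def keyword_in_context(texts, term):
    """Return snippet() output for every stored text that matches term."""
    con = sqlite3.connect(":memory:")
    try:
        # One FTS4 virtual table with a single full-text indexed column.
        con.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
        con.executemany(
            "INSERT INTO docs(body) VALUES (?)", ((t,) for t in texts)
        )
        # snippet() wraps each matched term in the given start/end markers.
        rows = con.execute(
            "SELECT snippet(docs, '[', ']', '...') "
            "FROM docs WHERE docs MATCH ?",
            (term,),
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        con.close()
```

Each returned string is a fragment of the stored text with the matched word
wrapped in square brackets, which is close to what a KWIC listing shows.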

If you require more detailed information, why not look at how existing
concordancer or KWIC implementations handle the task? I am not aware of
any Delphi projects, but there are several related open source projects
in other programming languages which might be worth skimming for concepts
and ideas.

Ralf
_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa



