[yunqa.de] Re: Advice for tokenizer code

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Sun, 27 Feb 2011 14:27:59 +0100

On 27.02.2011 12:24, Jon Burnham wrote:

> I have huge lists of URLs that I need to analyse in various ways. This has
> to be done at very high speed.
> 
> Textanz does word and phrase tokenization and frequency counting with a stop
> list (as well as concordance and dispersion).

This looks to me like just another Concordance or Key Word in Context tool:

  http://en.wikipedia.org/wiki/Concordance_(publishing)
  http://en.wikipedia.org/wiki/Key_Word_in_Context

> I used to use a varied selection of pre-built string libraries (e.g.
> Hyperstring, Faststrings) for D7, I am now on my own doing this with Unicode
> and XE.

In my experience, custom text parsing implementations are generally much
faster and more flexible than pre-built string libraries. You might also
consider parser generators, even though they tend to handle formal
languages better than natural ones.
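
For the tokenizing itself, a hand-rolled scanner need not be complicated.
Below is a minimal, untested sketch for Delphi XE which simply treats every
run of letters or digits as a token (the TCharacter helpers in the Character
unit handle the Unicode classification); for URLs you would of course adjust
what counts as a token character:

  uses
    Classes, Character;

  { Collect every run of letters/digits from S into Tokens. }
  procedure TokenizeWords(const S: string; Tokens: TStrings);
  var
    I, Start: Integer;
  begin
    I := 1;
    while I <= Length(S) do
    begin
      { skip separator characters }
      while (I <= Length(S)) and not TCharacter.IsLetterOrDigit(S[I]) do
        Inc(I);
      { collect the next token }
      Start := I;
      while (I <= Length(S)) and TCharacter.IsLetterOrDigit(S[I]) do
        Inc(I);
      if I > Start then
        Tokens.Add(Copy(S, Start, I - Start));
    end;
  end;

From there, frequency counting is a matter of feeding the tokens into a
TDictionary<string, Integer> from Generics.Collections, applying your stop
list as you go.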

This article contains useful information and links:

  http://en.wikipedia.org/wiki/Natural_language_processing

> So my question is, before I choose, should I use reg-ex (too slow? too
> inflexible for this work? inputs too difficult) - or should I start from
> scratch with another approach?

I am not sure how much detail you need, but the DISQLite3 Full Text
Search module comes to mind. If you store your text in an FTS4 virtual
table, you can use the snippet() SQL function to retrieve a given word
in context very easily.
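
In rough terms, the SQL would look like the following (table and column
names are placeholders only; the statements are standard SQLite FTS4 syntax
and would be executed through DISQLite3 in whichever way you prefer):

  CREATE VIRTUAL TABLE pages USING fts4(url, body);

  INSERT INTO pages(url, body)
    VALUES('http://www.example.com/', 'text of the page ...');

  -- snippet() returns the matched words with surrounding context,
  -- wrapped in <b>...</b> markers by default.
  SELECT url, snippet(pages) FROM pages WHERE pages MATCH 'tokenizer';

The MATCH operator goes through the full text index, so queries stay fast
even on large amounts of text.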

If you require more detailed information, why not look at how existing
concordancer or KWIC implementations handle the task? I am not aware of
any Delphi projects, but there are several related open source projects
in other programming languages which might be worth skimming for
concepts and ideas.

Ralf
_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa


