Thanks Ralf - this is extremely helpful. More than you had to do :-)

Just one thing: you have not mentioned whether your variety of tools could be used to tokenize a piece of text into words and phrases efficiently.

- Sorry to push my luck.

JB

-----Original Message-----
From: yunqa-bounce@xxxxxxxxxxxxx [mailto:yunqa-bounce@xxxxxxxxxxxxx] On Behalf Of Delphi Inspiration
Sent: 27 February 2011 13:28
To: yunqa@xxxxxxxxxxxxx
Subject: [yunqa.de] Re: Advice for tokenizer code

On 27.02.2011 12:24, Jon Burnham wrote:

> I have huge lists of URLs that I need to analyse in various ways. This has
> to be done at very high speed.
>
> Textanz does word and phrase tokenization and frequency counting with a stop
> list (as well as concordance and dispersion).

This looks to me like just another Concordance or Key Word in Context tool:

http://en.wikipedia.org/wiki/Concordance_(publishing)
http://en.wikipedia.org/wiki/Key_Word_in_Context

> I used to use a varied selection of pre-built string libraries (e.g.
> Hyperstring, Faststrings) for D7, I am now on my own doing this with Unicode
> and XE.

In my experience, custom text parsing implementations are generally much faster and more flexible than pre-built string libraries. You might also consider parser generators, even though they tend to handle formal languages better than natural ones. This article contains useful information and links:

http://en.wikipedia.org/wiki/Natural_language_processing

> So my question is, before I choose, should I use reg-ex (too slow? too
> inflexible for this work? inputs too difficult) - or should I start from
> scratch with another approach?

I am not sure how much detail you need, but the DISQLite3 Full Text Search module comes to mind. If you store your text in an FTS4 virtual table, you can use the snippet() SQL function to retrieve a given word in context very easily.

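To make the custom-parsing route mentioned above more concrete, here is a minimal sketch of a hand-written, single-pass tokenizer with a stop list, word and phrase frequency counting, and a simple word-in-context lookup. It is written in Python rather than Delphi purely for brevity, it has nothing to do with the DISQLite3 API, and it has not been benchmarked. All names (tokenize, count_words, count_phrases, kwic, STOP_WORDS) and the tiny stop list are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical stop list for illustration; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def tokenize(text):
    """Yield lowercase words, scanning the text once, character by character."""
    start = None  # index where the current word began, or None between words
    for i, ch in enumerate(text):
        if ch.isalnum():
            if start is None:
                start = i
        elif start is not None:
            yield text[start:i].lower()
            start = None
    if start is not None:  # flush a word that runs to the end of the text
        yield text[start:].lower()


def count_words(tokens, stop_words=STOP_WORDS):
    """Count word frequencies, skipping stop words."""
    return Counter(t for t in tokens if t not in stop_words)


def count_phrases(tokens, n=2, stop_words=STOP_WORDS):
    """Count n-word phrases that neither start nor end with a stop word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        phrase = tokens[i:i + n]
        if phrase[0] in stop_words or phrase[-1] in stop_words:
            continue
        counts[" ".join(phrase)] += 1
    return counts


def kwic(tokens, keyword, width=2):
    """Return each occurrence of keyword with `width` words of context per side."""
    return [
        " ".join(tokens[max(0, i - width):i + width + 1])
        for i, tok in enumerate(tokens)
        if tok == keyword
    ]


sample = "The quick brown fox jumps over the lazy dog. The quick brown cat sleeps."
words = list(tokenize(sample))
print(count_words(words).most_common(3))
print(count_phrases(words).most_common(2))
print(kwic(words, "brown"))
```

The scanner treats every non-alphanumeric character as a separator, so a URL such as http://en.wikipedia.org/wiki/Key_Word_in_Context splits into its host and path components. Whether that is the right rule for URL analysis depends on the data, and the same single-loop structure should carry over to a Delphi implementation working on Unicode strings.
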
If you require more detailed information, why not look at how existing concordancer or KWIC implementations handle the task? I am not aware of any Delphi projects, but there are several related open source projects in other programming languages which might be worth skimming for concepts and ideas.

Ralf

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa