On 27.02.2011 12:24, Jon Burnham wrote:

> I have huge lists of URLs that I need to analyse in various ways. This
> has to be done at very high speed.
>
> Textanz does word and phrase tokenization and frequency counting with a
> stop list (as well as concordance and dispersion).

This looks to me like just another Concordance or Key Word in Context tool:

  http://en.wikipedia.org/wiki/Concordance_(publishing)
  http://en.wikipedia.org/wiki/Key_Word_in_Context

> I used to use a varied selection of pre-built string libraries (e.g.
> Hyperstring, Faststrings) for D7. I am now on my own doing this with
> Unicode and XE.

In my experience, custom text parsing implementations are generally much faster and more flexible than pre-built string libraries. You might also consider parser generators, although they tend to handle formal languages better than natural ones. This article contains useful information and links:

  http://en.wikipedia.org/wiki/Natural_language_processing

> So my question is, before I choose, should I use reg-ex (too slow? too
> inflexible for this work? inputs too difficult) - or should I start from
> scratch with another approach?

I am not sure how much detail you need, but the DISQLite3 Full Text Search module comes to mind. If you store your text in an FTS4 virtual table, you can use the snippet() SQL function to retrieve a given word in context very easily.

If you require more detailed information, why not look at how existing concordancer or KWIC implementations handle the task? I am not aware of any Delphi projects, but there are several related open source projects in other programming languages which might be worth skimming for concepts and ideas.

Ralf

_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa
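P.S. A minimal sketch of the FTS4 snippet() idea mentioned above. Python's built-in sqlite3 module is used here purely for illustration, since DISQLite3 is a Delphi library; the SQL itself is the same in both. The table and column names are made up for the example, and an SQLite build with FTS4 enabled is assumed (true for most standard distributions):

```python
import sqlite3

# Keyword-in-context via SQLite FTS4: store text in a virtual table,
# then let snippet() return each hit marked up inside its context.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
conn.execute(
    "INSERT INTO docs(body) VALUES (?)",
    ("The quick brown fox jumps over the lazy dog.",),
)

# snippet(table, start_match, end_match, ellipsis) brackets each match
# and trims the surrounding text to a short context window.
row = conn.execute(
    "SELECT snippet(docs, '[', ']', '...') FROM docs WHERE body MATCH ?",
    ("fox",),
).fetchone()
print(row[0])  # the matched word appears bracketed within its context
```

For a stop list, simply filter the candidate query terms before running the MATCH; for frequency counting, a plain GROUP BY over a tokenized table works alongside the same data.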