[yunqa.de] Advice for tokenizer code

  • From: "Jon Burnham" <jba@xxxxxxxxxxxxxx>
  • To: <yunqa@xxxxxxxxxxxxx>
  • Date: Sun, 27 Feb 2011 11:24:59 -0000

Dear Ralf

Hope you are well and that the coming of Spring brings as much relief to you
as it does to me :-)


I have a question which is taking a bit of a liberty. It's not strictly a
support question - more advice. But since you are a Delphi guru, I hope you
don't mind.

I have been looking at using your tools to build the equivalent of a program
like this:

http://www.cro-code.com/textanz.jsp

I have huge lists of URLs that I need to analyse in various ways. This has
to be done at very high speed.

Textanz does word and phrase tokenization and frequency counting with a stop
list (as well as concordance and dispersion).

I used to use a varied selection of pre-built string libraries (e.g.
Hyperstring, Faststrings) for D7, but I am now on my own doing this with
Unicode and XE.

So my question is, before I choose: should I use regex (too slow? too
inflexible for this work? inputs too difficult?) - or should I start from
scratch with another approach?

I thought that your container libraries might be useful for sorting and
counting. I will be using the HTML parser obviously.

But the tokenizing is quite a problem, as it needs to be very fast, easy to
repeat and flexible.
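
To make the requirement concrete, here is a minimal sketch of the kind of
pass meant above: a single left-to-right scan that splits text into words,
drops stop-list entries, and counts words and phrases. It is written in
Python only for brevity (the real target is Delphi XE). The stop list, the
function names and the "alphanumeric run = word" rule are all assumptions
for illustration, not anything taken from Textanz or the DI libraries.

```python
from collections import Counter

# Hypothetical stop list, for illustration only.
STOP_WORDS = frozenset({"the", "a", "of", "and", "to"})


def tokenize(text):
    """Yield lowercase words from one left-to-right scan (no regex).

    Assumption: a word is a maximal run of alphanumeric characters.
    """
    start = None
    for i, ch in enumerate(text):
        if ch.isalnum():
            if start is None:
                start = i          # a new word begins here
        elif start is not None:
            yield text[start:i].lower()
            start = None
    if start is not None:          # flush a word that ends the text
        yield text[start:].lower()


def count_words(text, stop_words=STOP_WORDS):
    """Count word frequencies, skipping stop-list entries."""
    return Counter(w for w in tokenize(text) if w not in stop_words)


def count_phrases(text, n=2, stop_words=STOP_WORDS):
    """Count n-word phrases built from the words left after stop-list removal."""
    words = [w for w in tokenize(text) if w not in stop_words]
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
```

Changing what counts as a word only touches the one test inside tokenize,
which is the flexibility that seems harder to get from a fixed regex.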

What would you do?

Kind regards

Jon



-----Original Message-----
From: yunqa-bounce@xxxxxxxxxxxxx [mailto:yunqa-bounce@xxxxxxxxxxxxx] On
Behalf Of Delphi Inspiration
Sent: 31 December 2010 17:55
To: yunqa@xxxxxxxxxxxxx
Subject: [yunqa.de] Re: Installing latest DiRegEx with DiHtmlParser

On 30.12.2010 18:20, Jim Bretti wrote:

> I'm trying to install DiHtmlParser (version 5.2.0) with DiRegEx (version
> 5.3.1), and getting an error when compiling DiHtmlParser.

First of all, my apologies!

> The error is "Unit DIContainers was compiled with a different version of
> DITypes.PAnsiStringBase08".

Short answer:

I will release a DIHtmlParser compatibility update very shortly which
corrects the type incompatibility.


Long answer + background info:

Problems of this nature may arise because some of my products (including the
ones you mention) share the same units with other products. I believe that
this is generally a good thing because it avoids redundancies and keeps
applications small.

Unfortunately, Delphi is very picky about this. If such shared units are not
100% identical, Delphi sometimes complains about a "type incompatibility"
even though that type definition did not change at all. At least I do not
see any changes in the example given and do not understand what causes this
particular incompatibility.

To minimise potential clashes, I try to change these common units as little
as possible and avoid any type incompatibilities. But I still do not know
when and why Delphi will see two types as incompatible. Sometimes the errors
even show up only for pre-compiled *.dcu units but not for *.pas source
files. This makes it even harder to predict conflicts between multiple
products.


Here is some general advice if you experience difficulties installing
multiple DI products into the Delphi IDE:

a) Uninstall all previously installed DI products from your Delphi IDE and
remove their files from your Delphi search path, either by deleting the
search path entry or by deleting the files.

b) Use the latest versions only.

c) Extract all packages into the same root folder, in the order oldest to
newest (so newer files overwrite older ones).

With all packages extracted, open any one of the *.dpk files suitable for
your Delphi version. Let's work with DIHtmlParser_D7.dpk right now. Next add
to this package the registration files for the remaining products, which, in
your case, is just one:

* DIRegEx_Reg.pas

Now this package contains both of your products. It should compile well and
after installation you should see the component icons in your IDE.

Ralf
_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa

