[yunqa.de] Re: Read/Write compressed streams

  • From: Rolf Lampa <rolf.lampa@xxxxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Thu, 03 Jul 2008 12:38:58 +0200



Delphi Inspiration wrote:
Rolf Lampa wrote:

... I wonder if there are any VERY fast TFileStream based readers/writers out there which can optionally read .bz2.
As an example, please find attached the BZip2 uncompress xmlParserInputBuffer which the WikiTaxi Importer uses to feed the compressed Wikipedia XML dumps to the XML parser. Please substitute the DIBZip2Api.pas with your favourite BZip2 Delphi implementation.
Thank you very much for your help! I'll try to get this working as soon as possible.

[Edit]: ...Regex would apply. Speed is crucial though (bringing the processing time down from something like 12 days to under 3 hours is what I'm currently working on, so...).
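Until I get the attached buffer wired in, the shape I have in mind is simply a block-reading loop over a TStream, so that a decompressing stream (DIBZip2Api-based or any other BZip2 binding) can be dropped in where a plain TFileStream sits today. A rough sketch only, using the standard ZLib unit's TDecompressionStream as a stand-in (Delphi ships no BZip2 stream class), with ProcessChunk as a placeholder for feeding the XML parser:

  uses
    SysUtils, Classes, ZLib;

  procedure ProcessChunk(const Buf; Count: Integer);
  begin
    // Placeholder: push the decompressed block into the XML parser's
    // input buffer here.
  end;

  procedure FeedParserFromFile(const AFileName: string);
  const
    ChunkSize = 64 * 1024; // read in 64 KB blocks
  var
    Source: TFileStream;
    Decomp: TStream;
    Buffer: array[0..ChunkSize - 1] of Byte;
    BytesRead: Integer;
  begin
    Source := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
    try
      // TDecompressionStream handles zlib data; a BZip2 stream class
      // (from DIBZip2Api.pas or another binding) would replace it here.
      Decomp := TDecompressionStream.Create(Source);
      try
        repeat
          BytesRead := Decomp.Read(Buffer, ChunkSize);
          if BytesRead > 0 then
            ProcessChunk(Buffer, BytesRead);
        until BytesRead = 0;
      finally
        Decomp.Free;
      end;
    finally
      Source.Free;
    end;
  end;

The point being that the parser never sees the compression at all; it just reads blocks from whatever TStream it is handed.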

Less than 3 hours is very reasonable. To give yet another example: The WikiTaxi Importer completes the job in less than 2 hours for the English WikiPedia on a recent laptop system, performing these steps: read, uncompress, parse XML, recompress, store to database. 
This sounds very good.

As for me, I don't store to a database, but in less than an hour, on a Dual2 laptop, I read and write unzipped raw XML (currently ~17 GB), and along the way I:

- populate a pool of internal TMediawikiPage objects with the text and page properties (for advanced text manipulations),
- insert some markup at the beginning of the text,
- perform filtering and extraction on pages (select or skip on date, users, namespaces and intervals; inject or extract text; search/replace; title & category skiplists; dirty-word skiplists; plus many other small features such as various tidy-ups (regex), counting links, logging strange-looking syntax, and so on),
- generate SQL tables (so far only the "big three" tables, i.e. the page, revision and text tables; other tables are on the todo list).
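Roughly, the page objects and the filter step look something like the following (the declarations below are only illustrative, not my actual code):

  uses
    Classes;

  type
    TMediawikiPage = class
    public
      PageId: Integer;
      Namespace: Integer;
      Title: string;
      Timestamp: TDateTime;
      Contributor: string;
      Text: string;
    end;

  // Decide whether a page survives the filter pass: select on a
  // namespace whitelist, then skip anything on the title skiplist.
  function KeepPage(Page: TMediawikiPage; const Namespaces: array of Integer;
    SkipTitles: TStrings): Boolean;
  var
    i: Integer;
  begin
    Result := False;
    for i := Low(Namespaces) to High(Namespaces) do
      if Page.Namespace = Namespaces[i] then
      begin
        Result := True;
        Break;
      end;
    if Result and (SkipTitles.IndexOf(Page.Title) >= 0) then
      Result := False;
  end;

Only the pages that survive this pass go on to the text manipulations and the SQL generation.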

== Sinking ship ==

But it's when I check the option "Expand templates" that the processing starts to take time...

But, strangely enough, the Swedish Wikipedia (~450,000 pages) takes only about 7 minutes (~950 pgs/sec, and that's really, really fast!), while the enwiki drops "through the floor" to 4-5 pgs/sec and never gets above 10 pgs/sec, which means it takes many days... I really don't know why this is, but obviously there's a bottleneck with exponential characteristics somewhere. I have not managed to spot the problem yet, so I'm still a bit puzzled about it (not even ProDelphi gave me a pointer as to where the bottleneck is <scratching head>). Oh well.
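One thing I still intend to try is a crude throughput probe per page, just to see whether the enwiki rate is low from the very first pages or only collapses somewhere deep into the dump (which would point at something accumulating). Something along these lines (WriteLn just goes to the console in this sketch):

  uses
    Windows, SysUtils;

  // Call once per processed page; every ReportEvery pages it prints
  // the rate over that stretch.
  procedure NotePageDone(var Count: Integer; var LastTick: Cardinal;
    ReportEvery: Integer);
  var
    NowTick: Cardinal;
  begin
    Inc(Count);
    if Count mod ReportEvery = 0 then
    begin
      NowTick := GetTickCount;
      if NowTick > LastTick then
        WriteLn(Format('%d pages, %.1f pgs/sec',
          [Count, ReportEvery * 1000 / (NowTick - LastTick)]));
      LastTick := NowTick;
    end;
  end;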

Three Questions:

Q1: Do you have a "bundle price" for your code libraries, kind of like a "DISuite"? It seems like several of your libraries would be very useful in my project(s) (tools for manipulating MediaWiki data).

Q2: Has anyone made a SQLite persistence adapter for Bold? (Making a persistence adapter is a bit too advanced for me, I think, but perhaps someone else has done it?) SQLite seems like a speedy little nasty thing, and it looks like just what I'd need with Bold to speed it up a bit. (Btw, I made a complete enterprise system with Bold (transport logistics), with ~350 classes, which went live in 2003, is still up and running, and is still being extended by a team of developers; IB backend, > 15 GB, and growing exponentially.) Since Bold has proven to scale, perhaps SQLite could also handle a huge *Bold* system, even enwiki data stored in Bold objects? (My complex Bold systems typically have a great many relations and lots of indexed tables, which can make saving large numbers of objects a bit slow, but then again they can be amazingly fast at performing complex logic: the system mentioned does a great deal of complex realtime calculation.)

Q3: I use TPerlRegEx in places in my app, but it's slow and it lacks Unicode support. What is your comment on speed and Unicode support compared to TPerlRegEx, given the amounts of data to process in the worst-case scenario, enwiki...?
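To give an idea of what that worst case looks like, below is the kind of per-page tidy-up I run with TPerlRegEx today (property names as I recall them from the version I use, so take it as a sketch); it is this sort of call, repeated over the full enwiki text, where the difference in speed and Unicode handling would really show:

  uses
    SysUtils, PerlRegEx;

  // Collapse runs of three or more newlines into a single blank line.
  function CollapseBlankLines(const PageText: string): string;
  var
    RE: TPerlRegEx;
  begin
    RE := TPerlRegEx.Create(nil);
    try
      RE.RegEx := '\n{3,}';
      RE.Replacement := #10#10;
      RE.Subject := PageText;
      RE.ReplaceAll;            // Subject now holds the tidied text
      Result := RE.Subject;
    finally
      RE.Free;
    end;
  end;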

Regards,

// Rolf Lampa
_______________________________________________
Delphi Inspiration mailing list
yunqa@xxxxxxxxxxxxx
//www.freelists.org/list/yunqa
