[proteomics] Strategy for Bioinformatics Data Management

  • From: Mavi Gozler <mavigozler@xxxxxxxxx>
  • To: proteomics@xxxxxxxxxxxxx
  • Date: Tue, 26 Dec 2006 22:49:16 -0800 (PST)

The people who have been reading my last couple of posts to this group will 
certainly be aware of my frustrations and irritations with PLGS.

Waters/Micromass concedes my points, although not in so many words.  If any of 
you have worked with PLGS, especially with single projects that have 
accumulated a couple of hundred megabytes of data (all XML-formatted files under 
the PLGS 'root', I believe), you probably have seen PLGS take 10-20 minutes 
just to load this data into the host before you can even view or manipulate it.

This makes me wonder about the programming approach to the PLGS application.  I 
have done a fair amount of coding in my past---someone actually was insensible 
enough to pay me to do it!

PLGS and other bioinformatics applications like it end up with tons of data 
obviously, especially if there is MS and MS/MS data.  I only have MALDI MS data 
on target plates at this time, but the size of the PK-zipped archive exported 
by PLGS is now 220 MB!!!  Yes, that is the compressed archive...which 7Zip in 
ultra compression mode can squeeze down maybe 2% more!

So my question is: with THAT much data, PLGS should not be managing it itself, 
right?  It should be a wrapper that calls the API of a reliable database engine 
like Berkeley DB or maybe MySQL, and leaves that engine to worry about the 
speed and reliability of data management.  PLGS should instead be 

What approach do other bioinformatics interfaces use when they are faced with 
MEGA- or GIGAbytes of data?  My natural reaction is to set up an app that 
employs a reliable open-source db system...no need to re-invent the wheel.  
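
To make that concrete, here is a minimal sketch of the kind of thing I mean. It uses SQLite through Python's standard sqlite3 module rather than Berkeley DB or MySQL, purely so the example is self-contained. The table layout, the function names (open_store, add_spectrum, peaks_in_range) and the peak values are all made up for illustration; none of this reflects how PLGS actually stores its data.

```python
import sqlite3

# Hypothetical schema: one row per spectrum, one row per peak.
# The index lets the engine fetch one spectrum's peaks without scanning the rest.
SCHEMA = """
CREATE TABLE IF NOT EXISTS spectrum (
    id     INTEGER PRIMARY KEY,
    sample TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS peak (
    spectrum_id INTEGER NOT NULL REFERENCES spectrum(id),
    mz          REAL NOT NULL,
    intensity   REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS peak_by_spectrum_mz ON peak (spectrum_id, mz);
"""


def open_store(path=":memory:"):
    """Open (or create) the database file and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_spectrum(conn, sample, peaks):
    """Store one spectrum and its (m/z, intensity) pairs in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO spectrum (sample) VALUES (?)", (sample,))
        spectrum_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO peak (spectrum_id, mz, intensity) VALUES (?, ?, ?)",
            [(spectrum_id, mz, intensity) for mz, intensity in peaks],
        )
    return spectrum_id


def peaks_in_range(conn, spectrum_id, mz_lo, mz_hi):
    """Return only the peaks of one spectrum inside an m/z window, sorted by m/z."""
    rows = conn.execute(
        "SELECT mz, intensity FROM peak "
        "WHERE spectrum_id = ? AND mz BETWEEN ? AND ? ORDER BY mz",
        (spectrum_id, mz_lo, mz_hi),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_store()  # in-memory here; pass a file path for a real project
    sid = add_spectrum(conn, "plate1_A1", [(842.51, 1200.0), (1045.56, 860.0)])
    print(peaks_in_range(conn, sid, 800.0, 900.0))
```

The point is that the interface only asks for the spectrum it is about to display, and the database engine takes care of indexing and on-disk layout, instead of the application parsing every file in the project before anything can be viewed.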

Waters/Micromass says it has no plans to alter the user interface although I 
have sent unsolicited comments as to why I think they should alter it.  As to 
the speed question, I also doubt they will address the matter or divulge even a 
summary of the approach they took to coding this eyesore.

I have read comments in other forums that say Waters/Micromass "software" (not 
sure if that includes PLGS) is not the worst out there, which falls into the 
class of faint praise.  And I have been told that applications like PLGS are 
"the best out there."  If indeed that claim is even half-true, it shows what a 
sorry state bioinformatics software is in with respect to the characteristics 
of ease-of-use and high throughput, and it means we are still a software 
generation or two away from having the software we should have.  And I assume 
in that statement that the user is not at all freed from the burden of moving 
data along from one processing point to the other (heavily user-supervised 
pipelines rather than fully automated pipelines).


