[mira_talk] Re: Lots of contigs, then segmentation fault

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 3 May 2011 21:16:31 +0200

On Tuesday 19 April 2011 18:14:47 Egon Ozer wrote:
> I'd be happy to provide my data to you for testing.  Do you want the sff
> files or my extracted fasta, qual, and xml files for the 454 data?

Hello Egon,

your data set made MIRA (and me) sweat quite a lot, actually. It's not so 
much that version 3.2.1 crashed on it, but that my newer development version, 
while not crashing, performed ... really not well: way too many contigs for my 
liking.

I've been busy over the weekend trying to understand why MIRA absolutely did 
not like that data set, and I found the reason: it looks like this paired-end 
FLX data contains a lot more false duplicates than I have ever seen up to 
now. These false duplicates contain, I think, PCR artefacts ... and these 
"sequencing errors" make MIRA believe that there are repeats and/or ploidy 
differences.

I had to develop a couple of new algorithms to deal with this kind of thing. 
Not everything I thought of has been implemented yet, but I think the 
improvements are already good enough to test. E.g., here are the results of 
3.2.1.15:

  Number of contigs:    116
  Largest contig:       893586
  N50 contig size:      172613
  N90 contig size:      34046
  N95 contig size:      21118

and here for my current development version:

  Number of contigs:    75
  Largest contig:       901116
  N50 contig size:      397873
  N90 contig size:      108334
  N95 contig size:      52586

That cut the number of contigs by about a third and more than doubled the 
N50. Taking then a hybrid assembly with your 454 and Solexa data, I get this:

  Number of contigs:    55
  Largest contig:       894849
  N50 contig size:      588120
  N90 contig size:      139889
  N95 contig size:      62263

Compared with 3.2.1.15, the number of contigs was more than halved and the 
N50/N90/N95 numbers roughly tripled.
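
For anyone who wants to recompute such numbers from a list of contig 
lengths, here is a minimal sketch of the usual Nxx definition: sort the 
contigs from largest to smallest and report the length of the contig at 
which the running sum first reaches xx% of the total. This is an 
illustration, not code from MIRA; the function name nxx and the toy lengths 
are made up, and MIRA's own statistics may handle details (e.g. which contigs 
are counted) differently.

```python
def nxx(lengths, fraction):
    """Return the Nxx contig size for the given contig lengths.

    Assumed definition: the length L of the contig at which the running sum
    of contig lengths, taken from largest to smallest, first reaches
    `fraction` of the total assembly size.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)

    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # only reachable through floating-point rounding


# Hypothetical toy lengths, NOT taken from the data set discussed here.
toy_contigs = [100, 80, 60, 40, 20]
print("N50:", nxx(toy_contigs, 0.50))
print("N90:", nxx(toy_contigs, 0.90))
```

Fewer, longer contigs push these values up, which is why N50 is read 
together with the contig count in the tables above.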

The next release on SourceForge will contain those enhancements (but that can 
take a week or two). Contact me if you want to test the current head of the 
development tree before that :-)

B.
