[mira_talk] Re: Lots of contigs, then segmentation fault

  • From: Egon Ozer <e-ozer@xxxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 3 May 2011 15:04:36 -0500

Wow.  So it wasn't just me...

That's a very interesting result.  The 454 reads weren't from some fly-by-night 
sequencing provider, either.  Definitely one of the big sequencing centers that 
did those for me.  I'd be worried that my organism is expected to be somewhat 
repeat-heavy and a fair number of those duplicates might be real, but the 
assembly numbers you were able to get look so good that I'm less suspicious.  
Of course I'll have to look at my own assembly and/or the results of your 
hybrid run to really see if it's believable for my bacterium... 

Thanks for looking at this for me.  I was starting to get kind of despondent 
about the whole thing and you've restored my hope in sequencing :)

If you're offering your development version that gave you the assembly numbers 
you sent me, I wouldn't mind taking a crack at it here on my home-base CPU.  
Otherwise, either sending me the caf and/or fasta + qual files from your hybrid 
assembly would at least give me a chance to move forward on my project a bit 
until the Sourceforge version is up and I can repeat the assembly.  Other- 
otherwise, if you have any suggestions for me to manually remove the excessive 
duplicates from my 454 library so I can run it through Mira 3.2.1.15 and maybe 
improve the assembly, that might be quite helpful as well.

Thanks again for all the help.  Above and beyond the call of duty.  If I have 
any more kids I'll consider naming them Bastien. 

- E

P.S.  It seems if I run Mira without messing with any of the output files 
before it's finished (i.e. not opening up the log file in TextEdit to see how 
far along it is), I don't get the seg fault problem.  Seems weird that that 
would be the cause of the problem, but maybe I just need to leave it alone and 
let it do its job in peace.


On May 3, 2011, at 2:16 PM, Bastien Chevreux wrote:

> On Tuesday 19 April 2011 18:14:47 Egon Ozer wrote:
> > I'd be happy to provide my data to you for testing.  Do you want the sff
> > files or my extracted fasta, qual, and xml files for the 454 data?
> 
> Hello Egon,
> 
> your data set made MIRA (and me) sweat, actually, quite a lot. It's not so 
> much that version 3.2.1 crashed on it, but that my newer development version, 
> while not crashing, performed ... really not well: way too many contigs for 
> my liking.
> 
> I've been busy over the weekend trying to understand why MIRA absolutely 
> did not like that data set, and found the reason: it looks like this 
> paired-end FLX data contains a lot more false duplicates than I have ever 
> seen up to now. These false duplicates contain, I think, PCR artefacts ... 
> and these "sequencing errors" lead MIRA to believe that there are repeats 
> and/or ploidy differences.
> 
> I had to develop a couple of new algorithms to deal with these kinds of 
> things. Not everything I thought of has been implemented yet, but already I 
> think the improvements are good enough to test. E.g., here are the results of 
> 3.2.1.15:
> 
>   Number of contigs:    116
>   Largest contig:       893586
>   N50 contig size:      172613
>   N90 contig size:      34046
>   N95 contig size:      21118
> 
> and here for my current development version:
> 
>   Number of contigs:    75
>   Largest contig:       901116
>   N50 contig size:      397873
>   N90 contig size:      108334
>   N95 contig size:      52586
> 
> Almost halved the number of contigs and N50 doubled. Taking then a hybrid 
> assembly with your 454 and Solexa data, I get this:
> 
>   Number of contigs:    55
>   Largest contig:       894849
>   N50 contig size:      588120
>   N90 contig size:      139889
>   N95 contig size:      62263
> 
> The number of contigs was more than halved and the N50/90/95 numbers tripled.
> 
> The next release on SourceForge will contain those enhancements (but can take 
> a week or two). Contact me if you want to test the current head of the 
> development tree before that :-)
> 
> B.
> 
> 
> 
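For anyone comparing the N50/N90/N95 figures quoted above against their own 
assemblies, here is a minimal sketch in Python. It assumes the common 
definition: sort contigs from longest to shortest and report the length of the 
contig at which the running total first reaches the given percentage of all 
assembled bases. Whether MIRA computes its statistics in exactly this way is 
an assumption, and the function name `nxx` and the toy contig lengths are 
made up for illustration.

```python
def nxx(contig_lengths, percent):
    """Return the Nxx contig size for a list of contig lengths.

    Assumed definition: the length of the contig at which the cumulative
    sum of lengths (longest first) first reaches `percent` of the total.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Hypothetical toy assembly, not real data.
toy_lengths = [80, 70, 50, 40, 30, 20]
n50 = nxx(toy_lengths, 50)
n90 = nxx(toy_lengths, 90)
n95 = nxx(toy_lengths, 95)
print(n50, n90, n95)
```

Under this definition a larger N50 with fewer contigs means the same bases are 
packed into longer pieces, which is why the two numbers are usually read 
together.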
