On Thursday, 3 December 2009, Alessandro Riccombeni wrote:
> I guess I am using MIRA in the wrong way.
> My dataset is composed of 512,000 454 Titanium reads, without any custom
> vector from the sequencing service. I have to assemble a fungal genome of
> around 13 MBs.
> [...]

Hi Alessandro,

regardless of what sequencing providers or Roche may tell you ... I don't
recommend coverages below 20x in 454 sequencing. 30x is a reasonably good
coverage that balances cost against the number of contigs.

Your numbers tell me you have *at most* a theoretical coverage of 16x,
probably more in the region of 14x. 'miramem' still uses the numbers Roche
gave in the early days (475), though having seen a few Titanium sets now, I
suspect that 400 bases as mean length is more accurate. [Note to self:
change that ASAP in the miramem estimator]

Even worse, the numbers appended show that MIRA estimates the average
coverage to be more like 8x or 9x at most. Which is somewhat disastrous.

> [...]
> I got around 10500-11000 large contigs, trying normal, draft and accurate.
> I am adding info from the output at the end of this message.

Something's wrong with your data, at least that's my impression.

In the simplest case you underestimated the size of the genome and it's not
13 MB, but more like 26 MB.

Then there may be the possibility of contamination: was it really just one
organism which was sequenced?

Also, your bug might be highly repetitive, which adds another possible
reason for a large number of contigs.

And last but not least ... it may be a problem of the sequencing kit used.
If your organism has high GC (>=60%) and the Titanium data was generated
with a sequencing kit delivered in the first 7 to 8 months of this year,
then you need to talk to your provider as they need to talk to Roche.

> I also tried doing a hybrid assembly with 4800 paired Sanger reads, getting
> around 9700 large contigs.
> I found out that there are vectors for the
> Forward and Reverse Sanger reads, so this is surely creating problems.
> Nonetheless, I expected getting a much better result from my 454 reads.

The 5k paired Sanger reads won't help for an assembly as fragmented as the
454 data suggests. Even if you trimmed away the sequencing vectors a bit
better: with 5k paired-end reads you can't scaffold 10k contigs.

> My question is: is there something I am (blatantly) doing wrong? What am I
> overlooking?
> It's my first approach to assembling, so please excuse me for being
> annoying.

Did your sequencing provider give you the result of a Newbler assembly of
the Titanium data, and is it equally disastrous? If so, there's nothing you
can do apart from getting more sequences, or getting the project
resequenced if it turns out to be a bad sequencing kit.

Regards,
  Bastien

PS: oh, and don't be lured into the "we could try to close with Solexa" trap
    should the provider propose it. Accept only if *they* do the assembly
    and deliver you the result free of charge :-)
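
For reference, the theoretical coverage discussed above is just the number of reads times the mean read length, divided by the genome size. The following is a minimal Python sketch of that arithmetic, not what 'miramem' or MIRA actually compute. The function name is made up for illustration, the inputs are the raw figures from this thread (512,000 reads, 400 bases mean length, 13 MB or 26 MB genome), and any loss from quality or vector clipping is ignored.

```python
def theoretical_coverage(num_reads, mean_read_length, genome_size):
    """Return the expected average coverage (x-fold), ignoring clipping."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


# Figures taken from the thread; 400 bases is the suspected Titanium mean.
NUM_READS = 512_000
MEAN_LENGTH = 400

# Genome size as stated by the original poster (13 MB).
cov_13mb = theoretical_coverage(NUM_READS, MEAN_LENGTH, 13_000_000)

# Genome size if it was underestimated by a factor of two (26 MB).
cov_26mb = theoretical_coverage(NUM_READS, MEAN_LENGTH, 26_000_000)

print(f"13 MB genome: {cov_13mb:.1f}x")
print(f"26 MB genome: {cov_26mb:.1f}x")
```

Under these assumptions the 13 MB case comes out just under 16x, while doubling the genome size halves it to just under 8x, which is close to the 8x to 9x MIRA estimated from the data. That fits the underestimated-genome-size explanation, but does not rule out the others.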