Hi Alessandro,

if they generate scaffolds, I assume you have paired ends? Scaffolds are
fine, and Roche's software probably knows Roche data best. But how many
contigs make up these 39 scaffolds? (From what you have written, I don't
assume that the 933 contigs form 39 clusters, or do they?)

On the other hand, I have a small phytoplasma genome here that really
causes problems in assembly. IMHO the real problem is the data quality,
due to a GC content of ~26% and some kind of repetitive nature. The reads
themselves are quite short, ~200 bp (Titanium data!). I'd love to have
extra Sanger or Solexa coverage ;-)

Just some comments, no solution,
Sven

2009/12/4 Alessandro Riccombeni <rikkomba@xxxxxxxxx>

> Hi,
>
> thanks for the help.
> Yes, actually I think it was mentioned that we paid half the usual price
> for this sequencing, so it was probably done at a lower coverage to
> begin with.
> The sequencing service provided a preassembled dataset as well: 39
> scaffolds and 933 contigs.
> Some info: 13 Mb, 2.38% are Ns, GC in the sequence is 36%, the largest
> scaffold is 1.9 Mb and the smallest is 3 kb.
> After my first MIRA run with the 454 data only (as they shouldn't have
> used any non-454 reads) I was quite clueless about which strategy they
> used to get 39 scaffolds where I got 11000 contigs. As I wrote, they
> didn't use any custom adaptor, so I don't know what I should do as far
> as preprocessing goes...
>
>
> On Thu, Dec 3, 2009 at 8:18 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>
>> On Thursday 03 December 2009 Alessandro Riccombeni wrote:
>> > I guess I am using MIRA in the wrong way.
>> > My dataset is composed of 512,000 454 Titanium reads, without any
>> > custom vector from the sequencing service. I have to assemble a
>> > fungal genome of around 13 Mb.
>> > [...]
>>
>> Hi Alessandro,
>>
>> regardless of what sequencing providers or Roche may tell you ... I
>> don't recommend coverages below 20x in 454 sequencing.
>> 30x is a reasonably good coverage for balancing cost against the
>> number of contigs.
>>
>> Your numbers tell me you have *at most* a theoretical coverage of 16,
>> probably more in the region of 14, as 'miramem' still uses the mean
>> read length Roche gave in the early days (475), whereas, having seen a
>> few Titanium sets now, I suspect that 400 bases is a more accurate mean
>> length. [Note to self: change that ASAP in the miramem estimator]
>>
>> Even worse, the numbers appended show that MIRA estimates the average
>> coverage to be more like 8x or 9x at most. Which is somewhat
>> disastrous.
>>
>> > [...]
>> > I got around 10500-11000 large contigs, trying normal, draft and
>> > accurate.
>> > I am adding info from the output at the end of this message.
>>
>> Something's wrong with your data, at least that's my impression. In the
>> simplest case you underestimated the size of the genome and it's not
>> 13 Mb, but more like 26 Mb.
>>
>> Then there may be the possibility of contamination: was it really just
>> one organism that was sequenced?
>>
>> Also, your bug might be highly repetitive, which is another possible
>> cause of a large number of contigs.
>>
>> And last but not least ... it may be a problem with the sequencing kit
>> used. If your organism has high GC (>=60%) and the Titanium data was
>> generated with a sequencing kit delivered in the first 7 to 8 months of
>> this year, then you need to talk to your provider, as they need to talk
>> to Roche.
>>
>> > I also tried doing a hybrid assembly with 4800 paired Sanger reads,
>> > getting around 9700 large contigs. I found out that there are vectors
>> > for the Forward and Reverse Sanger reads, so this is surely creating
>> > problems.
>> > Nonetheless, I expected to get a much better result from my 454
>> > reads.
>>
>> The 5k paired Sanger reads won't help for an assembly as fragmented as
>> the 454 data suggests.
>> Even if you trimmed away the sequencing vectors a bit better: with 5k
>> paired ends you can't scaffold 10k contigs.
>>
>> > My question is: is there something I am (blatantly) doing wrong? What
>> > am I overlooking?
>> > It's my first approach to assembling, so please excuse me for being
>> > annoying.
>>
>> Did your sequencing provider give you the result of a Newbler assembly
>> of the Titanium data, and is it equally disastrous? Then there's
>> nothing you can do apart from getting more sequences or getting the
>> project resequenced if it turns out to be a bad sequencing kit.
>>
>> Regards,
>> Bastien
>>
>> PS: oh, and don't be lured into the "we could try to close with Solexa"
>> trap should the provider propose it. Accept only if *they* do the
>> assembly and deliver you the result free of charge :-)
>>
>>
>> --
>> You have received this mail because you are subscribed to the mira_talk
>> mailing list. For information on how to subscribe or unsubscribe, please
>> visit http://www.chevreux.org/mira_mailinglists.html
>>
>
>
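
The coverage figures discussed in this thread can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration, not how 'miramem' or MIRA compute their estimates. It assumes that theoretical coverage is total sequenced bases divided by genome size, and that the "at most 16" figure was based on the 475-base mean read length, which is how the message reads. The function names are made up for this example.

```python
def theoretical_coverage(n_reads, mean_read_len, genome_size):
    """Expected average coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


def rescale_read_length(coverage, old_len, new_len):
    """Coverage is proportional to the mean read length."""
    return coverage * new_len / old_len


def rescale_genome_size(coverage, old_size, new_size):
    """Coverage is inversely proportional to the genome size."""
    return coverage * old_size / new_size


if __name__ == "__main__":
    upper_bound = 16.0  # "at most" figure quoted in the thread (assumed 475-base reads)

    # Same data, but with a 400-base mean read length instead of 475
    print(f"{rescale_read_length(upper_bound, 475, 400):.1f}x")

    # Same data, but the genome is 26 Mb instead of 13 Mb
    print(f"{rescale_genome_size(upper_bound, 13e6, 26e6):.1f}x")
```

Under these assumptions the first value comes out at about 13.5x, in line with "the region of 14", and the second at 8x, which is in the range MIRA estimated from the assembly. That is consistent with the suggestion that the genome may be closer to 26 Mb than 13 Mb, although contamination or repeats could produce a similar picture.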