Hi, thanks for the help.

Yes, actually I think it was mentioned that we paid half the usual price
for this sequencing, so the coverage was probably lower from the start.

The sequencing service provided a preassembled dataset as well: 39
scaffolds and 933 contigs. Some info: 13 Mb in total, 2.38% Ns, 36% GC,
largest scaffold 1.9 Mb, smallest 3 kb.

After my first MIRA run with the 454 reads only (they shouldn't have
used any non-454 reads), I was quite clueless about which strategy they
used to get 39 scaffolds where I got 11,000 contigs.

As I wrote, they didn't use any custom adaptor, so I don't know what I
should do as far as preprocessing goes...

On Thu, Dec 3, 2009 at 8:18 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
> On Thursday, 3 December 2009, Alessandro Riccombeni wrote:
> > I guess I am using MIRA in the wrong way.
> > My dataset is composed of 512,000 454 Titanium reads, without any
> > custom vector from the sequencing service. I have to assemble a
> > fungal genome of around 13 Mb.
> > [...]
>
> Hi Alessandro,
>
> regardless of what sequencing providers or Roche may tell you ... I
> don't recommend coverages below 20x in 454 sequencing. 30x is a
> reasonably good coverage to balance costs against number of contigs.
>
> Your numbers tell me you have *at most* a theoretical coverage of 16x,
> probably more in the region of 14x even, as 'miramem' still uses the
> mean read length Roche gave in the early days (475) whereas, having
> seen a few Titanium sets now, I suspect a mean length of 400 bases is
> more accurate. [Note to self: change that ASAP in the miramem
> estimator]
>
> Even worse, the numbers appended show that MIRA estimates the average
> coverage to be more like 8x or 9x at most. Which is somewhat
> disastrous.
>
> > [...]
> > I got around 10500-11000 large contigs, trying normal, draft and
> > accurate.
> > I am adding info from the output at the end of this message.
>
> Something's wrong with your data, at least that's my impression. In
> the simplest case you underestimated the size of the genome and it's
> not 13 Mb, but more like 26 Mb.
>
> Then there may be the possibility of contamination: was it really just
> one organism which was sequenced?
>
> Also, your bug might be highly repetitive, which adds another
> possibility for a large number of contigs.
>
> And last but not least ... it may be a problem of the sequencing kit
> used. If your organism has high GC (>=60%) and the Titanium data was
> generated with a sequencing kit delivered in the first 7 to 8 months
> of this year, then you need to talk to your provider as they need to
> talk to Roche.
>
> > I also tried doing a hybrid assembly with 4800 paired Sanger reads,
> > getting around 9700 large contigs. I found out that there are
> > vectors for the Forward and Reverse Sanger reads, so this is surely
> > creating problems. Nonetheless, I expected getting a much better
> > result from my 454 reads.
>
> The 5k paired Sanger reads won't help for an assembly as fragmented as
> the 454 data suggests. Even if you trimmed away the sequencing vectors
> a bit better: with 5k paired ends you can't scaffold 10k contigs.
>
> > My question is: is there something I am (blatantly) doing wrong?
> > What am I overlooking?
> > It's my first approach to assembling, so please excuse me for being
> > annoying.
>
> Did your sequencing provider give you the result of a Newbler assembly
> of the Titanium data and is it equally disastrous? Then there's
> nothing you can do apart from getting more sequences or getting the
> project resequenced if it turns out to be a bad sequencing kit.
>
> Regards,
> Bastien
>
> PS: oh, and don't be lured into the "we could try to close with
> Solexa" trap should the provider propose it.
> Accept only if *they* do the assembly and deliver you the result free
> of charge :-)
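
For anyone following the coverage arithmetic in this thread, here is a minimal sketch in Python. It assumes theoretical coverage is simply (number of reads x mean read length) / genome size; the function names are made up for illustration, and this is not necessarily how miramem computes its own estimate.

```python
import math


def theoretical_coverage(num_reads, mean_read_len, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * mean_read_len / genome_size


def rescale_coverage(coverage, old_mean_len, new_mean_len):
    """Coverage scales linearly with the assumed mean read length."""
    return coverage * new_mean_len / old_mean_len


def reads_needed(target_coverage, mean_read_len, genome_size):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_len)


if __name__ == "__main__":
    # A 16x estimate based on 475-base reads, re-expressed for 400-base reads.
    print(f"{rescale_coverage(16, 475, 400):.1f}x")

    # Doubling the genome size (13 Mb -> 26 Mb) halves the coverage.
    cov_13 = theoretical_coverage(512_000, 400, 13_000_000)
    cov_26 = theoretical_coverage(512_000, 400, 26_000_000)
    print(f"ratio 26 Mb / 13 Mb: {cov_26 / cov_13:.2f}")

    # Reads of 400 bases needed for 20x and 30x on a 13 Mb genome.
    for target in (20, 30):
        print(target, reads_needed(target, 400, 13_000_000))
```

Rescaling a 16x estimate from 475 to 400 bases gives roughly 13.5x, which is where the "region of 14" comes from. Under the same assumptions, reaching 30x on a 13 Mb genome with 400-base reads would take 975,000 reads, close to twice the 512,000 in this dataset.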