[mira_talk] Re: MIRA: I am doing it wrong.

  • From: Jeremiah Davie <jdavie@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Fri, 4 Dec 2009 12:21:43 -0500

Hi Alessandro,
From your description, my guess is that your sequencing provider used Newbler, the assembly program that Roche/454 packages with the sequencer, to create your 933 contigs from the individual reads in the *.sff files. As to the scaffolds, I'd have to agree with Sven that this sounds like paired-end data was included as well, unless I've misunderstood something.

My experience with MIRA, albeit very limited, is that it generates many more, smaller contigs from 454 data than Newbler does. My understanding is that this results from differences in the way MIRA and Newbler resolve disputes arising from sequencing errors (if I'm wrong, Bastien, please correct me :). This is particularly apparent when 454 reads are of low(er) quality, as my impression is that Newbler makes assumptions by default that MIRA does not.

I'm also concerned about the read lengths; those should be longer. Taken together, your data indicate that the sequencing provider probably owes you another run.

Best wishes,
- Jeremiah


On Dec 4, 2009, at 8:22 AM, Sven Klages wrote:

Hi Alessandro,

if they generate scaffolds, I assume you have paired ends?
Scaffolds are fine ... and Roche software probably knows Roche data best.

But how many contigs make up these 39 scaffolds? (From what you have written, I don't assume
that all 933 contigs form 39 clusters, or do they?)

On the other hand, I have a small phytoplasma genome here which is really causing problems in assembly. IMHO the real problem is the data quality, due to a GC content of ~26% and some kind of repetitive nature. The reads themselves are quite short, ~200 bp (Titanium data!).

I'd love to have some extra Sanger or Solexa coverage ;-)

Just some comments, no solution,
Sven

2009/12/4 Alessandro Riccombeni <rikkomba@xxxxxxxxx>
Hi,

thanks for the help.
Yes, actually I think it was mentioned that we paid half the usual price for this sequencing, so it was probably done at a lower coverage from the start. The sequencing service provided a preassembled dataset as well: 39 scaffolds and 933 contigs. Some info: 13 Mb, 2.38% are Ns, GC content of the sequence is 36%, the largest scaffold is 1.9 Mb and the smallest is 3 kb. After my first MIRA run with the 454 reads only (as they shouldn't have used any non-454 reads), I was quite clueless about which strategy they used to get 39 scaffolds where I got 11000 contigs. As I wrote, they didn't use any custom adaptor, so I don't know what I should do as far as preprocessing goes...


On Thu, Dec 3, 2009 at 8:18 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
On Thursday 03 December 2009 Alessandro Riccombeni wrote:
> I guess I am using MIRA in the wrong way.
> My dataset is composed of 512,000 454 Titanium reads, without any custom
> vector from the sequencing service. I have to assemble a fungal genome of
> around 13 MBs.
> [...]

Hi Alessandro,

regardless of what sequencing providers or Roche may tell you ... I don't recommend coverages below 20x in 454 sequencing. 30x is a reasonably good
coverage for striking a balance between cost and number of contigs.

Your numbers tell me you have *at most* a theoretical coverage of 16x, probably more in the region of 14x. 'miramem' still uses the mean read length Roche gave in the early days (475 bases), though having seen a few Titanium sets now, I suspect a mean length of 400 bases is more accurate. [Note to self: change that
ASAP in the miramem estimator]
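
The theoretical coverage mentioned above is simply total sequenced bases divided by genome size. Here is a minimal sketch of that arithmetic, using the 512,000 reads and 13 Mb genome from the original post and assuming the 400-base mean read length suggested above. The function name is made up for illustration and this is not how miramem itself is implemented. With these inputs it comes out at roughly 15.8x, in line with the "at most 16" figure; reads shortened by quality clipping would push it lower.

```python
def theoretical_coverage(num_reads: int, mean_read_length: float, genome_size: float) -> float:
    """Expected average coverage: total sequenced bases divided by genome size."""
    return num_reads * mean_read_length / genome_size


# Figures from the thread; the 400-base mean is an assumed value for Titanium reads.
coverage = theoretical_coverage(
    num_reads=512_000,
    mean_read_length=400,
    genome_size=13_000_000,
)
print(f"theoretical coverage: {coverage:.1f}x")
```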

Even worse, the numbers you appended show that MIRA estimates the average coverage
to be more like 8x or 9x at most, which is somewhat disastrous.

> [...]
> I got around 10500-11000 large contigs, trying normal, draft and accurate.
> I am adding info from the output at the end of this message.

Something's wrong with your data, at least that's my impression. In the simplest case you underestimated the size of the genome and it's not 13 Mb, but
more like 26 Mb.
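
The "simplest case" can be checked by turning the coverage formula around: genome size is roughly total sequenced bases divided by the coverage the assembler actually observes. This is a rough sketch only, assuming a 400-base mean read length and taking the 8x to 9x estimate at face value. The function name is hypothetical, and repeats or contamination would distort the result.

```python
def implied_genome_size(num_reads: int, mean_read_length: float, observed_coverage: float) -> float:
    """Genome size (in bases) implied by the observed average coverage."""
    return num_reads * mean_read_length / observed_coverage


# 512,000 reads from the thread; 400 bases is an assumed mean read length.
for observed in (8, 9):
    size_mb = implied_genome_size(512_000, 400, observed) / 1e6
    print(f"observed coverage {observed}x -> implied genome size {size_mb:.1f} Mb")
```

At 8x this gives about 25.6 Mb, which is where the "more like 26" figure comes from.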

Then there may be the possibility of contamination: was it really just one
organism which was sequenced?

Also, your bug might be highly repetitive, which adds another possibility for
a large number of contigs.

And last but not least ... it may be a problem with the sequencing kit used. If your organism has high GC (>=60%) and the Titanium data was generated with a sequencing kit delivered in the first 7 to 8 months of this year, then you need
to talk to your provider, as they need to talk to Roche.

> I also tried doing a hybrid assembly with 4800 paired Sanger reads, getting
> around 9700 large contigs. I found out that there are vectors for the
> Forward and Reverse Sanger reads, so this is surely creating problems.
> Nonetheless, I expected getting a much better result from my 454 reads.

The 5k paired Sanger reads won't help for an assembly as fragmented as the 454 data suggests. Even if you trimmed away the sequencing vectors a bit better:
with 5k paired-end reads you can't scaffold 10k contigs.

> My question is: is there something I am (blatantly) doing wrong? What am I
> overlooking?
> It's my first approach to assembling, so please excuse me for being
> annoying.

Did your sequencing provider give you the result of a Newbler assembly of the Titanium data, and is it equally disastrous? Then there's nothing you can do apart from getting more sequences, or getting the project resequenced if it turns
out to be a bad sequencing kit.

Regards,
 Bastien

PS: oh, and don't be lured into the "we could try to close with Solexa" trap should the provider propose it. Accept only if *they* do the assembly and
   deliver you the result free of charge :-)


--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html


