[mira_talk] Re: MIRA: I am doing it wrong.

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 3 Dec 2009 21:18:05 +0100

On Thursday 03 December 2009 Alessandro Riccombeni wrote:
> I guess I am using MIRA in the wrong way.
> My dataset is composed of 512,000 454 Titanium reads, without any custom
> vector from the sequencing service. I have to assemble a fungal genome of
> around 13 MBs.
> [...]

Hi Alessandro,

regardless of what sequencing providers or Roche may tell you ... I don't 
recommend coverages below 20x in 454 sequencing. 30x is a reasonably good 
coverage to get in balance between costs and number of contigs.

Your numbers tell me you have *at most* a theoretical coverage of 16x, probably 
more in the region of 14x. Note that 'miramem' still uses the mean read length 
Roche gave in the early days (475 bases); having seen a few Titanium sets now, 
I suspect that 400 bases as mean length is more accurate. [Note to self: change 
that ASAP in the miramem estimator]
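
As a rough sketch of the arithmetic behind these coverage figures (not what
miramem actually computes internally), theoretical coverage is simply total
sequenced bases divided by genome size. The function names below are made up
for illustration, and the 400-base mean length is the guess from above, not a
measured value:

```python
import math

def theoretical_coverage(n_reads, mean_read_len, genome_size):
    """Expected coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size

def reads_needed(target_coverage, mean_read_len, genome_size):
    """Smallest read count that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_len)

N_READS = 512_000          # read count from the original post
GENOME_SIZE = 13_000_000   # 13 MB, the poster's own estimate

# Roche's early mean length (475) vs. the assumed, more realistic 400 bases
for mean_len in (475, 400):
    cov = theoretical_coverage(N_READS, mean_len, GENOME_SIZE)
    print(f"mean length {mean_len}: {cov:.1f}x")

# How many 400-base reads the recommended coverages would take
for target in (20, 30):
    n = reads_needed(target, 400, GENOME_SIZE)
    print(f"{target}x needs about {n:,} reads")
```

With 400 bases per read this gives just under 16x before any quality or
adaptor clipping, which can only lower the figure further.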

Even worse, the numbers appended show that MIRA estimates the average coverage 
to be more like 8x or 9x at most, which is somewhat disastrous.

> [...]
> I got around 10500-11000 large contigs, trying normal, draft and accurate.
> I am adding info from the output at the end of this message.

Something's wrong with your data, at least that's my impression. In the 
simplest case you underestimated the size of the genome and it's not 13 MB, but 
more like 26 MB.
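
The 26 MB figure can be reproduced with a back-of-the-envelope calculation:
divide the total sequenced bases by the coverage MIRA actually observes. This
is only a sketch, assuming a 400-base mean read length and taking MIRA's 8x to
9x estimate at face value; the function name is hypothetical:

```python
def implied_genome_size(n_reads, mean_read_len, observed_coverage):
    """Genome size that would explain the observed coverage, in bases."""
    return n_reads * mean_read_len / observed_coverage

N_READS = 512_000   # read count from the original post
MEAN_LEN = 400      # assumed mean Titanium read length

for cov in (8, 9):  # average coverage as estimated by MIRA
    size_mb = implied_genome_size(N_READS, MEAN_LEN, cov) / 1e6
    print(f"{cov}x observed -> roughly {size_mb:.1f} MB genome")
```

At 8x the implied size is about twice the expected 13 MB. Contamination or a
bad sequencing run would also depress the observed coverage, so this number
alone does not prove the genome is larger.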

Then there may be the possibility of contamination: was it really just one 
organism which was sequenced?

Also, your bug might be highly repetitive, which adds another possibility for 
a large number of contigs.

And last but not least ... it may be a problem of the sequencing kit used. If 
your organism has high GC (>=60%) and the Titanium data was generated with a 
sequencing kit delivered in the first 7 to 8 months of this year, then you need 
to talk to your provider as they need to talk to Roche.

> I also tried doing a hybrid assembly with 4800 paired Sanger reads, getting
> around 9700 large contigs. I found out that there are vectors for the
> Forward and Reverse Sanger reads, so this is surely creating problems.
> Nonetheless, I expected getting a much better result from my 454 reads.

The 5k paired Sanger reads won't help for such a fragmented assembly as the 454 
data suggests. Even if you trimmed away the sequencing vectors a bit better: 
with 5k paired-end reads you can't scaffold 10k contigs.

> My question is: is there something I am (blatantly) doing wrong? What am I
> overlooking?
> It's my first approach to assembling, so please excuse me for being
> annoying.

Did your sequencing provider give you the result of a Newbler assembly of the 
Titanium data and is it equally disastrous? Then there's nothing you can do 
apart from getting more sequences or getting the project resequenced if it 
turns out to be a bad sequencing kit.

Regards,
  Bastien

PS: oh, and don't be lured into the "we could try to close with Solexa" trap
    should the provider propose it. Accept only if *they* do the assembly and
    deliver you the result free of charge :-)


