[mira_talk] Re: 454/Solexa hybrid assembly of a 35Mbp genome?

On Tuesday 02 June 2009 Khan, Anar wrote:
> [...]
> My fungus' expected genome size is 35Mb. I'm running some simulations on a
> close relative's genome to choose the best sequencing strategy. I'd like to
> try assembling half or full plate of 454 Titanium reads, together with say
> an eighth/quarter plate of paired end SOLiD 3 reads (2 x 50bp) 

Hello Anar,

Jan has already given most of the answers I would have given, too. In short: 
going de-novo (or even hybrid) with short reads is pretty risky for 
eukaryotes. Paired-end information for microreads (be they SOLiD or Solexa) 
will not bring much advantage with small insert sizes (say, 200-500 bp) when 
you already have decent coverage from longer reads that are about as long as 
that insert size. Things look better with a 2 kb insert size, which is now 
possible for Solexa (for SOLiD I don't know).

With one plate of Titanium (1 million reads @ 400 bases) you'd get ~11x-12x 
coverage from 454. It's a bit low, but certainly a start.
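
As a rough check of that figure, here is a minimal sketch of the arithmetic 
(average coverage = total sequenced bases / genome size). The read count, read 
length and genome size are the numbers from this thread; the half-plate read 
count of 500,000 is simply assumed to be half a plate, and the function name is 
made up for illustration.

```python
def expected_coverage(num_reads, mean_read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size


GENOME_SIZE = 35_000_000  # expected genome size from the thread (35 Mb)

# ~1 million Titanium reads of ~400 bases, as quoted above.
full_plate = expected_coverage(1_000_000, 400, GENOME_SIZE)

# Assumption: half a plate yields half the reads at the same length.
half_plate = expected_coverage(500_000, 400, GENOME_SIZE)

print(f"full plate: {full_plate:.1f}x, half plate: {half_plate:.1f}x")
```

Real plates vary in yield and read length, so treat the result as a ballpark 
only.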

> (btw I'm
> just plugging the SOLiD data into MIRA as Solexa data for now - i.e. it's
> just simulated nucleotides rather than colour space).

Other people got burned with this strategy in the past, with any assembler, 
at least with the first and second generation of SOLiDs, where I was told 
that a third to half of the reads had at least one sequencing error 
somewhere. That one error, together with naive conversion of colour space to 
nucleotide bases, then leads to reads which are totally wrong from the 
sequencing error site onwards.
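
To illustrate why one colour error ruins the rest of a naively converted read, 
here is a small sketch. It assumes the usual two-base encoding in which each 
colour is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3) and 
the first base is known. The function names and the example sequence are made 
up for illustration; this is not code from MIRA or from ABI's tools.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def encode(seq):
    """Return (first base, colours); each colour encodes one adjacent base pair."""
    colours = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colours


def naive_decode(first_base, colours):
    """Decode colours to bases, each base derived from the previously decoded one."""
    out = [first_base]
    for colour in colours:
        out.append(BASES[CODE[out[-1]] ^ colour])
    return "".join(out)


true_seq = "ACGTTGCAAC"  # made-up example read
first, colours = encode(true_seq)
assert naive_decode(first, colours) == true_seq  # error-free round trip

# Simulate a single miscalled colour in the middle of the read.
bad_colours = list(colours)
bad_colours[3] ^= 1
decoded = naive_decode(first, bad_colours)

# Every base from the error site onwards now differs from the true read.
mismatches = [i for i, (a, b) in enumerate(zip(true_seq, decoded)) if a != b]
print(true_seq)
print(decoded)
print("mismatching positions:", mismatches)
```

Because every decoded base depends on the one before it, the single wrong 
colour shifts all downstream bases, which is why such reads cannot simply be 
treated as Solexa-like nucleotide reads with one error.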

This is also the reason why MIRA does not accept SOLiD as input at the moment 
... and it might very well be that it never really will. To work well with 
that data you need to work in colour space, and I simply don't have the time 
to implement all needed routines for that. ABI could alleviate the problem a 
bit if they'd provide robust conversion from colour space to nucleotide space 
*without* having to use a reference, but I imagine that this is not easy to do 
and would also eat up quite some resources.

> I've scoured the mailing lists for parameters which
> might control memory usage, but I don't think anything (e.g. -SK:mhpr,
> -SK:mchr) could bring memory usage below 16Gb RAM. Is that correct?

Correct. There is no way 16 GB is enough for a hybrid 454 + Solexa/SOLiD 
assembly of a eukaryote.

> If 454/Solexa hybrid assembly isn't possible using MIRA, I'll probably
> assemble the 454 reads using MIRA, and use the likes of Bambus to
> "scaffold" up the MIRA contigs. You may wish to comment on whether this
> sounds like a reasonable alternative approach.

It is reasonable (not only with MIRA, but also with other assemblers), and I 
know that quite a few people are going this way. The 16 GB should normally be 
enough for a Titanium plate.

Complete hybrid assemblies of prokaryotes have been possible with MIRA for 
quite a while, working better in some versions than in others. I don't 
advertise it much as I'm still experimenting a bit, and I would not recommend 
trying it without asking me which version to use and what parameters to try.

You just need to realise that assembly of a eukaryote will take you quite some 
time :-)

Regards,
  Bastien


