[mira_talk] Questions on sequencing coverage ...

(This is a bit of a ramble; perhaps I should rather have a blog for that ...)


Some questions in my inbox make me scratch my head and ask myself how to 
answer them.

One category of these is questions like this one (paraphrased and 
summarised from several very similar mails):

  "We have sequenced this $organism$ with 454 and have been trying to assemble
   it with various assemblers, but every assembly program gives us back
   several hundreds of contigs, MIRA too. We have spent $a lot of time$ in
   trying to enhance the assembly, but before we go into wet lab finishing,
   can you please advise what we should do to improve the situation?"

Sometimes assembly reports from MIRA and various other assemblers are 
attached to these kinds of mails; sometimes it takes a few mails back and 
forth for me to get them.

Only one case turned out to be a really, really ugly bug with several 
repetitive phages/prophages in the genome. In the other 6 of the now 7 
cases this year, I had the 'surprise' (which after some time wasn't one 
anymore) of discovering average coverages of around 10x-12x for the 454 
sequencing. In one case it even went down to 7x from a quarter of a plate.

There's one thing to be said about coverage and de-novo assembly: especially
for bacteria, getting more than 'decent' coverage with 454 FLX or Titanium is 
*cheap*. Every assembly program I know will be happy to assemble de-novo 
genomes with coverages of 25x, 30x, 40x ... and the number of 
contigs/scaffolds will still drop dramatically between a 15x 454 and a 30x 
454 project.

With the introduction of the Titanium series, a full 454 plate of unpaired 
reads may seem to be too much: you should get at least 200 megabases out of 
a plate, and press releases from 454 suggest 400 to 600 megabases (I'll know 
for sure next week when I get the first Titanium data myself).

In any case, do some calculations: if the coverage you expect to get reaches
50x (e.g. 200 Mb of raw sequence for a 4 Mb genome), then you (or rather the
assembler) can still throw away the worst 20% of the sequence (with lots of
sequencing errors), concentrate on the really, really good parts of the 
sequences and still get nice and long contigs with an average coverage of 
40x.
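
For those who prefer to see the arithmetic spelled out, here is a minimal
Python sketch of that calculation. The function name and its parameters are
made up for illustration; the numbers are the ones from the example above
(200 Mb raw sequence, 4 Mb genome, worst 20% discarded).

```python
def expected_coverage(raw_bases, genome_size, discard_fraction=0.0):
    """Average coverage left after discarding a fraction of the raw sequence.

    Hypothetical helper: coverage = usable bases / genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    if not 0.0 <= discard_fraction < 1.0:
        raise ValueError("discard_fraction must be in [0, 1)")
    usable_bases = raw_bases * (1.0 - discard_fraction)
    return usable_bases / genome_size


raw = 200_000_000     # 200 Mb of raw sequence from the plate
genome = 4_000_000    # 4 Mb bacterial genome

# Coverage with everything kept, then with the worst 20% thrown away.
print(f"raw coverage:     {expected_coverage(raw, genome):.0f}x")
print(f"after discarding: {expected_coverage(raw, genome, 0.20):.0f}x")
```

Plugging in the numbers of your own project (a quarter of a plate, a larger
genome) shows quickly whether you end up in the 10x-12x range or not.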

I think that at the moment a full 454 plate will cost you between 8000 and 
12000 bucks (or less). Then you just need to do the math: is it worth 
investing 10, 20, 30 or more days of wet lab work, designing primers, doing 
PCR sequencing etc. and trying to close remaining gaps when you went for a 
'low' coverage? Or do you invest a few thousand bucks to get some additional 
coverage and considerably reduce the uncertainties and gaps which remain?
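
One rough way of doing that math is to ask what a day of finishing work
would have to cost for both options to come out even. The sketch below
assumes a plate price of 10000 (simply the middle of the range I quoted) and
a hypothetical helper function; what a wet lab day really costs you is
something only you can fill in.

```python
def break_even_daily_cost(extra_sequencing_cost, finishing_days):
    """Daily wet lab cost at which finishing costs as much as more sequencing.

    Hypothetical helper: if a day of finishing work costs you more than the
    returned value, buying the additional coverage is the cheaper option.
    """
    if finishing_days <= 0:
        raise ValueError("finishing_days must be positive")
    return extra_sequencing_cost / finishing_days


PLATE_COST = 10_000  # assumption: middle of the 8000 to 12000 range

for days in (10, 20, 30):
    daily = break_even_daily_cost(PLATE_COST, days)
    print(f"{days} days of finishing: break-even at {daily:.0f} per day")
```

This deliberately ignores that additional coverage rarely removes all
finishing work, and that low coverage may need far more than 30 days.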

Remember, you probably want to do research on your bug and not research on how
to best assemble and close genomes. So even if you put (PhD) students on the
job, it's costing you time and money if you tried to save money earlier on
the sequencing. Penny-wise and pound-foolish is almost never a good strategy
:-)

I do agree that with eukaryotes, things start to get a bit more interesting
from the financial point of view ...

Have a nice weekend,
  Bastien


-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html