[mira_talk] Re: looking for a cost-effective way of obtaining a quick-and-dirty draft genome

  • From: "abenjak ." <abenjak@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 12 Mar 2015 20:17:37 +0100

Hi Francisco,

I cannot give you precise answers, but here are some points to consider.
The first is the size of your genome. MIRA is not designed for such a big
genome: you might need at least 100 GB of RAM just to run it, and even with
dozens of CPUs you might be waiting forever for it to finish (someone
correct me if I am wrong).

MiSeq gives longer reads, but be careful when estimating the final coverage:
2x300 is in reality closer to 2x250, because the last 50 bases of the reads
are often of poor quality. The newest update of the machine's software
improved base calling a lot, but you will still have bad quality at the ends
of the R2 reads. Also, with standard Illumina protocols one rarely gets
fragments long enough for 2x300, which means that a large fraction of the
read pairs will overlap. In a way this is good, because you can merge such
pairs, overcome the bad-quality areas with the overlapping consensus and
obtain longer single-end (SE) reads, but this comes at the expense of the
final coverage. Based on this, your final coverage with one MiSeq run might
be less than 5x, and I don't think you want to waste money on that,
especially if your budget is tight and you can afford only one run. In
general, de novo assemblies from Illumina data at <10x coverage do not make
sense.
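
To make the coverage argument concrete, here is a rough back-of-the-envelope
sketch in Python. The genome size and run yield come from the original
question. The merged fraction (0.8) and mean fragment length (350 bp) are
placeholder assumptions of mine, not measured values, and the function names
are made up for this illustration.

```python
def raw_coverage(yield_bp, genome_bp):
    """Nominal coverage: sequenced bases divided by genome size."""
    return yield_bp / genome_bp


def usable_coverage(yield_bp, genome_bp, read_len, usable_len,
                    merged_fraction, mean_fragment_len):
    """Rough coverage left after end-trimming and merging overlapping pairs.

    merged_fraction and mean_fragment_len are assumptions about the library,
    not properties of the sequencer.
    """
    n_pairs = yield_bp / (2 * read_len)
    unmerged_bp = 2 * usable_len                         # both trimmed mates kept
    merged_bp = min(mean_fragment_len, 2 * usable_len)   # one merged read per pair
    per_pair = (1 - merged_fraction) * unmerged_bp + merged_fraction * merged_bp
    return n_pairs * per_pair / genome_bp


GENOME_BP = 1.5e9   # diploid genome size from the original question
YIELD_BP = 13.2e9   # MiSeq 2x300 run yield from the original question

nominal = raw_coverage(YIELD_BP, GENOME_BP)
# Trim 300 -> 250 usable bases per read, no merging.
trimmed = usable_coverage(YIELD_BP, GENOME_BP, 300, 250, 0.0, 350)
# Same trimming, assuming 80% of pairs overlap on 350-bp fragments.
merged = usable_coverage(YIELD_BP, GENOME_BP, 300, 250, 0.8, 350)

print(f"nominal {nominal:.1f}x, trimmed {trimmed:.1f}x, merged {merged:.1f}x")
```

With these placeholder numbers the nominal 8.8x drops to about 7.3x after
trimming and about 5.6x after merging. Reads lost to quality filtering are
not modelled here and would lower the figure further.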

I would play it safe and, for similar money, rather do a HiSeq PE run, which
should give you around 50x coverage if your genome is homozygous (is it?
It makes a big difference, not only for the coverage, but also for the
assembly).

HiSeqPE vs HiSeqMP: sorry, I don't have experience with MP. My understanding
is that mate pairs are really needed for de novo assemblies of large genomes
(which normally have a lot of repeats), but the price of an MP library is
much higher than that of a PE library (I think). Then again, for a
eukaryotic genome PEs alone will likely perform very badly (in terms of
contig N50), while MPs might pay off by providing longer scaffolds (using
other assemblers). You would still end up with a fair amount of gaps, but at
least the genic part should get properly assembled.
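
In case it helps when comparing assemblies against the 10-kb target, here is
a minimal sketch of how contig N50 is usually computed: sort the contigs from
longest to shortest and report the length at which the running total reaches
half of all assembled bases. The function name and the example lengths are
made up for illustration.

```python
def n50(contig_lengths):
    """Return the contig N50: the length L such that contigs of length >= L
    together hold at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths in bp, for illustration only.
example = [2000, 4000, 10000, 12000, 15000]
print(n50(example))
```

For the example list the total is 43,000 bp, so the N50 is the 12,000-bp
contig, where the running total first passes 21,500 bp.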

Hope some of this can help.
Andrej



On Thu, Mar 12, 2015 at 4:47 PM, Juan Francisco <juan_francisco@xxxxxx>
wrote:

> Hello,
>
> I am looking for a cheap and fast way of obtaining a very preliminary
> draft of a diploid eukaryotic genome (diploid genome size 1.5 Gbp, haplome
> size 750 Mbp). My target is not too ambitious: a mere 10-kb N50 for the
> contigs (no need for scaffolds) would be sufficient for my purpose. To save
> time and money I would like to prepare and sequence a single library. Here
> are the four different strategies I have been considering so far:
>
> - a MiSeq 2*300 bp 13.2 Gbp run should give me a ca. 8X coverage of the
> diploid genome. Would it be enough coverage for MIRA to reach my 10-kb N50
> target or should I request two runs? Can I hope to get there with a mere PE
> library (requesting an insert size as large as possible, which should be at
> least 600 bp) or should I go for a more expensive 3-kb MP library (which
> should bridge larger repeats but may contain a sizable percentage of
> contaminant PE reads, possibly wreaking havoc on the assembly process)?
>
> - alternatively I could go for a HiSeq 2*150 bp RapidRun (of either the PE
> or MP library mentioned above), which should yield a 56X coverage of the
> diploid genome. This would be more expensive than a MiSeq run and the reads
> would be shorter, but the coverage would be higher...
>
> Which strategy would you recommend me to use (MiSeqPE, MiSeqMP, HiSeqPE,
> HiSeqMP)? Or would you suggest trying something different?
>
> Thanks a lot in advance,
> Juan Francisco
