[mira_talk] Re: poor quality assembly results

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 6 May 2015 08:41:42 +0200

On 06 May 2015, at 8:28 , Jens Christian Froslev Nielsen
<jens.c.nielsen@xxxxxxxxxxx> wrote:

But what do you mean that I have 2500 125mers? I don't think I wrote that. I
use a subsample of 10-15 M 2*125mers.

HiSeq 2500 ...

Regarding subsample size (I'm not a CS expert), as I understand it, this is
essentially a memory and runtime issue. Do you have any recommendation on
how big a subsample I can use?

The question is not how big a subsample you *can* use, but what size of
subsample *makes sense*. As a rule of thumb: anything above a coverage of 100x
really does not make any sense, as everything in the genome will almost surely
be covered.
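
A back-of-the-envelope estimate makes the "almost surely covered" point concrete. The sketch below assumes an idealized model in which reads land uniformly at random, so that per-base depth is roughly Poisson-distributed with a mean equal to the coverage. Real libraries are less uniform, so treat the numbers as optimistic. The function names are hypothetical, and the 10 Mb genome size is just an example value.

```python
import math


def fraction_uncovered(coverage):
    """Probability that a given base receives zero reads (Poisson model)."""
    return math.exp(-coverage)


def expected_uncovered_bases(coverage, genome_size):
    """Expected number of bases with zero depth under the same model."""
    return genome_size * fraction_uncovered(coverage)


if __name__ == "__main__":
    genome_size = 10_000_000  # example: a 10 Mb genome
    for cov in (10, 30, 100):
        print(cov, expected_uncovered_bases(cov, genome_size))
```

Under these assumptions, a 10 Mb genome at 10x still expects a few hundred uncovered bases, while at 100x the expectation is effectively zero.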

At some point, adding more coverage is even detrimental because, yes, even
Illumina data contains errors. Some of these are NOT random, and at some point
you have so many non-random errors that it looks as if there is a valid
minor repeat variation … and MIRA starts to disentangle those areas and create
additional contigs.

If you were to use 36M 125mers, you’d end up with a theoretical coverage of
450x for a 10 Mb genome. Even after clipping that would still be around 400x.
That is total overkill, really, and GGCxG-induced sequencing errors will make
your life difficult.
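
The coverage arithmetic above can be reproduced in a few lines of Python. This is only an illustrative sketch: the function names are made up here, and it assumes the simple estimate coverage = number of reads × read length / genome size, ignoring clipping and uneven coverage.

```python
import math


def theoretical_coverage(num_reads, read_length, genome_size):
    """Average depth if every base of every read maps to the genome."""
    return num_reads * read_length / genome_size


def reads_for_coverage(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 10_000_000  # 10 Mb genome
    read_length = 125         # 125mers

    # 36M reads of 125 bases on a 10 Mb genome
    print(theoretical_coverage(36_000_000, read_length, genome_size))  # 450.0

    # Reads needed to reach the 100x rule-of-thumb ceiling
    print(reads_for_coverage(100, read_length, genome_size))  # 8000000
```

With these numbers, about 8M 125mers (4M pairs, if the reads are paired) already reach 100x on a 10 Mb genome.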

B.

