[mira_talk] Re: N50 decrease while sequencing depth increase ?

  • From: Joseph Fass <joseph.fass@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 4 Aug 2009 12:14:26 -0700

I'll chime in here, hoping this is relevant as well:

I've had this experience with velvet assemblies of the ~6K phiX virus that's
usually run in one of the control lanes of the Illumina GA chips.  Using a
subsample of a fraction of that lane's reads corresponding to ~20-30X "kmer"
coverage (velvet users will recognize this; it's coverage not in terms of
bases but rather the hash length - often 21, or 25, or longer depending on
read length), velvet assembles phiX perfectly.  Using higher coverages, such
as the whole lane, breaks the assembly into smaller pieces.

Now, that's velvet, not Mira ... but, I'm wondering, is there a reason for
this from the field of information theory?  Does using ultra-high coverage
ensure that there's an error of every kind at every positon, this confusing
even a "perfect" assembler?  Can anyone comment on this?

~Joe







On Tue, Aug 4, 2009 at 1:49 AM, Brian Forde <bforde@xxxxxxxxx> wrote:

> Not sure if this is relevant
>
> I believe that with 454 sequences the assembly can become saturated at a
> certain point and I believe that this can depend on the size of the genome.
> MWG did a presentation on it at FEMS this year. they showed that with a 2.7
> mb genome the assembly becomes saturated at a coverge of 21-22x above this
> threshold the assembly quality decrease i.e. you in fact get more contigs.
> Also a representative of GATC said something similar at a seminar where I
> work recently.
>
> I am currently looking for papers on this and will pass them on if/when I
> find them.
>
> regards
>
> Brian
>
>
> On Mon, Aug 3, 2009 at 2:06 PM, Bastien Chevreux <bach@xxxxxxxxxxxx>wrote:
>
>> On Montag 03 August 2009 johann JOETS wrote:
>> > maybe I made something wrong using Mira ?
>>
>> Hello Johann,
>>
>> the command line was exactly as it should be for such a test case.
>>
>> > Hope you will have an idea about this.
>>
>> A few. Let's see what's been biting you.
>>
>> My first question: does metasim generate reads in forward and reverse
>> direction? (I never used that program) That might be important later.
>>
>> Second question: did you tell metasim to also simulate paired-ends or did
>> you
>> just simulate an unpaired data generation. You wrote that you'll want to
>> have
>> a look at a piece of a plant genome. Plants tend to be also quite complex
>> in
>> terms of repetitiveness, so going with unpaired sequencing might not be
>> the
>> best strategy there.
>>
>> > [...]
>> > The N50 is as follow :
>> >       N50     av cov
>> > 5X    257     0
>> > 10X   5419    8,95
>> > 15X   111471  13,74
>> > 20X   108664  18,29
>> > 25X   27526   22,21
>> > 50X   446     40,08
>>
>> Third question: how did you calculate the N50? Did you do that yourself by
>> just or did you take numbers from the file "*_info_assembly.txt"? If the
>> later, did you take numbers for 'large contigs' or 'all contigs'.
>>
>> If the numbers above are for all contigs, then I'm not so much troubled.
>> You
>> would need to filter the results first to get rid of contig debris. Please
>> also see http://chevreux.org/uploads/media/mira3_usage.html#section_14(What
>> to do with MIRA result files) regarding this.
>>
>> Now, if the numbers above are for large contigs, then I'm troubled and I'd
>> ask
>> you to make me the sets for 20x to 100x available via FTP if possible.
>>
>> > You may notice that the average coverage is roughly as expected. However
>> I
>> > was surprised by the decrease of n50 for datasets deeper than 15X. This
>> is
>> > also true for the length of the largest contig.
>> > As I know were reads should have been assembled I can check assembly
>> > quality (roughly I count breakages in contigs). According to these
>> tests,
>> > the quality of the assembly also drop down.
>>
>> Now, this is a bit strange. I would expect that starting with 20x, the
>> whole
>> 150kb should be covered ... with 50x certainly more so. I would really
>> like to
>> have a look at those data sets.
>>
>> Regards,
>>   Bastien
>>
>>
>> --
>> You have received this mail because you are subscribed to the mira_talk
>> mailing list. For information on how to subscribe or unsubscribe, please
>> visit http://www.chevreux.org/mira_mailinglists.html
>>
>
>
>
> --
> "If scientists knew what they were doing they wouldn't call it research"
>



-- 
Joseph Fass
Bioinformatics Programmer
UC Davis Bioinformatics Core
joseph.fass -at- gmail.com (professional)
970.227.5928 (c)  ||  530.752.2698 (w)

Other related posts: