[mira_talk] Re: highly heterozygous assembly

  • From: Chenling <chenlingantelope@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 19 Aug 2014 10:12:43 -0700

Out of curiosity, why can’t the half-coverage criterion be used to differentiate 
haploid from diploid? Is it because the coverage varies too greatly without 
any biological reason for it? But there is software out there that uses coverage 
information to detect copy number variations. 
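
For illustration, here is a minimal sketch of what such a half-coverage 
criterion could look like. Everything in it is an assumption made for the 
example and not something MIRA does: the function name `label_by_coverage`, 
the 20% tolerance, the made-up contig names and coverage values, and the use 
of the median coverage as the "full" reference (which only makes sense if most 
contigs are collapsed, i.e. represent both haplotypes).

```python
from statistics import median


def label_by_coverage(contig_coverage, tolerance=0.2):
    """Label contigs by their coverage relative to the median coverage.

    contig_coverage: dict mapping contig name -> mean read coverage
                     (hypothetical input, not a MIRA output format).
    tolerance:       allowed relative deviation around the expected
                     ratios 1.0 (both haplotypes) and 0.5 (one haplotype).
    """
    reference = median(contig_coverage.values())
    labels = {}
    for name, coverage in contig_coverage.items():
        ratio = coverage / reference
        if abs(ratio - 1.0) <= tolerance:
            labels[name] = "full"      # looks like a collapsed contig
        elif abs(ratio - 0.5) <= tolerance * 0.5:
            labels[name] = "half"      # looks like a single-haplotype contig
        else:
            labels[name] = "unclear"   # fits neither expectation
    return labels


# Made-up coverage values, for illustration only.
example = {
    "c1": 40.0, "c2": 41.5, "c3": 38.0,  # around the median
    "c4": 20.5, "c5": 19.0,              # around half the median
    "c6": 29.0, "c7": 82.0,              # in between, and about double
}
labels = label_by_coverage(example)
for name in sorted(labels):
    print(name, example[name], labels[name])
```

The contigs labelled "unclear" (c6 and c7 in this toy data) are where it gets 
difficult: a collapsed repeat or a plain coverage fluctuation can produce such 
values just as well as haplotype differences can.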

Chenling 

On Aug 19, 2014, at 10:06 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

> On 20 Aug 2014, at 5:40, Adrian Pelin <apelin20@xxxxxxxxx> wrote:
>> You mentioned "The remaining reads will then probably form 2 more small 
>> contigs containing the remaining SNPs.".
>> 
>> These contigs are redundant when wanting to get a haploid assembly. They 
>> overestimate the haploid assembly size, and do not reflect the correct 
>> diploid genome size. Furthermore, when one wants to map reads to contigs to 
>> find SNPs, such small contigs need to be excluded, because otherwise reads 
>> would map to them as well, and that's a problem, because then it looks like 
>> you don't have SNPs when you actually do.
>> 
>> It would be great if MIRA could label these contigs somehow. Indicating the 
>> region where they belong would be great as well (i.e., the rough 
>> coordinates on the bigger contig where the variation has caused these 
>> smaller ones to be built).
>> 
>> Maybe there is a way to do this already and remove these smaller contigs.
> 
> MIRA does not implicitly label these contigs because this is a very, very 
> analysis-specific thing, i.e., it depends on what you want.
> 
> Unfortunately, knowing what you want does not always mean that you will get 
> it. Case in point: your request.
> 
> Imagine you have a diploid organism with a chromosome of 10 MB. Of those 
> 10 MB, the first 8 MB have no real repeats and are identical on both 
> chromosomes; the remaining 2 MB have no real repeats either but have 1 SNP 
> every 20 bases when comparing the haplotypes. An assembler will reconstruct 
> the 8 MB in one contig (or several, if we had some coverage problems), while 
> the remaining 2 MB will form contigs describing the two haplotypes of 2 MB 
> each. That is, as you said, you get contigs for approximately 12 MB. But 
> these contigs cannot be easily sorted into different bins: neither by length 
> (they can be up to several 100 kb in the described scenario), nor by coverage 
> (in general the different haplotype contigs have half the coverage of the 
> contigs of the first 8 MB), nor by any other means I could think of.
> 
> So, basically you are stuck with your contigs describing 12 MB of chromosomal 
> DNA for a diploid organism that has a chromosome of 10 MB, and that is 
> probably the best you can get without spending an extensive amount of time 
> trying to join contigs yourself … because for a real organism, you would need 
> to check that those contigs describe different haplotypes and are not, as 
> will also be the case, slightly different copies of repetitive areas in the 
> genome.
> 
> Honestly, I have no idea how one could get what you want with short reads,
> 
> B.
> 
> 
> --
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html
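
The size arithmetic in the 10 MB scenario quoted above can be written out as a 
short sketch. The function and variable names are made up for illustration; 
the numbers are the ones from the example (8 MB identical, 2 MB divergent, 
1 SNP every 20 bases), and it assumes the idealised case where every divergent 
region assembles exactly once per haplotype.

```python
def assembled_size_mb(identical_mb, divergent_mb, ploidy=2):
    """Total contig length if identical regions collapse into one copy
    and divergent regions are assembled once per haplotype."""
    return identical_mb + ploidy * divergent_mb


identical_mb = 8   # identical on both chromosomes, assembled once
divergent_mb = 2   # 1 SNP every 20 bases, assembled once per haplotype

haploid_mb = identical_mb + divergent_mb                  # true chromosome size
total_mb = assembled_size_mb(identical_mb, divergent_mb)  # what the contigs add up to
excess_mb = total_mb - haploid_mb                         # redundant haplotype sequence
snp_count = divergent_mb * 1_000_000 // 20                # SNPs in the divergent part

print("chromosome size (MB):", haploid_mb)
print("assembled contigs (MB):", total_mb)
print("excess over haploid size (MB):", excess_mb)
print("SNPs between haplotypes:", snp_count)
```

In a real genome the excess would be harder to pin down, since near-identical 
repeat copies can produce extra contigs that look much the same.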


