[mira_talk] Re: K-mer coverage filtering

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Fri, 22 Aug 2014 23:27:01 +0200

On 22 Aug 2014, at 23:03 , Chenling <chenlingantelope@xxxxxxxxx> wrote:
> Is there a way to work this in Mira? 

To answer this question first: not with the current tools. I could think of a 
way to code around a couple of things if I had to, but this is not on my 
current agenda.

But read on for the reasons why I think the approach would make your life only 
a little bit easier.

> I have been trying to assemble endocellular symbiont genomes from sequencing 
> of both the host and the symbiont. So far I have been pooling out reads from 
> the symbiont by mapping them to a closely related strain and getting the 
> reads that map for the assembly. I was just made aware of an extra step that 
> might significantly improve this method. Basically I would calculate the 
> k-mer coverage for the mapped reads, and because my symbiont has much higher 
> coverage than the host (about 10 times higher), once I have the expected 
> coverage I would then filter all my reads by the k-mer coverage criteria. So 
> this would allow me to get reads even if they are divergent from my reference 
> sequence.

So your symbiont is “expected” to be at 10x higher coverage than the host. 
Fine. Please note the quotes in my previous sentence: in the past 15 or so 
years of working tightly with biologists (all of them very good to brilliant 
people), I’ve come to the conclusion that, more often than not, there’s a 
discrepancy between what biologists expect and what the reality is.

But let’s assume that the 10x number is correct. Next problem: coverage 
variance. Some technologies have quite uneven coverage for a number of 
reasons. Taking Illumina, today’s most used technology, as an example: there’s 
quite a problem with GC-rich regions, and coverage of correct kmers can drop 
there, sometimes dramatically so.

But let’s assume there would be no problem with coverage and we had perfect 
coverage … the genomes of your host and its endosymbiont are still going to 
drive you nuts because of internal repeats: 10-copy repeats of the host would 
look like endosymbiont sequence, while 2-copy repeats of the endosymbiont 
would already be at 20x the average host kmer coverage.
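
To put numbers on the repeat problem, here is a small sketch. It assumes 
idealised, perfectly even coverage, with the host at a relative depth of 1 and 
the symbiont at 10; the function name and the cutoff value are made up for 
illustration.

```python
# Relative depths under the idealised "perfect coverage" assumption.
HOST_COVERAGE = 1.0
SYMBIONT_COVERAGE = 10.0
CUTOFF = 5.0  # hypothetical cutoff somewhere between the two levels


def expected_kmer_coverage(genome_coverage, copies):
    """Depth seen for a kmer that occurs `copies` times in a genome."""
    return genome_coverage * copies


cases = [
    ("host, unique region", HOST_COVERAGE, 1),
    ("host, 10-copy repeat", HOST_COVERAGE, 10),
    ("symbiont, unique region", SYMBIONT_COVERAGE, 1),
    ("symbiont, 2-copy repeat", SYMBIONT_COVERAGE, 2),
]

for label, coverage, copies in cases:
    depth = expected_kmer_coverage(coverage, copies)
    verdict = "kept" if depth >= CUTOFF else "dropped"
    print(f"{label}: {depth:g}x -> {verdict}")
```

The 10-copy host repeat ends up at the same depth as unique symbiont 
sequence, so a filter based on depth alone cannot tell the two apart.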

In essence: if I had your task and went for kmer coverage filtering, I’d set a 
boundary at 5x average coverage, filter out reads below that (next problem to 
think of: wrong kmers due to sequencing errors), and keep everything else. 
Which may, I concede, already reduce the data quite a bit.
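
A rough sketch of what such a filter could look like outside of MIRA, in plain 
Python. This is not a MIRA feature, and everything in it is an assumption made 
for illustration: the function names, reading "average coverage" as a baseline 
depth supplied by the caller, and using the median kmer count per read so that 
a few erroneous kmers do not sink an otherwise good read. It counts kmers on 
one strand only (no reverse complements) and keeps all counts in memory, so for 
real data sets a dedicated kmer counter would be the better choice.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield all overlapping kmers of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k):
    """Count every kmer over all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_reads(reads, k, baseline_coverage, factor=5.0):
    """Keep reads whose median kmer count is >= factor * baseline_coverage.

    baseline_coverage is an assumed input: the average kmer coverage the
    cutoff is measured against. The median is used so that a handful of
    low-count kmers from sequencing errors do not decide the outcome.
    """
    counts = count_kmers(reads, k)
    cutoff = factor * baseline_coverage
    kept = []
    for read in reads:
        read_counts = [counts[kmer] for kmer in kmers(read, k)]
        # Reads shorter than k yield no kmers and are dropped.
        if read_counts and median(read_counts) >= cutoff:
            kept.append(read)
    return kept


# Toy data: one "symbiont" read seen 10 times, one "host" read seen once.
symbiont_read = "ACGTTGCAAGGCTA"
host_read = "TTTTCCCCGGGGAA"
reads = [symbiont_read] * 10 + [host_read]

kept = filter_reads(reads, k=4, baseline_coverage=1.0)
print(len(kept), "of", len(reads), "reads kept")
```

Note that this does nothing about the repeat and GC problems described above; 
it only implements the cutoff itself.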

But still, no tools in MIRA to do that atm, I’m sorry.

But another idea: have you already looked at MITObim? That tool might do what 
you’re looking for *if* your endosymbiont is not too large.

B.


