On 22 Aug 2014, at 23:03, Chenling <chenlingantelope@xxxxxxxxx> wrote:
> Is there a way to work this in Mira?

To answer this question first: not with the current tools. I could think of a way to code around a couple of things if I had to, but this is not on my current agenda. But read on for the reasons why I think the approach would make your life just a little bit easier.

> I have been trying to assemble endocellular symbiont genomes from sequencing
> of both the host and the symbiont. So far I have been pooling out reads from
> the symbiont by mapping them to a closely related strain and getting the
> reads that map for the assembly. I was just made aware of an extra step that
> might significantly improve this method. Basically I would calculate the
> k-mer coverage for the mapped reads, and because my symbiont has much higher
> coverage than the host (about 10 times higher), once I have the expected
> coverage I would then filter all my reads by the k-mer coverage criteria. So
> this would allow me to get reads even if they are divergent from my reference
> sequence.

So your symbiont is “expected” to be at 10x higher coverage than the host. Fine. Please note the quotes in my previous sentence: in the past 15 or so years of working closely with biologists (all of them very good to brilliant people), I’ve come to the conclusion that, more often than not, there’s a discrepancy between what biologists expect and what reality is. But let’s assume that the 10x number is correct.

Next problem: coverage variance. Some technologies have quite uneven coverage for a number of reasons. Taking Illumina, today’s most used technology, as an example: there’s quite a problem with GC-rich regions, and the coverage of correct kmers can drop there, sometimes dramatically so.
But let’s assume there were no problem with coverage and we had perfect coverage … the genomes of your host and its endosymbiont are still going to drive you nuts because of internal repeats: 10x repeats of the host would look like endosymbiont sequence, while 2x repeats of the endosymbiont would already be at 20x the average kmer coverage.

In essence: if I had your task and went for kmer coverage filtering, I’d set a boundary at 5x average coverage, filter out reads below that (next problem to think of: wrong kmers due to sequencing errors), and keep everything else. Which may, I concede, already reduce the data quite a bit.

But still, no tools in MIRA to do that atm, I’m sorry.

But another idea: have you already looked at MITObim? That tool might do what you’re looking for *if* your endosymbiont is not too large.

B.
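
For illustration only, here is a minimal Python sketch of the kmer coverage filter discussed above. It is not a MIRA or MITObim feature. All function names, the toy sequences, k=5 and the baseline are made up for the example. Reading "5x average coverage" as five times the host's average kmer coverage is an assumption, as is using the median kmer count per read as one possible way to cope with wrong kmers from sequencing errors.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield all k-mers of seq (forward strand only, to keep the sketch short)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_coverage(read, counts, k):
    """Median k-mer count along one read.

    The median is used so that a handful of k-mers hit by a sequencing
    error (rare, hence low count) do not drag the whole read down.
    """
    values = [counts[kmer] for kmer in kmers(read, k)]
    return median(values) if values else 0


def filter_reads(reads, k, baseline, factor=5):
    """Keep reads whose median k-mer count is at least factor * baseline.

    baseline is the average k-mer coverage the cutoff is measured against
    (assumed here to be the host's); factor=5 is one reading of the
    "5x average coverage" boundary suggested in the mail.
    """
    counts = count_kmers(reads, k)
    cutoff = factor * baseline
    return [r for r in reads if read_coverage(r, counts, k) >= cutoff]


# Toy data (made up): one host read at 1x, one symbiont read at 10x,
# plus one symbiont read carrying a single-base error.
HOST = "ACGTTGCAAGGCTTAC"
SYMBIONT = "TTGACCGGATATCGCA"
SYMBIONT_ERR = "TTGACCGGTTATCGCA"

reads = [HOST] + [SYMBIONT] * 10 + [SYMBIONT_ERR]
kept = filter_reads(reads, k=5, baseline=1)
print(len(kept), "of", len(reads), "reads kept")
```

Note that this sketch inherits the repeat problem described above: reads from a host repeat present ten times would pass the cutoff just like endosymbiont reads, and real data would also need reverse-complement handling and a far more memory-efficient kmer counter.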