[mira_talk] Re: mira for trimming

  • From: Christoph Hahn <chrisi.hahni@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 27 Aug 2015 10:54:43 +0100

Thanks Bastien!

Will try to keep my chunks at a coverage > 30x then!

cheers,
Christoph

On 27/08/2015 01:57, Bastien Chevreux wrote:

On 25 Aug 2015, at 8:33 , Christoph Hahn <chrisi.hahni@xxxxxxxxx> wrote:
I am using MIRA to great effect for clipping/trimming my Illumina data for
small to medium sized eukaryotic genomes (via nop=0 and then extracting the
trimmed reads from the resulting maf file with miraconvert).
For some larger datasets, however, I run out of memory, so I was thinking of
simply splitting the data into smaller portions, running each subset through
MIRA separately, and then merging all resulting files in the end.
I am worried that this might actually affect the trimming, as I am assuming that
MIRA also uses kmer frequencies for making decisions about what is to be
removed (?), so I wanted to run it by you first.

That is correct: splitting the data set will affect trimming.

Writing a standalone clipping tool which does not need to store the sequencing
data in memory has been on my list for quite a while. I already started changing
a couple of routines for this, however I haven’t finished this and it may still
take a while … so don’t hold your breath.

In the meantime:
- for genomic data, if you can split it into chunks giving an average coverage
of at least 30x (better: > 40-50x), you should be fine. Going below 20x will
probably result in losing some difficult areas (bidirectional GGCxG motifs and such).
- for RNASeq Illumina 100mers, running with 20 (better 30) million read-pairs
would keep all but the rarest transcripts (which, in fact, are almost
indistinguishable from sequencing errors)
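
As a rough illustration of the genomic coverage rule of thumb above, the following Python sketch estimates how many chunks a data set could be split into while each chunk stays at or above a chosen average coverage. It is not part of MIRA: the function names and the example figures (read count, read length, genome size) are hypothetical, and the genome size would have to be estimated by the user.

```python
import math


def average_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = total sequenced bases / (estimated) genome size."""
    return n_reads * read_length / genome_size


def max_chunks(n_reads: int, read_length: int, genome_size: int,
               min_coverage: float = 30.0) -> int:
    """Largest number of equal chunks that each keep >= min_coverage.

    Returns 1 when the whole data set is already below min_coverage,
    i.e. splitting is not advisable at all in that case.
    """
    total = average_coverage(n_reads, read_length, genome_size)
    return max(1, math.floor(total / min_coverage))


if __name__ == "__main__":
    # Hypothetical data set: 200 million 100 bp reads, 100 Mb genome.
    n_reads, read_length, genome_size = 200_000_000, 100, 100_000_000
    total = average_coverage(n_reads, read_length, genome_size)
    chunks = max_chunks(n_reads, read_length, genome_size, min_coverage=40.0)
    print(f"total coverage: {total:.0f}x, split into at most {chunks} chunks")
```

For paired-end data, count both mates as reads (or double the read-pair count), and keep the two mates of a pair in the same chunk when splitting.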

Hope that helps,
B.


--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html
