Thanks Bastien!
Will try to keep my chunks at a coverage > 30x then!
cheers,
Christoph
On 27/08/2015 01:57, Bastien Chevreux wrote:
On 25 Aug 2015, at 8:33 , Christoph Hahn <chrisi.hahni@xxxxxxxxx> wrote:
I am using MIRA to great effect for clipping/trimming my Illumina data for
small to medium sized eukaryotic genomes (via nop=0 and then extracting the
trimmed reads from the resulting maf file with miraconvert).
For some larger datasets, however, I run out of memory, so I was thinking of
simply splitting the data into smaller portions and running each subset through
MIRA separately, and then merging all resulting files in the end.
I am worried that this might actually affect the trimming as I am assuming that
MIRA also uses kmer frequencies for making decisions about what is to be
removed (?), so I wanted to run it by you first.
That is correct: splitting the data set will affect trimming.
Writing a standalone clipping tool which does not need to store the sequencing
data in memory has been on my list for quite a while. I already started changing a
couple of routines for this, however I haven’t finished and it may still
take a while … so don’t hold your breath.
In the meantime:
- for genomic data, if you can split it into chunks giving an average coverage of at
least 30x (better: 40-50x or more), you should be fine. Going below 20x will
probably result in losing some difficult areas (bidirectional GGCxG motifs and such).
- for RNASeq Illumina 100mers, running with 20 (better 30) million read-pairs
would keep all but the rarest transcripts (which, in fact, are almost
indistinguishable from sequencing errors).
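To turn the genomic rule of thumb above into a chunk count, a rough sketch in Python could look like the following. It is not part of MIRA; the function names and the example numbers (300 million 100 bp reads, a 200 Mb genome) are made up for illustration, and it assumes the reads are spread evenly across equally sized chunks so that each chunk gets the same share of the total coverage.

```python
def coverage(n_reads, read_len, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def max_chunks(n_reads, read_len, genome_size, min_cov=30.0):
    """Largest number of equal-sized chunks that each keep >= min_cov coverage.

    Returns 1 (do not split) when the whole data set is already at or
    below min_cov; in that case even the unsplit data misses the target.
    """
    total = coverage(n_reads, read_len, genome_size)
    return max(1, int(total // min_cov))


# Hypothetical data set: 300 million reads of 100 bp on a 200 Mb genome,
# i.e. 150x in total.
print(coverage(300_000_000, 100, 200_000_000))
print(max_chunks(300_000_000, 100, 200_000_000, min_cov=30))
print(max_chunks(300_000_000, 100, 200_000_000, min_cov=50))
```

With the 30x minimum this example data set could be split into at most five chunks, with the safer 50x target into three. Reads of a pair would presumably need to stay in the same chunk.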
Hope that helps,
B.