[mira_talk] Re: High coverage data set, RAM limitations, subsampling.

From: John Nash <john.he.nash@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Fri, 10 May 2013 12:13:59 -0400

On 2013-05-10, at 12:02 PM, Felipe Gajardo <felipe.gajardo.e@xxxxxxxxx> wrote:

> Hi Bastien and everybody,
> 
> I have a data set from an IonTorrent sequencing run of a bacterial genome 
> (target size ~2.5 Mb). Throughput is 757406080 bp, so I have ~300X coverage. 
> The data set contains two merged libraries: a mate-paired library with an 
> avg. read size of ~50 bp (I have already removed the internal adaptor 
> sequence), and a fragment library with an avg. read size of ~200 bp. I tried 
> to assemble this data, but I only have 8 GB of RAM, so MIRA crashes after a 
> few minutes of work. I then decided to take a subset of the reads amounting 
> to 100X coverage and assemble that (this time without a traceinfo file, 
> because I did not generate one for the subset). I took the whole mate-paired 
> library (~20% of the data set) and part of the fragment library.
> 
> $ mira --project=B0P1-8 --job=denovo,genome,accurate,iontor --notraceinfo
> 
> MIRA successfully assembled the data, obtaining:
> 
> Large contigs
> ===========
>   Length assessment:
>   ------------------
>   Number of contigs:    154
>   Total consensus:      2466152
>   Largest contig:       325074
>   N50 contig size:      55150
>   N90 contig size:      9797
>   N95 contig size:      3942
> 
> All contigs:
> ============
>   Length assessment:
>   ------------------
>   Number of contigs:    1375
>   Total consensus:      2810830
>   Largest contig:       325074
>   N50 contig size:      44588
>   N90 contig size:      365
>   N95 contig size:      283
> 
> Now I have some questions: 
> Is there a way to include the reads I left out of the assembly to complete 
> it (considering my RAM limitations)?
> Does MIRA know that some reads are mate-paired if there is no traceinfo 
> file?
> Would it be a better approach to assemble a subset consisting exclusively of 
> reads from the fragment library and then use the mate-pair information to 
> order the resulting contigs?

You do not have to trim down the traceinfo file.  I routinely leave it as 
generated.  Bastien told me that MIRA uses it only for the read names found in 
the downsized FASTQ file…

I have a question back for you. Many of us downsize our read sets to the "sweet 
spot" of coverage for the specific technology (e.g. 40x for 454, 70x or so for 
Illumina; I am not sure what it is for IonTorrent and PacBio). Do you think the 
remaining 200X that you did not use contains data that would improve on the 
initial assembly?  I must admit that I have been lazy and not tried it out, 
simply assuming that the first assembly with the optimal read coverage would 
suffice.
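
For anyone wanting to reproduce the downsizing, here is a rough sketch in Python of the arithmetic and of a random subsample. The function names and file handling are my own illustration, not anything that ships with MIRA. It assumes a plain 4-line-per-record FASTQ file, and it assumes the mate-paired library (~20% of the bases) is kept whole so that only the fragment library is thinned.

```python
import random


def coverage(total_bases, genome_size):
    """Average coverage = sequenced bases / genome size."""
    return total_bases / genome_size


def fragment_fraction(total_bases, genome_size, target_cov, kept_share):
    """Fraction of the remaining reads to keep to reach target_cov.

    kept_share is the share of all bases that is kept in full
    (here: the mate-paired library).
    """
    target_bases = target_cov * genome_size
    kept_bases = kept_share * total_bases
    rest_bases = total_bases - kept_bases
    frac = (target_bases - kept_bases) / rest_bases
    return max(0.0, min(1.0, frac))


def subsample_fastq(src, dst, fraction, seed=1):
    """Keep each 4-line FASTQ record with probability `fraction`.

    Assumes unwrapped FASTQ (exactly 4 lines per read).
    Returns the number of records written.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < fraction:
                fout.writelines(record)
                kept += 1
    return kept


# Numbers from the original post
TOTAL_BASES = 757_406_080
GENOME_SIZE = 2_500_000

print(round(coverage(TOTAL_BASES, GENOME_SIZE)))  # 303
frac = fragment_fraction(TOTAL_BASES, GENOME_SIZE, 100, 0.20)
print(round(frac, 3))  # 0.163
```

So with these numbers, roughly 16% of the fragment reads plus the complete mate-paired library lands at about 100X. The per-read coin flip only approximates the target, since read lengths vary, but at this depth that should not matter much.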

J

Follow-Ups:
- [mira_talk] Re: High coverage data set, RAM limitations, subsampling.
  - From: Bastien Chevreux

References:
- [mira_talk] High coverage data set, RAM limitations, subsampling.
  - From: Felipe Gajardo
