[mira_talk] Re: assembly runs out of temp space

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 23 Apr 2012 23:19:00 +0200

On Apr 23, 2012, at 14:06 , Adam Witney wrote:
> I am having trouble assembling a bacterial sequence without running out of 
> disk space on my non-NFS drive (it fills up the 48Gb of available space).

Hi Adam,

you'll hate me for this, but I simply have to make that joke: "get more disk 
space?" But read on for more practical help ;-)

> I have 2934687 reads using the 200bp IonTorrent kit, for a ~3Mb bacterial 
> genome. I have cut the Hash statistics out of the log_assembly (this file is 
> about 450Mb)

The first 5k lines of the log_assembly would have been nice.

> and put them here:
> I think it is probably related to the error profile of the reads, the quality 
> scores drop off quite quickly along the read. I have put the fastqc report 
> here:

First things first: which hash statistics are those? The ones before the first 
clipping, the ones before the second clipping, or the ones after that? In any 
case, I don't like that hash statistics file at all. Either some (unknown?) 
adaptors are still present or (maybe) there are homopolymer sequencing artifacts.

I've never worked with FASTQC, but 

  http://bugs.sgul.ac.uk/temp/66493_in.iontor_fastqc/fastqc_report.html#M3

makes me think that every one of your sequences starts with "tcag", and if I am 
not mistaken that's the last part of the adaptor. Did you either run 
"sff_extract" with the "-c" option to clip sequences or (preferred) make sure 
that MIRA read the accompanying XML file? If not, this would partly explain the 
behaviour you are seeing.
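
A quick way to check the "tcag" suspicion on the extracted reads is sketched 
below in plain Python. It is not part of MIRA or sff_extract. It assumes a 
FASTQ file with exactly four lines per record (no wrapped sequences), and the 
function name and the file name in the usage note are made up for illustration.

```python
import io


def prefix_fraction(handle, prefix="tcag"):
    """Return the fraction of reads whose sequence starts with `prefix`.

    Assumes a strict 4-line-per-record FASTQ layout; the comparison is
    case-insensitive.
    """
    prefix = prefix.lower()
    total = 0
    hits = 0
    for line_no, line in enumerate(handle):
        if line_no % 4 != 1:  # only the 2nd line of each record is sequence
            continue
        total += 1
        if line.strip().lower().startswith(prefix):
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Tiny in-memory example: 3 of the 4 reads start with the key.
    demo = io.StringIO(
        "@r1\nTCAGACGT\n+\nIIIIIIII\n"
        "@r2\ntcagTTTT\n+\nIIIIIIII\n"
        "@r3\nACGTACGT\n+\nIIIIIIII\n"
        "@r4\nTCAGGGGG\n+\nIIIIIIII\n"
    )
    print(prefix_fraction(demo))
```

On a real file this would be called as 
prefix_fraction(open("reads.fastq")), where "reads.fastq" is a placeholder 
name. A value close to 1.0 would suggest the key is still on the reads as 
MIRA sees them, unless the clips are supplied separately via the XML file.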


If you are sure MIRA got the clips from XML (or clipped reads from the start), 
on to possible solutions for you:
1) use MIRA 3.9.0. I also had problems with large hash statistics files and 
rewrote the code. The hash statistics code there uses more memory and is a tad 
slower, but substantially cuts the amount of disk space needed. In case you do 
not want to use 3.9.0 for the assembly, you can still use it for preprocessing 
only and then use the clipped reads in 3.4.x ... maybe that would be enough
2) as a last resort only: clip the reads yourself at 200bp ... at least that is 
what I would try when seeing
   http://bugs.sgul.ac.uk/temp/66493_in.iontor_fastqc/fastqc_report.html#M3
  Maybe even trim somewhere between 150 and 200bp, which is what 
   http://bugs.sgul.ac.uk/temp/66493_in.iontor_fastqc/fastqc_report.html#M10
  tells me.
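
For option 2, a hard length clip could be sketched as follows in plain Python. 
This is only an illustration, not a MIRA feature. It assumes a strict 
4-line-per-record FASTQ file, cuts from the right end only, and the function 
name, the default of 200 and the file names in the usage note are assumptions.

```python
import io


def trim_fastq(in_handle, out_handle, max_len=200):
    """Copy FASTQ records, cutting sequence and qualities to max_len bases.

    Assumes 4 lines per record. The '+' separator line is written back
    without any repeated read name. Returns the number of records written.
    """
    count = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of input
        seq = in_handle.readline().rstrip("\n")
        in_handle.readline()  # '+' line, content not needed
        qual = in_handle.readline().rstrip("\n")
        out_handle.write(header.rstrip("\n") + "\n")
        out_handle.write(seq[:max_len] + "\n")
        out_handle.write("+\n")
        out_handle.write(qual[:max_len] + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory example with max_len=3 so the effect is visible.
    src = io.StringIO("@r1\nACGTA\n+\nIIIII\n")
    dst = io.StringIO()
    trim_fastq(src, dst, max_len=3)
    print(dst.getvalue(), end="")
```

With real files this would look like 
trim_fastq(open("in.fastq"), open("trimmed.fastq", "w"), max_len=200), 
both names being placeholders. Note that the cut is counted from the first 
base in the file, so any key or adaptor bases still present at the start of 
the reads count towards the 200.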

Hope this helps and please do tell how it works out for you (or not).

Best,
  Bastien