[mira_talk] Re: 0.5TB not enough space?

  • From: Robert Bruccoleri <bruc@xxxxxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 15 Aug 2011 14:37:53 -0400

Dear Iddo,
There's another issue with your data: it looks noisy. Look at this section of the log file:

Measured avg. frequency coverage: 1014

Deduced thresholds:
Min normal cov: 405.6
Max normal cov: 1622.4
Repeat cov: 1926.6
Heavy cov: 8112.0
Crazy cov: 20280.0
Mask cov: 101400

Repeat ratio histogram:
0       5028189
1       837017
2       269532
3       37454
4       4716

In a clean sequence file from a genome sequencing run with decent coverage, the repeat ratio histogram will show the "1" bin as the biggest. The fact that the "0" bin is the biggest here is a sign that your sequences are filled with random sequence that does not come from your bacteria. That is even more true at such a high coverage.
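
For anyone who wants to run the same check on their own log, here is a minimal Python sketch. It is not part of MIRA; the function names are made up for illustration, and it assumes the histogram has been copied out of the log as plain "bin count" lines like the ones quoted above.

```python
# Repeat ratio histogram copied from the log excerpt above (bin, count).
LOG_EXCERPT = """\
0       5028189
1       837017
2       269532
3       37454
4       4716
"""


def parse_histogram(text):
    """Turn 'bin count' lines into a {bin: count} dict (hypothetical helper)."""
    hist = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            hist[int(fields[0])] = int(fields[1])
    return hist


def largest_bin(hist):
    """Return the bin number holding the highest count."""
    return max(hist, key=hist.get)


hist = parse_histogram(LOG_EXCERPT)
total = sum(hist.values())
top = largest_bin(hist)
frac0 = hist[0] / total

print(f"largest bin: {top}")
print(f"share of entries in bin 0: {frac0:.1%}")
if top != 1:
    print("warning: the '1' bin is not the largest; the data may be noisy")
```

With the numbers above, bin 0 holds roughly 81% of all entries, which is the pattern described here as a sign of noisy data.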


Bastien Chevreux wrote:

On Monday 15 August 2011 19:36:32 Iddo Friedberg wrote:

> Oops. I put up the wrong logfile. The run was definitely not on an NFS system

Heavens! Is there any valid reason you set up an assembly with a coverage >= 1000x ? No, not ten, not one hundred ... one thousand! You are aware that this actually decreases the quality of a genome assembly, right? Non-random errors in the sequencing will be the death of it.

You might want to read quickly through


especially the small paragraph labelled with a nice, warm and re-assuring "Warning".

Back to your project: slash down the amount of data by a factor of ten and all will be well :-)
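
One possible way to cut the data by a factor of ten is random subsampling of reads. The Python sketch below is only an illustration under stated assumptions: it assumes the reads are in FASTQ format with exactly four lines per record (the thread does not say which format was used), and `subsample_fastq` is a hypothetical helper, not a MIRA tool. For paired-end data the same selection would have to be applied to both mates, which this sketch does not handle.

```python
import random


def subsample_fastq(lines, keep_fraction, seed=0):
    """Keep each 4-line FASTQ record with probability keep_fraction.

    Assumes unwrapped FASTQ: header, sequence, '+', quality per record.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    kept = []
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            if rng.random() < keep_fraction:
                kept.extend(record)
            record = []
    return kept


# Tiny made-up data set, only to show the call.
reads = []
for i in range(1000):
    reads += [f"@read{i}", "ACGT", "+", "IIII"]

sample = subsample_fastq(reads, keep_fraction=0.1)
print(len(sample) // 4, "of", len(reads) // 4, "reads kept")
```

Keeping about one read in ten would bring a measured average of 1014 down to roughly 100, assuming that figure tracks the real coverage.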


PS: and I'm actually now thinking of adding another warning flag which will let MIRA stop if it detects a coverage >= 150x in genome de-novo ... anyone having an opinion on this?

Robert Bruccoleri
Audacious Energy, LLC and Congenomics, LLC
