[mira_talk] Re: 0.5TB not enough space?

Dear Iddo,
There's another issue with your data: it looks noisy. Look at this section of the log file:

Measured avg. frequency coverage: 1014

Deduced thresholds:
-------------------
Min normal cov: 405.6
Max normal cov: 1622.4
Repeat cov: 1926.6
Heavy cov: 8112.0
Crazy cov: 20280.0
Mask cov: 101400

Repeat ratio histogram:
-----------------------
0       5028189
1       837017
2       269532
3       37454
4       4716


For a clean set of reads from a genome sequencing project with decent coverage, the repeat ratio histogram will show the "1" bin as the biggest. Here the "0" bin is the biggest, which is a sign that your reads contain a lot of random sequence that does not come from your bacterium. At such a high coverage, that sign is even stronger.
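To make the check above concrete, here is a minimal sketch that parses the repeat ratio histogram copied from the log excerpt and reports which bin is largest and what share of the total each bin holds. The function names are made up for illustration and are not part of MIRA; the only input is the five histogram lines quoted above.

```python
# Repeat ratio histogram, copied verbatim from the log excerpt above.
HISTOGRAM_TEXT = """\
0       5028189
1       837017
2       269532
3       37454
4       4716
"""


def parse_histogram(text):
    """Return {bin: count} from whitespace-separated 'bin count' lines."""
    bins = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            bins[int(fields[0])] = int(fields[1])
    return bins


def largest_bin(bins):
    """Return the bin label with the highest count."""
    return max(bins, key=bins.get)


def bin_fractions(bins):
    """Return each bin's share of the total count."""
    total = sum(bins.values())
    return {label: count / total for label, count in bins.items()}


bins = parse_histogram(HISTOGRAM_TEXT)
fractions = bin_fractions(bins)

if __name__ == "__main__":
    for label in sorted(bins):
        print(f"bin {label}: {bins[label]:>8} ({fractions[label]:.1%})")
    print("largest bin:", largest_bin(bins))
```

With these numbers the "0" bin holds roughly four fifths of all entries, which is the pattern described above as a sign of noisy data.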

Regards,
Bob

Bastien Chevreux wrote:

On Monday 15 August 2011 19:36:32 Iddo Friedberg wrote:

> Oops. I put up the wrong logfile. The run was definitely not on an NFS system


Heavens! Is there any valid reason you set up an assembly with a coverage >= 1000x? No, not ten, not one hundred ... one thousand! You are aware that this actually decreases the quality of a genome assembly, right? Non-random errors in the sequencing will be the death of it.


You might want to read quickly through

http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_seqadv_a_word_or_two_on_coverage


especially the small paragraph labelled with a nice, warm and reassuring "Warning".


Back to your project: slash down the amount of data by a factor of ten and all will be well :-)


B.


PS: and I'm actually now thinking of adding another warning flag which will let MIRA stop if it detects a coverage >= 150x in a de-novo genome assembly ... does anyone have an opinion on this?
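As a rough illustration of the "factor of ten" advice above, the sketch below picks a random tenth of a list of reads and estimates the resulting coverage from the 1014x average reported in the log. This is not a MIRA feature: `subsample` is a hypothetical helper, and the records are assumed to be already loaded into memory (for paired-end data, each record should be a whole pair so mates stay together).

```python
import random


def subsample(records, fraction, seed=0):
    """Return a random subset of records, keeping their original order.

    Exactly round(len(records) * fraction) records are kept. The seed
    makes the selection reproducible.
    """
    records = list(records)
    keep_count = round(len(records) * fraction)
    rng = random.Random(seed)
    keep_indices = sorted(rng.sample(range(len(records)), keep_count))
    return [records[i] for i in keep_indices]


def expected_coverage(current_coverage, fraction):
    """Coverage scales linearly with the fraction of reads kept."""
    return current_coverage * fraction


if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(1000)]  # placeholder read names
    kept = subsample(reads, 0.1, seed=42)
    print(len(kept), "of", len(reads), "reads kept")
    print("estimated coverage:", round(expected_coverage(1014, 0.1), 1))
```

Keeping a tenth of the reads would bring the estimated coverage from about 1014x to about 101x, which is below the 150x limit floated in the PS.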




begin:vcard
fn:Robert Bruccoleri
n:Bruccoleri;Robert
org:Audacious Energy, LLC and Congenomics, LLC
adr:;;;;;;USA
email;internet:bruc@xxxxxxx
title:President
version:2.1
end:vcard
