Dear Iddo,

There's another issue with your data: it looks noisy. Look at this section of the log file:

    Measured avg. frequency coverage: 1014

    Deduced thresholds:
    -------------------
    Min normal cov: 405.6
    Max normal cov: 1622.4
    Repeat cov: 1926.6
    Heavy cov: 8112.0
    Crazy cov: 20280.0
    Mask cov: 101400

    Repeat ratio histogram:
    -----------------------
    0 5028189
    1 837017
    2 269532
    3 37454
    4 4716

The repeat ratio histogram of a clean sequence file from a genome sequencing with decent coverage will show the "1" bin to be the biggest. The fact that the "0" bin is biggest is a sign that your sequences are filled with random sequence not from your bacteria. That's even more true with such a high coverage.
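
To make the histogram check concrete, here is a minimal Python sketch that takes the bin counts from the log excerpt, finds the largest bin, and reports each bin's share of the total. The function name `summarize_histogram` is made up for this illustration and is not part of MIRA; the "bin 1 should be the biggest" test is just the rule of thumb described above, not an official MIRA diagnostic.

```python
# Bin counts copied from the "Repeat ratio histogram" section of the log.
histogram = {0: 5028189, 1: 837017, 2: 269532, 3: 37454, 4: 4716}


def summarize_histogram(hist):
    """Return (largest_bin, fractions), where fractions maps bin -> share of total.

    Hypothetical helper for illustration only; MIRA does not provide it.
    """
    total = sum(hist.values())
    fractions = {bin_id: count / total for bin_id, count in hist.items()}
    largest_bin = max(hist, key=hist.get)
    return largest_bin, fractions


largest_bin, fractions = summarize_histogram(histogram)

for bin_id in sorted(fractions):
    print(f"bin {bin_id}: {fractions[bin_id]:.1%}")

# Rule of thumb from the message: a clean data set peaks in bin 1.
if largest_bin != 1:
    print(f"Largest bin is {largest_bin}, not 1: the data set looks noisy.")
```

With the counts above, bin 0 holds roughly 81% of the entries, which is what makes the data look noisy.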

Regards,
Bob

Bastien Chevreux wrote:
On Monday 15 August 2011 19:36:32 Iddo Friedberg wrote:
> Oops. I put up the wrong logfile. The run was definitely not on an NFS
> system

Heavens! Is there any valid reason you set up an assembly with a coverage >= 1000x? No, not ten, not one hundred ... one thousand! You are aware that this actually decreases the quality of a genome assembly, right? Non-random errors in the sequencing will be the death of it.

You might want to read quickly through
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_seqadv_a_word_or_two_on_coverage
especially the small paragraph labelled with a nice, warm and re-assuring "Warning".

Back to your project: slash down the amount of data by a factor of ten and all will be well :-)

B.

PS: and I'm actually now thinking of adding another warning flag which will let MIRA stop if it detects a coverage >= 150x in genome de-novo ... anyone having an opinion on this?
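
One way to cut the data by a factor of ten is to keep a random tenth of the reads. The Python sketch below is only an illustration under stated assumptions: the function name `subsample_fastq` and the file names are hypothetical, the input is assumed to be an uncompressed FASTQ file with exactly four lines per record, and the reads are assumed to be unpaired (paired reads would have to be kept or dropped together). With a measured coverage of about 1014x, keeping 10% should leave roughly 100x.

```python
import random
from itertools import islice


def subsample_fastq(in_path, out_path, keep_fraction=0.1, seed=42):
    """Copy a random subset of FASTQ records from in_path to out_path.

    Hypothetical helper: assumes 4-line records and unpaired reads.
    Returns (kept, total) record counts.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = 0
    total = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = list(islice(src, 4))  # header, sequence, '+', qualities
            if len(record) < 4:
                break  # end of file (or a truncated trailing record)
            total += 1
            if rng.random() < keep_fraction:
                dst.writelines(record)
                kept += 1
    return kept, total
```

Usage, with placeholder file names: `subsample_fastq("reads.fastq", "reads_10pct.fastq")`. Each record is kept independently with probability `keep_fraction`, so the kept count is close to, but not exactly, a tenth of the input.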

begin:vcard
fn:Robert Bruccoleri
n:Bruccoleri;Robert
org:Audacious Energy, LLC and Congenomics, LLC
adr:;;;;;;USA
email;internet:bruc@xxxxxxx
title:President
version:2.1
end:vcard