[mira_talk] Lots of contigs, then segmentation fault

  • From: Egon Ozer <e-ozer@xxxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 18 Apr 2011 16:07:06 -0500

I must be doing something wrong...

I'm trying to do a hybrid assembly project using 2,008,245 paired 454 reads and 
5,722,894 PE Illumina reads.  The Illumina reads are 53 bp (short, I know, but 
that's what I have to work with) and correspond to about 45x coverage of my 
organism.  
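
For reference, the 45x figure follows from coverage = (number of reads x read 
length) / genome size. Below is a minimal sketch of that arithmetic, assuming 
that 5,722,894 counts individual reads rather than pairs and that every read is 
the full 53 bp. The function names are illustrative only.

```python
def coverage(num_reads, read_len, genome_size):
    """Expected average coverage: total sequenced bases / genome size."""
    return num_reads * read_len / genome_size


def implied_genome_size(num_reads, read_len, cov):
    """Invert the coverage formula to estimate genome size in bases."""
    return num_reads * read_len / cov


illumina_reads = 5_722_894  # from the post; assumed to be single reads, not pairs
read_len = 53               # bp, from the post
cov = 45                    # approximate coverage stated in the post

size = implied_genome_size(illumina_reads, read_len, cov)
print(f"total bases: {illumina_reads * read_len:,}")   # total bases: 303,313,382
print(f"implied genome size: {size / 1e6:.2f} Mb")     # implied genome size: 6.74 Mb
```

Under those assumptions the numbers imply a genome of roughly 6.7 Mb. If the 
read count actually refers to pairs, the total bases would double.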

The program runs 6 and a half hours, then seg faults.  When I check the log 
output, I see that it was working on contig # 24994 (!) when it died.

The seg fault is one problem, and it comes and goes.  I have been able to run 
Mira to completion before on version 3.2.1.11 (by simply restarting the run 
after the segmentation fault), but the bigger problem is that that run 
produced 80126 contigs, the largest of which was 1526 bases, with an N50 of 97.  

What am I doing wrong? Why does assembly of the 454 reads alone in Celera 
Assembler produce 49 contigs with an average size of 137,888 bases, and assembly 
of the Illumina reads alone in Velvet produce 595 contigs with an N50 of 
122385, while my Mira output is so miserable (when it actually finishes)?  
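
To make the N50 comparison concrete: N50 is the contig length L such that 
contigs of length L or longer contain at least half of all assembled bases. 
Below is a minimal sketch of that calculation. The function name and the toy 
lengths are made up for illustration and are not taken from any of the 
assemblies above.

```python
def n50(lengths):
    """Return N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy example (made-up lengths, not from any assembly in this post)
print(n50([100, 80, 60, 40, 20]))  # 80
```

By this definition, an N50 of 97 with 53 bp Illumina reads means that about 
half of the assembled bases sit in contigs shorter than two read lengths.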

Please give me a hand.  I have been really counting on hybrid assembly to help 
me get as complete and accurate a sequence for my bacterium as possible, but I 
seem to be missing the mark on something.

Thanks.

- E

Supplemental information:

My most recent attempt at this assembly was on Mira version 
3.12.1.15_dev_darwin10.6.0_x86_64_static (but I've had this same problem with 
versions 3.2.1.5, 3.2.1.7, and 3.2.1.11 as well).  

Here's how I prepare my files for mira:

for 454 reads:
sff_extract_0_2_8.py -s out.fasta -q out.qual -x out.xml -l linkers.fa -i "insert_size:3000,insert_stdev:900" in1.sff in2.sff
ln -s out.fasta proj_in.454.fasta
ln -s out.qual proj_in.454.fasta.qual
ln -s out.xml proj_traceinfo_in.454.xml

for Illumina reads:
cat s_8_1_sequence.txt s_8_2_sequence.txt > combined.fastq
ln -s combined.fastq proj_in.solexa.fastq

my command line:
mira --project=proj --job=denovo,genome,accurate,454,solexa -GE:not=16 \
  SOLEXA_SETTINGS -GE:tismin=150:tismax=350 \
  454_SETTINGS -DP:ure=1 -CL:emrc=1 \
  >&log_assembly.txt

I'm using a Mac Pro running Snow Leopard with 64 GB of RAM and 1.65 TB of free 
hard drive space.

Here's the last little bit of my log file right before the fault:
-------------- Contig statistics ----------------
Contig id: 24994
Contig length: 86

                      Sanger         454      PacBio      Solexa       Solid
Num. reads                 0           0           0          27           0
100% merged reads          -           -           -           0           0
Avg. read len              0           0           0          51           0
Max. coverage              0           0           0          27           0
Avg. coverage          0.000       0.000       0.000      16.023       0.000

Max. contig coverage: 27
Avg. contig coverage: 16.023

Consensus contains:     A: 14   C: 33   G: 27   T: 12   N: 0
                        IUPAC: 0        Funny: 0        *: 0
GC content: 69.767%
-------------------------------------------------
Timing BFC cout constats: 228
Localtime: Mon Apr 18 15:00:16 2011
bfc 10/0
Timing BFC edit tricky1: 1
Marking possibly misassembled repeats:  [0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.... [100%] done step 1, starting step 2:done. Found none.
Timing BFC mark reps: 515
bfc 11/0
bfc 12/0
Timing BFC delPSHP: 1
bfc 13/0
bfc 14/0
bfc 15/0
bfc 16/0
Transfering reads to readpool.
Timing BFC rp transfer: 102
Done.
bfc 17/0
bfc 19
Storing contig ... 10Searching for: SROs UNSs IUPACs, preparing needed data: 
sorting tags ... fetching consensus for strain0 ...done.
Starting search:
done with search
Transfering tags to readpool.
Saving temp CAF ... done.
done.
Timing BFC store con: 1250
Timing BFC loop total: 11861
bfc 1
Localtime: Mon Apr 18 15:00:16 2011

Timing BFC unused: 32509
Unused: 2326118
AS_used_ids.size(): 7731139
bfc 2
Timing BFC prelim1: 7
bfc 3
bfc 4
bfc 5
Timing BFC setup AS_used_ids: 1
bfc 6/0
Timing BFC discard con: 3
bfc 7/0
Building new contig 24995
Localtime: Mon Apr 18 15:00:16 2011
Unused reads: 2326118
bfc 8/0
assemblymode_mapping: 0
use genomic pathfinder: 1
Timing n4_basicCSBSSetup cleararrays: 1522
Timing n4_basicCSBSSetup init pf_banned: 0
Timing n4_basicCSBSSetup total: 1530
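
Since the log prints a "Contig id:" / "Contig length:" pair for each contig, the 
contig sizes can be pulled out of log_assembly.txt for a quick look at the 
distribution. Below is a minimal sketch, assuming every contig is reported in 
exactly the format shown in the excerpt above. The function name is 
hypothetical and not part of Mira.

```python
import re

CONTIG_ID = re.compile(r"^Contig id:\s*(\d+)")
CONTIG_LEN = re.compile(r"^Contig length:\s*(\d+)")


def contig_lengths(lines):
    """Map contig id -> length from 'Contig id:' / 'Contig length:' line pairs."""
    lengths = {}
    current = None
    for line in lines:
        m = CONTIG_ID.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = CONTIG_LEN.match(line)
        if m and current is not None:
            lengths[current] = int(m.group(1))
            current = None
    return lengths


# Sample lines copied from the log excerpt above
sample = """\
-------------- Contig statistics ----------------
Contig id: 24994
Contig length: 86
Max. contig coverage: 27
""".splitlines()

print(contig_lengths(sample))  # {24994: 86}
```

On the real log, the same function can be fed an open file handle, for example 
contig_lengths(open("log_assembly.txt")). If an id is reported more than once, 
the last occurrence wins.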