[mira_talk] Re: mira 3.0.1 crashes in mapping mode

  • From: Tony Travis <a.travis@xxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 29 Mar 2010 17:26:13 +0100

On 15/03/10 11:50, Tony Travis wrote:
[...]
Please note that mapping assemblies made with 3.0.1 and 3.0.2 will have broken
ACE, CAF and MAF files because of this error (provided they ran through).

Hello, Bastien.

OK, I re-ran the mapping assembly over the weekend using 3.0.3, and it
completed successfully.

Hello, Bastien.

Looks like I spoke too soon: We are still having problems doing mapping assemblies of ~40K Sanger reads + 1.2M 454 reads. MIRA crashes with an error complaining that 'rr_####<n>####' is longer than 29900 base-pairs:

mira -project=map \
        -fasta \
        -job=mapping,genome,sanger,454 \
        -SB:lb=on -FN:bbin=all_backbone_in.fasta \
        SANGER_SETTINGS -LR:rns=fr,mxti=on -FN:xtii=all_traceinfo_in.sanger.xml 
\
        454_SETTINGS 
-FN:fai=all_in.454.fasta,fqui=all_in.454.fasta.qual,xtii=all_traceinfo_in.454.xml
This is MIRA V3.0.3 (production version).
...
Fatal Error (may be due to problems of the input data):
"Read rr_####409#### is longer than 29900 bases. SKIM cannot handle this, 
aborting.
"

->Thrown: uint32 Skim::computePartition(uint32 maxmemusage, bool 
computenumpartitions)

->Caught: main

Program aborted, probably due to error in the input data or parametrisation.
Please check the output log for more information.
For help, please write a mail to the mira talk mailing list.

CWD: /home/rri20/genome/Roseburia/hominis/MIRA/current
make: *** [map] Aborted

Are 'rr_####<n>####' synthetic reads from suspected repeat regions?

I tried using '-SK:bph=31' but it made no difference...

I also had problems trying to read Sanger quality scores for reads in a MIRA assembly into Consed from a phd.ball that is consistent with a Phrap assembly - I used the same FASTA sequence and quality files for both assemblies, but when I compare the reads in the Phrap and MIRA ace files they are slightly different and I can't explain why.

I'm using the phredPhrap pipeline to vector screen our Sanger reads, then assemble the fasta.screen and fasta.screen.qual files created by the pipeline using both Phrap and MIRA to compare the assemblies. I expected the reads, not the contigs, to be the same in both ace files but they are not: I used "readseq -degap=\*" to reformat the sequences from the same read in each ace file in unpadded GenBank format to make it easier to compare them. For example:

diff Phrap.read.gb MIRA.read.gb
1,2c1,2
< LOCUS       RA10-p04-K04_f   1286 bp NUCLEOTIDE             UNA       
29-Mar-2010
< DEFINITION  RA10-p04-K04.f 1286 0 0
---
> LOCUS       RA10-p04-K04_f   1151 bp NUCLEOTIDE             UNA       
29-Mar-2010
> DEFINITION  RA10-p04-K04.f 1151 0 0
18,22c18,22
<       841 TTAATTCAAT GTCTAATCGT TCTATAGTTT TCTTGTTTTT CACTCCAGTC TTTACGTTTT
<       901 ATCACTTTTA ATTTTTGCTA AAGCCCTAAT AATCTCAATA TAATATTGCT TCCGTGCATT
<       961 ATCCGTATTC CGCATTAACA GCTCAAAACC ATTTTCTAAG CCCAAAGAAT GGTTTTTATT
<      1021 TTTTCAAAAG TCTTTAGTTT AAAATTTTTA ATCCCTAAAC CCAACCTCTC TCCCAATTAC
<      1081 CAATTAACTA ATTTCATTAA AAGGGAATTT AAAATTTCTT GGAATTACCC CGAAAAACA
---
>       841 TTAATTCAAT GTCTAATCGT TCTATAGTTT TCTTGTTTTC ACTCCAGTCT TTACGTTTTA
>       901 TCACTTTTAT TTTTGCTAAA GCCTAATAAT CTCAATATAA TATTGCTTCC GTGCATTATC
>       961 CGTATTCCGC ATTAACAGCT CAAAACCATT TTCTAAGCCC AAAGAATGGT TTTTATTTTT
>      1021 TCAAAAGTCT TTAGTTTAAA ATTTTTAATC CCTAAACCCA ACCTCTCTCC CAATTACCAA
>      1081 TTAACTAATT TCATTAAAAG GGAATTTAAA ATTTCTTGGA ATTACCCCGA AAAACA

I'm not surprised that the number of padded bases is different, but I am surprised that MIRA has deleted bases from the read! This makes it impossible to use a 'phd.ball' created from the original 'phd' files, or the fasta.screen and fasta.screen.qual created by phredPhrap, to load quality values for the Sanger reads when using Consed to view an 'ace' file produced by a MIRA assembly. Consed complains that the bases in the ace file differ from the bases in the 'phd' file:

ace file: all_out.ace
Version 19.0 (090206)
exception thrown: ace file says base at nPaddedSequencePos 893 or nBases 880 is
c but PHD file says t which on top strand is t for read RA10-p04-K04.f at line 
23206 in file of phd files phd.ball

The reads in the MIRA ace file are consistent with the checkpointed MIRA read-pool, but the read-pool itself is different to the FASTA file read in by MIRA:

diff Fasta.read.gb chkpt.read.gb
1,4c1
< LOCUS       RA10-p04-K04_f   1139 bp    DNA             UNA       29-Mar-2010
< DEFINITION  RA10-p04-K04.f  CHROMAT_FILE: RA10-p04-K04.f PHD_FILE:
<             RA10-p04-K04.f.phd.1 CHEM: term DYE: big TIME: Fri Mar 12 
15:39:23<             2010 TEMPLATE: RA10-p04-K04 DIRECTION: fwd
---
> LOCUS       chkpt_read_1            1136 bp
20,24c17,25
<       841 TTAATTCAAT GTCTAATCGT TCTATAGTTT TCTTGTTTTT CACTCCAGTC TTTACGTTTT
<       901 ATCACTTTTA ATTTTTGCTA AAGCCCTAAT AATCTCAATA TAATATTGCT TCCGTGCATT
<       961 ATCCGTATTC CGCATTAACA GCTCAAAACC ATTTTCTAAG CCCAAAGAAT GGTTTTTATT
<      1021 TTTTCAAAAG TCTTTAGTTT AAAATTTTTA ATCCCTAAAC CCAACCTCTC TCCCAATTAC
<      1081 CAATTAACTA ATTTCATTAA AAGGGAATTT AAAATTTCTT GGAATTACCC CGAAAAACA
---
>       841 TTAATTCAAT GTCTAATCGT TCTATAGTTT TCTTGTTTTC ACTCCAGTCT TTACGTTTTA
>       901 TCACTTTTAT TTTTGCTAAA GCCTAATAAT CTCAATATAA TATTGCTTCC GTGCATTATC
>       961 CGTATTCCGC ATTAACAGCT CAAAACCATT TTCTAAGCCC AAAGAATGGT TTTTATTTTT
>      1021 TCAAAAGTCT TTAGTTTAAA ATTTTTAATC CCTAAACCCA ACCTCTCTCC CAATTACCAA
>      1081 TTAACTAATT TCATTAAAAG GGAATTTAAA ATTTCTTGGA ATTACCCCGA AAAACA
> //
> LOCUS       chkpt_read_2               1 bp
> ORIGIN
>         1 N

The chromatogram 'scf' files are not accessible to MIRA, and I did not expect any automatic editing of the reads to be done:

  Edit options (-ED):
        Automatic contig editing (ace)              :  [san]  no
                                                       [454]  yes
     Sanger only:
        Strict editing mode (sem)                   : no
        Confirmation threshold in percent (ct)      : 50


Am I misunderstanding something about how MIRA manages its reads?

What I want to do is examine a MIRA assembly in Consed with library and quality information included: Normally, a 'phd.ball' file is used for this purpose. Newbler, for example, creates a 'phd.ball' automatically when producing an 'ace' file. Could MIRA create a 'phd.ball' to load quality scores and library info into Consed for the MIRA ace file?

Thanks,

  Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis@xxxxxxxxxx, http://bioinformatics.rri.sari.ac.uk/~ajt

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: