[mira_talk] Re: mira 3.0.1 crashes in mapping mode

  • From: Tony Travis <a.travis@xxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 30 Mar 2010 10:40:07 +0100

On 30/03/10 00:04, Bastien Chevreux wrote:
On Montag 29 März 2010 Tony Travis wrote:
Looks like I spoke too soon: We are still having problems doing mapping
assemblies of ~40K Sanger reads + 1.2M 454 reads. MIRA crashes with an
[...]
error complaining that 'rr_####<n>####' is longer than 29900 base-pairs:
[...]
Are 'rr_####<n>####' synthetic reads from suspected repeat regions?

Errrm, not quite. RR stands for "Rail-Reads". Synthetic reads which make the
overall life in mapping a bit easier when (mis)using an assembly engine that
was originally built for de-novo.

Hello, Bastien.

Thanks for explaining about 'rr' - I thought they were synthetic reads, but I didn't realise they were for the mapping 'backbone'.

But ... I'm wondering about the length of the rails MIRA built. Say, some of
your Sanger/454 reads wouldn't be some artificial "reads" with a length of>=
15k or so?

Yes, a few are synthetic reads that Rustam constructed manually from the consensus using Consed but these are only about 3-5Kbp.

Can you please send me the complete output log? Then I might know more ... I
don't think it's a bug but it might be something where I can put in some
additional checks.

OK, but...

In the mean time: restart an assembly and force -SB:brl:bro to fixed values.
For "real" data with a Sanger/454 mix I think that -SB:brl=2000:bro=1000
should be ok.

...using the -SB settings you suggested, it's got past the point where it previously crashed and is still running. I'll post a follow-up here to let you know if the mapping assembly succeeds.

[...]
I'm not surprised that the number of padded bases is different, but I am
surprised that MIRA has deleted bases from the read!

MIRA has an integrated editor. It actually *edits* reads (Sanger, 454 and
Solexa) if the situation allows for.

On the other hand, the ACE / phdball combo has no way whatsoever to actually
model these edits. This is the main reason why ACE is *evil* and should not be
used.

I understand that MIRA *can* edit reads, but I thought it only did that if it could access the chromatogram files to check dubious base-calls. I didn't realise that MIRA actually drops bad base-calls from reads. I thought it simply corrected them by base-calling from the chromatogram.

This makes it
impossible to use a 'phd.ball' created from the original 'phd' files, or
the fasta.screen and fasta.screen.qual created by phredPhrap, to load
quality values for the Sanger reads when using Consed to view an 'ace'
file produced by a MIRA assembly. Consed complains that the bases in the

Welcome to the club. I perhaps should try to contact David about supporting
other input formats ...

Yes, it would be good if Consed could read 'caf' files, for example.

[...]
- a hybrid contig was built.
- MIRA sees it is allowed to use contig editing with 454 reads
- it uses 454 reads (only) to build editing hypotheses
- some of them are: "delete this whole column"
- unfortunately, Sanger reads also cover this column and therefore the
   corresponding base there also gets delete.
- hence, some Sanger reads also get edited.

OK, thanks for explaining what is happening.

I do agree that this is somewhat surprising ... need to make this clearer in
the docs.

Yes ;-)

What I want to do is examine a MIRA assembly in Consed with library and
quality information included: Normally, a 'phd.ball' file is used for
this purpose. Newbler, for example, creates a 'phd.ball' automatically
when producing an 'ace' file. Could MIRA create a 'phd.ball' to load
quality scores and library info into Consed for the MIRA ace file?

I suppose it could. I never found the time to analyse the format of these
things though ... there are quite a number of other things I need to tackle
first. I spoiled: as I use gap4 with CAF (and caf2gap), I don't have any of
these problems :-)

In fact, a phd.ball is just a concatenation of phd files in a single file and I created a phd.ball for the 454 fasta+quals using the attached perl script "qual2ball". I've tried caf2gap, but the 32-bit version runs out of memory on our 3.7Mbp genome and the 64-bit version segfaults.

I've never been able to read our Phrap ace files into gap4, even when I visited the Sanger a few years ago and asked James Bonfield for advice. I'm not sure why, but it might be something to do with our read-naming convention that uses '-' characters to improve (human) readability:

  library-plate-well.[cfr]

James could not get it to work on our assemblies either, so we continue to use Consed. However, recently, I've been experimenting with gap5 :-)

Bye,

  Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis@xxxxxxxxxx, http://bioinformatics.rri.sari.ac.uk/~ajt
#! /usr/bin/perl -w
# @(#)qual2ball.pl  2010-03-18  A.J.Travis

#
# Create Phred 'phd.ball' file from FASTA base-calls and quality scores
#

# base-calls
unless ( open( INB, "$ARGV[0]" ) || open( INB, "$ARGV[0].fna" ) ) {
    die "can't open $ARGV[0] or $ARGV[0].fna\n";
}

# quality scores
unless ( open( INQ, "$ARGV[0].qual" ) ) {
    die "can't open $ARGV[0].qual\n";
}

# read initial FASTA header
$headb = <INB>;
chomp($headb);
$headq = <INQ>;
chomp($headq);

while ( !eof INB ) {

    # read base calls
    $nbase = 0;
    while ( $lineb = <INB> ) {
        chomp($lineb);
        if ( $lineb =~ /^>/ ) {
            last;
        }
        @bases = split( //, $lineb );
        foreach $base (@bases) {
            $conb[ $nbase++ ] = $base;
        }
    }

    # read quality scores
    $nqual = 0;
    while ( $lineq = <INQ> ) {
        chomp($lineq);
        if ( $lineq =~ /^>/ ) {
            last;
        }
        @qnos = split( ' ', $lineq );
        foreach $qual (@qnos) {
            $conq[ $nqual++ ] = $qual;
        }
    }

    unless ( ( $lineb =~ /^>/ || eof INB ) && ( $lineq =~ /^>/ || eof INQ ) ) {
        die "failed to parse $ARGV[0]\n";
    }

    # parse FASTA name from header
    ($fname) = ( $headb =~ />(\S+)/ );

    $now = localtime;
    print("BEGIN_SEQUENCE $fname\n\n");
    print("BEGIN_COMMENT\n\n");
    print("CHROMAT_FILE: none\n");
    print("ABI_THUMBPRINT: none\n");
    print("PHRED_VERSION: not called by phred\n");
    print("CALL_METHOD: qual2ball\n");
    print("QUALITY_LEVELS: 99\n");
    print("TIME: $now\n");
    print("CHEM: unknown\n");
    print("DYE: unknown\n");

    print("END_COMMENT\n\n");
    print("BEGIN_DNA\n");

    for ( $base = 0 ; $base < $nbase ; $base++ ) {
        print("$conb[$base] $conq[$base] 0\n");
    }
    print("END_DNA\n\n");
    print("END_SEQUENCE\n");

    $headb = $lineb;
    $headq = $lineq;
}

Other related posts: