[mira_talk] Re: 454 homopolymers

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 22 Mar 2011 20:50:37 +0100

On Tuesday 22 March 2011 12:32:50 Yvan Wenger wrote:
> I'm in a similar situation, although I'm working on cDNA with a large
> eukaryotic transcriptome (without reference). I get a very high
> representation of sequences I know of with Mira, but frequent 1-base
> insertions/deletions when compared to Newbler 2.5 output.
> 
> Finally in my experience, Newbler performs slightly better with the
> sff files as input than with fasta+qual, but the difference is not
> dramatic. I still see more "future frameshifts" after in-silico
> translation of mira seqs than after newbler seqs even when the input
> is the same for both.

(here too: which version of MIRA?)

If you have some data (MAF or CAF) containing a couple of places with these 
wrong calls, I'd be happy to look at whether I can improve consensus calling.

> Finally about the difference between chromatograms and fasta(+qual), I
> was wondering if there is any tool allowing to remove adapters/vector
> sequences directly in the sff or xml file used by mira? The problem
> here is that my sff file is correct, but some prior adapters used for
> normalisation are still in the sequences.

Use the SSAHA2/SMALT clipping functions of MIRA. In short: just use 
FASTA+QUAL+XML as you do normally, but tell MIRA you have some more info in 
ssaha2/smalt format to look at.

You create the ssaha2/smalt file by running your sequences against the 
adaptor sequences.

Note: nowadays I recommend using SMALT rather than SSAHA2.

B.

PS: because I had to implement some adaptor screening for Solexa, I think one 
of the next versions will have a facility to have MIRA perform this kind of 
screening for any sequencing tech.
PPS: related question: when screening for your adaptors in 454, are they of 
type
a) when an adaptor occurs, mask it and everything to the right of it in the read
or
b) when an adaptor occurs, mask it and everything to the left of it in the read
or
c) both a) and b)?
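
To make the three cases concrete, here is a minimal Python sketch. It is only 
an illustration and not how MIRA does it: the function name mask_read, the 
(start, end, side) hit format and the 'X' mask character are all assumptions. 
A hit with side "right" corresponds to type a), side "left" to type b), and a 
read carrying hits of both kinds to type c).

```python
def mask_read(read, hits, mask_char="X"):
    """Mask adaptor hits in a read (illustrative sketch, not MIRA code).

    hits is a list of (start, end, side) tuples, 0-based, end exclusive.
    side "right": mask the adaptor and everything to its right (type a).
    side "left":  mask the adaptor and everything to its left (type b).
    """
    keep_from, keep_to = 0, len(read)
    for start, end, side in hits:
        if not 0 <= start < end <= len(read):
            raise ValueError("adaptor hit must lie inside the read")
        if side == "right":
            # Everything from the adaptor start onwards is masked.
            keep_to = min(keep_to, start)
        elif side == "left":
            # Everything up to the adaptor end is masked.
            keep_from = max(keep_from, end)
        else:
            raise ValueError("side must be 'left' or 'right'")
    if keep_from >= keep_to:
        # Nothing usable survives between the hits.
        return mask_char * len(read)
    return (mask_char * keep_from
            + read[keep_from:keep_to]
            + mask_char * (len(read) - keep_to))


read = "AAAACCCCGGGGTTTT"
masked_a = mask_read(read, [(12, 16, "right")])                  # type a)
masked_b = mask_read(read, [(0, 4, "left")])                     # type b)
masked_c = mask_read(read, [(0, 4, "left"), (12, 16, "right")])  # type c)
```

With several hits on one read, the sketch keeps only the stretch between the 
innermost left-type and right-type hits, which is the conservative choice.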
