[mira_talk] Re: Alignment errors in Solexa mapping

From: Björn Nystedt <bjorn.nystedt@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Thu, 10 Dec 2009 14:10:14 +0100
Sorry , the last example was a poor represenetative, the errors are more often 
of this type:

> A similar error is quite common (mostly at the end of reads), at sites where 
> there are (homopolymer) frameshifts in the reference genome:

gggcttgctgcaatawtttttttcggtttatatcgtctagtaattcaaaatggcca (consensus)
gggcttgctgcaataa*ttttttcggtttatatcgtctagtaattcaaaatggcca (reference/backbone)
            ataatttttttcggtttatatcgtctagtaattcaaaa
              at*ttttttcggtttatatcgtctagtaattcaaaatgg
               t*ttttttcggtttatatcgtctagtaattcaaaatggc

> Again, MIRA overpredicts diversity, since all three reads are identical, and 
> there should be no a/t ambiguity (w) in the consensus in this case. 

B


On Thu, 10 Dec 2009 13:34:42 +0100
Björn Nystedt <bjorn.nystedt@xxxxxxxxx> wrote:

> Hi all, 
> having fun learning to work with Solexa reads :) 
> 
> I was mapping Solexa reads to a reference genome (= a single contig from an 
> earlier MIRA assembly) using MIRA V3rc4d (sorry Bastien, missed the strain 
> info again..)
> 
> mira --project=BAnh1IrMREF07 --job=mapping,genome,accurate,solexa 
> -OUT:ora=yes -GE:not=3 -AS:
> urd=yes -SB:lb=yes:bft=fasta:bbq=30 SOLEXA_SETTINGS -LR:ft=fastq:fqqo=64 
> -CO:msr=yes -GE:uti=
> no:tismin=200:tismax=400
> 
> Going through the alignments, I noticed some odd parts, such as this:
> 
> attggcgtgaagagtttgaatgagattgattgmmwgccagagaggccagaatatgttacaccgccgtcga 
> (consensus)
> attggcgtgaagagtttgaatgagattgattgcmagccagagaggccagaatatgttacaccgccgtcga 
> (reference/backbone)
> attggcgtgaagagtttgaatgagattgattgccag**agn
> attggcgtgaagagtttgaatgagattgattgccag**agn
> attggcgtgaagagtttgaatgagattgattgccag**agn
>  ttggcgtgaagagtttgaatgagattgattgccag**agan
>   tggcgtgaagagtttgaatgagattgattgccag**agagn
>    ggcgtgaagagtttgaatgagattgattgccag**agag*gn
>    ggcgtgaagagtttgaatgagattgattgccag**agag*gn
>    ggcgtgaagagtttgaatgagattgattgccag**agag*gn
>       gtgaagagtttgaatgagattgatt****gccagagaggccan
>       gtgaagagtttgaatgagattgatt****gccagagaggccan
>       gtgaagagtttgaatgagattgatt****gccagagaggccan
>       gtgaagagtttgaatgagattgatt****gccagagaggccan
>        tgaagagtttgaatgagattgatt****gccagagaggccagn
>        tgaagagtttgaatgagattgatt****gccagagaggccagn
>                            gattgattgccagagaggccagaatatgttacaccgccn
>                                 attgccagagaggccagaatatgttacaccgccgtcga
> 
> 
> (Note that these are only a few selected reads in the alignment; we have 
> ~120X at this site which is normal given the amount of data). As you see, 
> there are 3 predicted variants (all three are present in both strand 
> directions):
> 
> 1) ttgccag**agag*g (30 reads)
> 2) tt****gccagagag (70 reads)
> 3) ttgattgccagagag (20 reads)
> 
> This is an overprediction of the diversity, since the two top reads are 
> really identical.
> 
> Anyone seen this behaviour before? 
> Could it be caused by the wildcard charachers in the reference (although 
> similar things sometimes happens also when there are no wildcards)?
> Any ideas on strategies/parameters to avoid it? 
> 
> 
> A similar error is quite common (mostly at the end of reads), at sites where 
> there are (homopolymer) frameshifts in the reference genome:
> 
> aaagacgataaaaaatatcgcaatgaaaaataaaaaagawtttttttgttttatc(consensus)
> aaagacgataaaaaatatcgcaatgaaaaat*aaaaagaatttttttgttttatcac (reference/backbone)
>       nataaaaaatatcgcaatgaaaaataaaaaag*atttttt
>         naaaaaatatcgcaatgaaaaataaaaaagatttttttt
> 
> Again, MIRA overpredicts diversity, since these two reads are identical, and 
> there should be no a/t ambiguity (w) in the consensus in this case. 
> 
> Any help or comments are welcome!
> B
> 
> PS. In our case the problems below can likely be avoided by doing a de-novo 
> or de-novo-like assembly with all our current data, but we are working on a 
> range of strategies at the moment to learn about different possibilities, and 
> I thought the aboce behaviours were of general interest and worth discussing 
> on the list.
> 
> PS2. I am a bit confused regarding the initial n:s of the reads, not sure yet 
> why they appear...
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ====================================
> Björn Nystedt, PhD
> Molecular Evolution
> EBC, Uppsala University
> Norbyv. 18C, 752 36  Uppsala
> Sweden
> phone: +46 (0)18-471 45 88
> email: Bjorn.Nystedt@xxxxxxxxx
> ====================================
> 
> -- 
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html


-- 
====================================
Björn Nystedt, PhD
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36  Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
References:
- [mira_talk] Alignment errors in Solexa mapping
  - From: Björn Nystedt
[mira_talk] Re: Alignment errors in Solexa mapping

Other related posts: