[mira_talk] Alignment errors in Solexa mapping

  • From: Björn Nystedt <bjorn.nystedt@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 10 Dec 2009 13:34:42 +0100

Hi all, 
having fun learning to work with Solexa reads :) 

I was mapping Solexa reads to a reference genome (= a single contig from an 
earlier MIRA assembly) using MIRA V3rc4d (sorry Bastien, missed the strain info 
again..)

mira --project=BAnh1IrMREF07 --job=mapping,genome,accurate,solexa -OUT:ora=yes 
-GE:not=3 -AS:
urd=yes -SB:lb=yes:bft=fasta:bbq=30 SOLEXA_SETTINGS -LR:ft=fastq:fqqo=64 
-CO:msr=yes -GE:uti=
no:tismin=200:tismax=400

Going through the alignments, I noticed some odd parts, such as this:

attggcgtgaagagtttgaatgagattgattgmmwgccagagaggccagaatatgttacaccgccgtcga 
(consensus)
attggcgtgaagagtttgaatgagattgattgcmagccagagaggccagaatatgttacaccgccgtcga 
(reference/backbone)
attggcgtgaagagtttgaatgagattgattgccag**agn
attggcgtgaagagtttgaatgagattgattgccag**agn
attggcgtgaagagtttgaatgagattgattgccag**agn
 ttggcgtgaagagtttgaatgagattgattgccag**agan
  tggcgtgaagagtttgaatgagattgattgccag**agagn
   ggcgtgaagagtttgaatgagattgattgccag**agag*gn
   ggcgtgaagagtttgaatgagattgattgccag**agag*gn
   ggcgtgaagagtttgaatgagattgattgccag**agag*gn
      gtgaagagtttgaatgagattgatt****gccagagaggccan
      gtgaagagtttgaatgagattgatt****gccagagaggccan
      gtgaagagtttgaatgagattgatt****gccagagaggccan
      gtgaagagtttgaatgagattgatt****gccagagaggccan
       tgaagagtttgaatgagattgatt****gccagagaggccagn
       tgaagagtttgaatgagattgatt****gccagagaggccagn
                           gattgattgccagagaggccagaatatgttacaccgccn
                                attgccagagaggccagaatatgttacaccgccgtcga


(Note that these are only a few selected reads in the alignment; we have ~120X 
at this site which is normal given the amount of data). As you see, there are 3 
predicted variants (all three are present in both strand directions):

1) ttgccag**agag*g (30 reads)
2) tt****gccagagag (70 reads)
3) ttgattgccagagag (20 reads)

This is an overprediction of the diversity, since the two top reads are really 
identical.

Anyone seen this behaviour before? 
Could it be caused by the wildcard charachers in the reference (although 
similar things sometimes happens also when there are no wildcards)?
Any ideas on strategies/parameters to avoid it? 


A similar error is quite common (mostly at the end of reads), at sites where 
there are (homopolymer) frameshifts in the reference genome:

aaagacgataaaaaatatcgcaatgaaaaataaaaaagawtttttttgttttatc(consensus)
aaagacgataaaaaatatcgcaatgaaaaat*aaaaagaatttttttgttttatcac (reference/backbone)
      nataaaaaatatcgcaatgaaaaataaaaaag*atttttt
        naaaaaatatcgcaatgaaaaataaaaaagatttttttt

Again, MIRA overpredicts diversity, since these two reads are identical, and 
there should be no a/t ambiguity (w) in the consensus in this case. 

Any help or comments are welcome!
B

PS. In our case the problems below can likely be avoided by doing a de-novo or 
de-novo-like assembly with all our current data, but we are working on a range 
of strategies at the moment to learn about different possibilities, and I 
thought the aboce behaviours were of general interest and worth discussing on 
the list.

PS2. I am a bit confused regarding the initial n:s of the reads, not sure yet 
why they appear...












====================================
Björn Nystedt, PhD
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36  Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: