Sorry , the last example was a poor represenetative, the errors are more often of this type: > A similar error is quite common (mostly at the end of reads), at sites where > there are (homopolymer) frameshifts in the reference genome: gggcttgctgcaatawtttttttcggtttatatcgtctagtaattcaaaatggcca (consensus) gggcttgctgcaataa*ttttttcggtttatatcgtctagtaattcaaaatggcca (reference/backbone) ataatttttttcggtttatatcgtctagtaattcaaaa at*ttttttcggtttatatcgtctagtaattcaaaatgg t*ttttttcggtttatatcgtctagtaattcaaaatggc > Again, MIRA overpredicts diversity, since all three reads are identical, and > there should be no a/t ambiguity (w) in the consensus in this case. B On Thu, 10 Dec 2009 13:34:42 +0100 Björn Nystedt <bjorn.nystedt@xxxxxxxxx> wrote: > Hi all, > having fun learning to work with Solexa reads :) > > I was mapping Solexa reads to a reference genome (= a single contig from an > earlier MIRA assembly) using MIRA V3rc4d (sorry Bastien, missed the strain > info again..) > > mira --project=BAnh1IrMREF07 --job=mapping,genome,accurate,solexa > -OUT:ora=yes -GE:not=3 -AS: > urd=yes -SB:lb=yes:bft=fasta:bbq=30 SOLEXA_SETTINGS -LR:ft=fastq:fqqo=64 > -CO:msr=yes -GE:uti= > no:tismin=200:tismax=400 > > Going through the alignments, I noticed some odd parts, such as this: > > attggcgtgaagagtttgaatgagattgattgmmwgccagagaggccagaatatgttacaccgccgtcga > (consensus) > attggcgtgaagagtttgaatgagattgattgcmagccagagaggccagaatatgttacaccgccgtcga > (reference/backbone) > attggcgtgaagagtttgaatgagattgattgccag**agn > attggcgtgaagagtttgaatgagattgattgccag**agn > attggcgtgaagagtttgaatgagattgattgccag**agn > ttggcgtgaagagtttgaatgagattgattgccag**agan > tggcgtgaagagtttgaatgagattgattgccag**agagn > ggcgtgaagagtttgaatgagattgattgccag**agag*gn > ggcgtgaagagtttgaatgagattgattgccag**agag*gn > ggcgtgaagagtttgaatgagattgattgccag**agag*gn > gtgaagagtttgaatgagattgatt****gccagagaggccan > gtgaagagtttgaatgagattgatt****gccagagaggccan > gtgaagagtttgaatgagattgatt****gccagagaggccan > gtgaagagtttgaatgagattgatt****gccagagaggccan > tgaagagtttgaatgagattgatt****gccagagaggccagn > tgaagagtttgaatgagattgatt****gccagagaggccagn > gattgattgccagagaggccagaatatgttacaccgccn > attgccagagaggccagaatatgttacaccgccgtcga > > > (Note that these are only a few selected reads in the alignment; we have > ~120X at this site which is normal given the amount of data). As you see, > there are 3 predicted variants (all three are present in both strand > directions): > > 1) ttgccag**agag*g (30 reads) > 2) tt****gccagagag (70 reads) > 3) ttgattgccagagag (20 reads) > > This is an overprediction of the diversity, since the two top reads are > really identical. > > Anyone seen this behaviour before? > Could it be caused by the wildcard charachers in the reference (although > similar things sometimes happens also when there are no wildcards)? > Any ideas on strategies/parameters to avoid it? > > > A similar error is quite common (mostly at the end of reads), at sites where > there are (homopolymer) frameshifts in the reference genome: > > aaagacgataaaaaatatcgcaatgaaaaataaaaaagawtttttttgttttatc(consensus) > aaagacgataaaaaatatcgcaatgaaaaat*aaaaagaatttttttgttttatcac (reference/backbone) > nataaaaaatatcgcaatgaaaaataaaaaag*atttttt > naaaaaatatcgcaatgaaaaataaaaaagatttttttt > > Again, MIRA overpredicts diversity, since these two reads are identical, and > there should be no a/t ambiguity (w) in the consensus in this case. > > Any help or comments are welcome! > B > > PS. In our case the problems below can likely be avoided by doing a de-novo > or de-novo-like assembly with all our current data, but we are working on a > range of strategies at the moment to learn about different possibilities, and > I thought the aboce behaviours were of general interest and worth discussing > on the list. > > PS2. I am a bit confused regarding the initial n:s of the reads, not sure yet > why they appear... > > > > > > > > > > > > > ==================================== > Björn Nystedt, PhD > Molecular Evolution > EBC, Uppsala University > Norbyv. 18C, 752 36 Uppsala > Sweden > phone: +46 (0)18-471 45 88 > email: Bjorn.Nystedt@xxxxxxxxx > ==================================== > > -- > You have received this mail because you are subscribed to the mira_talk > mailing list. For information on how to subscribe or unsubscribe, please > visit http://www.chevreux.org/mira_mailinglists.html -- ==================================== Björn Nystedt, PhD Molecular Evolution EBC, Uppsala University Norbyv. 18C, 752 36 Uppsala Sweden phone: +46 (0)18-471 45 88 email: Bjorn.Nystedt@xxxxxxxxx ==================================== -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html