Hi all, having fun learning to work with Solexa reads :) I was mapping Solexa reads to a reference genome (= a single contig from an earlier MIRA assembly) using MIRA V3rc4d (sorry Bastien, missed the strain info again..) mira --project=BAnh1IrMREF07 --job=mapping,genome,accurate,solexa -OUT:ora=yes -GE:not=3 -AS: urd=yes -SB:lb=yes:bft=fasta:bbq=30 SOLEXA_SETTINGS -LR:ft=fastq:fqqo=64 -CO:msr=yes -GE:uti= no:tismin=200:tismax=400 Going through the alignments, I noticed some odd parts, such as this: attggcgtgaagagtttgaatgagattgattgmmwgccagagaggccagaatatgttacaccgccgtcga (consensus) attggcgtgaagagtttgaatgagattgattgcmagccagagaggccagaatatgttacaccgccgtcga (reference/backbone) attggcgtgaagagtttgaatgagattgattgccag**agn attggcgtgaagagtttgaatgagattgattgccag**agn attggcgtgaagagtttgaatgagattgattgccag**agn ttggcgtgaagagtttgaatgagattgattgccag**agan tggcgtgaagagtttgaatgagattgattgccag**agagn ggcgtgaagagtttgaatgagattgattgccag**agag*gn ggcgtgaagagtttgaatgagattgattgccag**agag*gn ggcgtgaagagtttgaatgagattgattgccag**agag*gn gtgaagagtttgaatgagattgatt****gccagagaggccan gtgaagagtttgaatgagattgatt****gccagagaggccan gtgaagagtttgaatgagattgatt****gccagagaggccan gtgaagagtttgaatgagattgatt****gccagagaggccan tgaagagtttgaatgagattgatt****gccagagaggccagn tgaagagtttgaatgagattgatt****gccagagaggccagn gattgattgccagagaggccagaatatgttacaccgccn attgccagagaggccagaatatgttacaccgccgtcga (Note that these are only a few selected reads in the alignment; we have ~120X at this site which is normal given the amount of data). As you see, there are 3 predicted variants (all three are present in both strand directions): 1) ttgccag**agag*g (30 reads) 2) tt****gccagagag (70 reads) 3) ttgattgccagagag (20 reads) This is an overprediction of the diversity, since the two top reads are really identical. Anyone seen this behaviour before? Could it be caused by the wildcard charachers in the reference (although similar things sometimes happens also when there are no wildcards)? Any ideas on strategies/parameters to avoid it? A similar error is quite common (mostly at the end of reads), at sites where there are (homopolymer) frameshifts in the reference genome: aaagacgataaaaaatatcgcaatgaaaaataaaaaagawtttttttgttttatc(consensus) aaagacgataaaaaatatcgcaatgaaaaat*aaaaagaatttttttgttttatcac (reference/backbone) nataaaaaatatcgcaatgaaaaataaaaaag*atttttt naaaaaatatcgcaatgaaaaataaaaaagatttttttt Again, MIRA overpredicts diversity, since these two reads are identical, and there should be no a/t ambiguity (w) in the consensus in this case. Any help or comments are welcome! B PS. In our case the problems below can likely be avoided by doing a de-novo or de-novo-like assembly with all our current data, but we are working on a range of strategies at the moment to learn about different possibilities, and I thought the aboce behaviours were of general interest and worth discussing on the list. PS2. I am a bit confused regarding the initial n:s of the reads, not sure yet why they appear... ==================================== Björn Nystedt, PhD Molecular Evolution EBC, Uppsala University Norbyv. 18C, 752 36 Uppsala Sweden phone: +46 (0)18-471 45 88 email: Bjorn.Nystedt@xxxxxxxxx ==================================== -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html