On Donnerstag 10 Dezember 2009 Björn Nystedt wrote: > [...] > 1) ttgccag**agag*g (30 reads) > 2) tt****gccagagag (70 reads) > 3) ttgattgccagagag (20 reads) > > This is an overprediction of the diversity, since the two top reads are > really identical. For this part there's not much you can do at the moment: mathematically both alignments are identical and get the same score. In de-novo assembly I have developed a few tricks to get gaped ares more or less aligned to themselves and still rspect normal Smith-Waterman rules, but for mapping assemblies this is currently not possible, sorry. > Could it be caused by the wildcard charachers in the reference (although > similar things sometimes happens also when there are no wildcards)? Any > ideas on strategies/parameters to avoid it? Wildcards may play a role (I'd have to check that myself), a bigger role is the fact that you seem to have two variants which have a short indel as difference. You will note that if the indel occurs towards the middle of reads, MIRA prefers to group it whereas towards ends of reads it doesn't. This too is due to Smith-Waterman formulas and some post-processing to tackle sequencing errors at end of reads. These rules backfire in exactly the case you have there. > A similar error is quite common (mostly at the end of reads), at sites > where there are (homopolymer) frameshifts in the reference genome: > gggcttgctgcaatawtttttttcggtttatatcgtctagtaattcaaaatggcca (consensus) > gggcttgctgcaataa*ttttttcggtttatatcgtctagtaattcaaaatggcca (reference) > ataatttttttcggtttatatcgtctagtaattcaaaa > at*ttttttcggtttatatcgtctagtaattcaaaatgg > t*ttttttcggtttatatcgtctagtaattcaaaatggc Same symptom, and here the IUPAC certainly plays a role: sequencing errors towards end of reads are always making the life of an alignment program difficult. Imagine you had this case: ...CATTTTTTC... CTTTTTTC... The above is what normal Smith-Waterman would give you, because the score is better than inserting a gap ...CATTTTTTC... C*TTTTTTC... However, the latter makes more sense from the biological point of view, and therefore MIRA takes that one. But as I wrote: it backfires in some other cases *sigh* > PS2. I am a bit confused regarding the initial n:s of the reads, not sure > yet why they appear... Do the 'n' appear in your assembly? They shouldn't they should be clipped. If not, I'd like to have a sample of your input data to find out what happened. The 'n' are an ugly trick: finishing programs like gap or consed do not know sequencing technologies. To still get this info across multiple assembly/finishing cycles, MIRA encodes this info into tags placed on the very first base of a read. It does this in base-tags because editors like consed do not know the concept of "notes" (i.e. comments appended to reads as such). Now, these tags can be a little bit disturbing when look at a project in an editor, so MIRA prepends a 'n' to every Solexa read and immediately hides it with a left clip of 1. Ugly, but works. As I said: if you see them, please contact me with some data to get that tackled. Regards, Bastien -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html