[mira_talk] Re: Alignment errors in Solexa mapping

  • From: Björn Nystedt <bjorn.nystedt@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Fri, 11 Dec 2009 09:11:07 +0100

Thanks heaps!
I guess the takehome message is to carefully edit each site before doing any 
automatic analysis of sequence variations. 
(It can be worth noting that in this case we are doing a genome project, and 
all data is supposed to be identical, i.e. the Solexa data and the reference is 
the exact same strain and DNA prep. Having said that, we also know it's true 
that clonal bacterial colonies are not possible, every colony includes 
variations, which might explain some of the odd stuff. Misassemblies in the 
reference is of course another.)

I am still somewhat confused about exactly how the second error can occur. The 
situation is that this is what you would expect
 ...CATTTTTTC...
     ATTTTTTC...
but this is what you get
  ...CATTTTTTC...
     A*TTTTTTC...
I can see the point of relaxing gap penalties at ends of reads, but to prefer a 
mismatch over a match seems strange to me. Note also that there was no IUPAC in 
the reference in this case! 


About the n:s. They are ok, I see them in consed, but they are masked allright 
so no problem. 


Best 
Björn





On Thu, 10 Dec 2009 19:01:45 +0100
Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

> On Donnerstag 10 Dezember 2009 Björn Nystedt wrote:
> > [...]
> > 1) ttgccag**agag*g (30 reads)
> > 2) tt****gccagagag (70 reads)
> > 3) ttgattgccagagag (20 reads)
> >
> > This is an overprediction of the diversity, since the two top reads are
> >  really identical.
> 
> For this part there's not much you can do at the moment: mathematically both 
> alignments are identical and get the same score. In de-novo assembly I have 
> developed a few tricks to get gaped ares more or less aligned to themselves 
> and still rspect normal Smith-Waterman rules, but for mapping assemblies this 
> is currently not possible, sorry.
>  
> > Could it be caused by the wildcard charachers in the reference (although
> >  similar things sometimes happens also when there are no wildcards)? Any
> >  ideas on strategies/parameters to avoid it?
> 
> Wildcards may play a role (I'd have to check that myself), a bigger role is 
> the fact that you seem to have two variants which have a short indel as 
> difference. You will note that if the indel occurs towards the middle of 
> reads, MIRA prefers to group it whereas towards ends of reads it doesn't. 
> This 
> too is due to Smith-Waterman formulas and some post-processing to tackle 
> sequencing errors at end of reads.
> 
> These rules backfire in exactly the case you have there.
> 
> > A similar error is quite common (mostly at the end of reads), at sites
> >  where there are (homopolymer) frameshifts in the reference genome:
> > gggcttgctgcaatawtttttttcggtttatatcgtctagtaattcaaaatggcca (consensus)
> > gggcttgctgcaataa*ttttttcggtttatatcgtctagtaattcaaaatggcca (reference)
> >             ataatttttttcggtttatatcgtctagtaattcaaaa
> >               at*ttttttcggtttatatcgtctagtaattcaaaatgg
> >                t*ttttttcggtttatatcgtctagtaattcaaaatggc
> 
> Same symptom, and here the IUPAC certainly plays a role: sequencing errors 
> towards end of reads are always making the life of an alignment program 
> difficult.
> 
> Imagine you had this case:
> 
> ...CATTTTTTC...
>     CTTTTTTC...
> 
> The above is what normal Smith-Waterman would give you, because the score is 
> better than inserting a gap
> 
> ...CATTTTTTC...
>    C*TTTTTTC...
> 
> However, the latter makes more sense from the biological point of view, and 
> therefore MIRA takes that one.
> 
> But as I wrote: it backfires in some other cases *sigh*
> 
> > PS2. I am a bit confused regarding the initial n:s of the reads, not sure
> >  yet why they appear...
> 
> Do the 'n' appear in your assembly? They shouldn't they should be clipped. If 
> not, I'd like to have a sample of your input data to find out what happened.
> 
> The 'n' are an ugly trick: finishing programs like gap or consed do not know 
> sequencing technologies. To still get this info across multiple 
> assembly/finishing cycles, MIRA encodes this info into tags placed on the 
> very 
> first base of a read. It does this in base-tags because editors like consed 
> do 
> not know the concept of "notes" (i.e. comments appended to reads as such).
> 
> Now, these tags can be a little bit disturbing when look at a project in an 
> editor, so MIRA prepends a 'n' to every Solexa read and immediately hides it 
> with a left clip of 1. Ugly, but works.
> 
> As I said: if you see them, please contact me with some data to get that 
> tackled.
> 
> Regards,
>   Bastien
> 
> -- 
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html


-- 
====================================
Björn Nystedt, PhD
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36  Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: