[mira_talk] Re: Alignment errors in Solexa mapping

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 10 Dec 2009 19:01:45 +0100

On Donnerstag 10 Dezember 2009 Björn Nystedt wrote:
> [...]
> 1) ttgccag**agag*g (30 reads)
> 2) tt****gccagagag (70 reads)
> 3) ttgattgccagagag (20 reads)
>
> This is an overprediction of the diversity, since the two top reads are
>  really identical.

For this part there's not much you can do at the moment: mathematically both 
alignments are identical and get the same score. In de-novo assembly I have 
developed a few tricks to get gaped ares more or less aligned to themselves 
and still rspect normal Smith-Waterman rules, but for mapping assemblies this 
is currently not possible, sorry.
 
> Could it be caused by the wildcard charachers in the reference (although
>  similar things sometimes happens also when there are no wildcards)? Any
>  ideas on strategies/parameters to avoid it?

Wildcards may play a role (I'd have to check that myself), a bigger role is 
the fact that you seem to have two variants which have a short indel as 
difference. You will note that if the indel occurs towards the middle of 
reads, MIRA prefers to group it whereas towards ends of reads it doesn't. This 
too is due to Smith-Waterman formulas and some post-processing to tackle 
sequencing errors at end of reads.

These rules backfire in exactly the case you have there.

> A similar error is quite common (mostly at the end of reads), at sites
>  where there are (homopolymer) frameshifts in the reference genome:
> gggcttgctgcaatawtttttttcggtttatatcgtctagtaattcaaaatggcca (consensus)
> gggcttgctgcaataa*ttttttcggtttatatcgtctagtaattcaaaatggcca (reference)
>             ataatttttttcggtttatatcgtctagtaattcaaaa
>               at*ttttttcggtttatatcgtctagtaattcaaaatgg
>                t*ttttttcggtttatatcgtctagtaattcaaaatggc

Same symptom, and here the IUPAC certainly plays a role: sequencing errors 
towards end of reads are always making the life of an alignment program 
difficult.

Imagine you had this case:

...CATTTTTTC...
    CTTTTTTC...

The above is what normal Smith-Waterman would give you, because the score is 
better than inserting a gap

...CATTTTTTC...
   C*TTTTTTC...

However, the latter makes more sense from the biological point of view, and 
therefore MIRA takes that one.

But as I wrote: it backfires in some other cases *sigh*

> PS2. I am a bit confused regarding the initial n:s of the reads, not sure
>  yet why they appear...

Do the 'n' appear in your assembly? They shouldn't they should be clipped. If 
not, I'd like to have a sample of your input data to find out what happened.

The 'n' are an ugly trick: finishing programs like gap or consed do not know 
sequencing technologies. To still get this info across multiple 
assembly/finishing cycles, MIRA encodes this info into tags placed on the very 
first base of a read. It does this in base-tags because editors like consed do 
not know the concept of "notes" (i.e. comments appended to reads as such).

Now, these tags can be a little bit disturbing when look at a project in an 
editor, so MIRA prepends a 'n' to every Solexa read and immediately hides it 
with a left clip of 1. Ugly, but works.

As I said: if you see them, please contact me with some data to get that 
tackled.

Regards,
  Bastien

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: