Thanks heaps! I guess the takehome message is to carefully edit each site before doing any automatic analysis of sequence variations. (It can be worth noting that in this case we are doing a genome project, and all data is supposed to be identical, i.e. the Solexa data and the reference is the exact same strain and DNA prep. Having said that, we also know it's true that clonal bacterial colonies are not possible, every colony includes variations, which might explain some of the odd stuff. Misassemblies in the reference is of course another.) I am still somewhat confused about exactly how the second error can occur. The situation is that this is what you would expect ...CATTTTTTC... ATTTTTTC... but this is what you get ...CATTTTTTC... A*TTTTTTC... I can see the point of relaxing gap penalties at ends of reads, but to prefer a mismatch over a match seems strange to me. Note also that there was no IUPAC in the reference in this case! About the n:s. They are ok, I see them in consed, but they are masked allright so no problem. Best Björn On Thu, 10 Dec 2009 19:01:45 +0100 Bastien Chevreux <bach@xxxxxxxxxxxx> wrote: > On Donnerstag 10 Dezember 2009 Björn Nystedt wrote: > > [...] > > 1) ttgccag**agag*g (30 reads) > > 2) tt****gccagagag (70 reads) > > 3) ttgattgccagagag (20 reads) > > > > This is an overprediction of the diversity, since the two top reads are > > really identical. > > For this part there's not much you can do at the moment: mathematically both > alignments are identical and get the same score. In de-novo assembly I have > developed a few tricks to get gaped ares more or less aligned to themselves > and still rspect normal Smith-Waterman rules, but for mapping assemblies this > is currently not possible, sorry. > > > Could it be caused by the wildcard charachers in the reference (although > > similar things sometimes happens also when there are no wildcards)? Any > > ideas on strategies/parameters to avoid it? > > Wildcards may play a role (I'd have to check that myself), a bigger role is > the fact that you seem to have two variants which have a short indel as > difference. You will note that if the indel occurs towards the middle of > reads, MIRA prefers to group it whereas towards ends of reads it doesn't. > This > too is due to Smith-Waterman formulas and some post-processing to tackle > sequencing errors at end of reads. > > These rules backfire in exactly the case you have there. > > > A similar error is quite common (mostly at the end of reads), at sites > > where there are (homopolymer) frameshifts in the reference genome: > > gggcttgctgcaatawtttttttcggtttatatcgtctagtaattcaaaatggcca (consensus) > > gggcttgctgcaataa*ttttttcggtttatatcgtctagtaattcaaaatggcca (reference) > > ataatttttttcggtttatatcgtctagtaattcaaaa > > at*ttttttcggtttatatcgtctagtaattcaaaatgg > > t*ttttttcggtttatatcgtctagtaattcaaaatggc > > Same symptom, and here the IUPAC certainly plays a role: sequencing errors > towards end of reads are always making the life of an alignment program > difficult. > > Imagine you had this case: > > ...CATTTTTTC... > CTTTTTTC... > > The above is what normal Smith-Waterman would give you, because the score is > better than inserting a gap > > ...CATTTTTTC... > C*TTTTTTC... > > However, the latter makes more sense from the biological point of view, and > therefore MIRA takes that one. > > But as I wrote: it backfires in some other cases *sigh* > > > PS2. I am a bit confused regarding the initial n:s of the reads, not sure > > yet why they appear... > > Do the 'n' appear in your assembly? They shouldn't they should be clipped. If > not, I'd like to have a sample of your input data to find out what happened. > > The 'n' are an ugly trick: finishing programs like gap or consed do not know > sequencing technologies. To still get this info across multiple > assembly/finishing cycles, MIRA encodes this info into tags placed on the > very > first base of a read. It does this in base-tags because editors like consed > do > not know the concept of "notes" (i.e. comments appended to reads as such). > > Now, these tags can be a little bit disturbing when look at a project in an > editor, so MIRA prepends a 'n' to every Solexa read and immediately hides it > with a left clip of 1. Ugly, but works. > > As I said: if you see them, please contact me with some data to get that > tackled. > > Regards, > Bastien > > -- > You have received this mail because you are subscribed to the mira_talk > mailing list. For information on how to subscribe or unsubscribe, please > visit http://www.chevreux.org/mira_mailinglists.html -- ==================================== Björn Nystedt, PhD Molecular Evolution EBC, Uppsala University Norbyv. 18C, 752 36 Uppsala Sweden phone: +46 (0)18-471 45 88 email: Bjorn.Nystedt@xxxxxxxxx ==================================== -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html