On Wednesday 16 February 2011 17:42:21 Michele Vidotto wrote: > In a recent article of transcriptomics, where the assembly was > performed with MIRA 3 > ("http://www.biomedcentral.com/1471-2164/11/635";), the authors say: > > "Due to the heuristic nature of the assembly process and > previous reports of redundancy (different contig > Belonging to the transcript sequences Same region) in sets of > transcriptome contigs assembled with different > methods [17], a second run of assembly Was Conducted > using the previously Obtained contigs and singlets as > input. " I'm not sure I totally agree with "the heuristic nature" that paragraph reads. Oh well. > So inspired by this statement, I made a script to iterate successive > assemblies using MIRA, in which, at each cycle, I use as input > sequences, the contig + singlets produced in the previous cycle. The following statements are just oppinions of mine, not facts. However ... I think it's a bad idea[tm]. In theory, MIRA will have assembled everything it can (if not, it's a bug and not a feature), except perhaps in those cases where trancripts with a very deep coverage (thousands, not hundreds), are assembled. But this will happen only for a few transcripts, the vast majority will not be affected. Now on to why I think this kind of iterational assembly is a bad idea: you lose accuracy ... a lot. Consider a case where you perhaps have a ploidy variants with one column having 50 T's and 50 A's and one C (the A and T being real, let's a ssume the C is a sequencing error). Like this: ......A....... ......A....... ......A....... ......A....... etc. ......T....... ......T....... ......C....... ......T....... ......T....... etc. MIRA will put that in the end into three different contigs. Well, two contigs and a singlet (respectively debris). Now, if you assemble in a second round those two contigs and the singlet, you have a big problem: you have 1 sequence with T, another with A, another with C. Even when assuming you took care to have the qualities present, the C will still be a big contender in the question "what should be the consensus?" But even if take the C out of the equation, there's still T and A. So which one is it? MIRA will, as default, tell you it cannot really take a good decision and put in a IUPAC code. Most probably a "W" (A or T) or perhaps a "H" (A or C or T). Now comes another of my pet peeves: many people don't like IUPACs. Be it because their post-processing programs cannot handle them, because it looks untidy or unfinished or simply because they want to have "certainty". So they set the options in MIRA to force it to take a decision ... which in the case portrayed above is, well, not much better than rolling a dice to determine whether it's A or T. So let's just assume MIRA chose a "T", perhaps because the quality for the T consensus was a tad higher than for "A". Therefore, your "transcript" for the cell has a "T". It looks like this: ......T....... Congratulations, well done. Really? Assume that the initial context looked like this: ......AAA....... ......AAA....... ......AAA....... ......AAA....... etc. ......TAA....... ......TAA....... ......CAA....... ......TAA....... ......TAA....... etc. You probably see me coming, but for those who don't, here's the story: the ploidy in the cell makes it that in this example, two forms of a protein transcript exist. One having a stop codon (TAA) at a certain point, one not. You just chose the variant with the stop codont, and if you later find out that this protein is then 15 amino acids long instead of 799 aa like similar proteins in the databases you have, you will immediately say: that organism has a deficiency, "protein X" is broken because of an early termination in the gene! I made a discovery! This new strain of the well known organism produces chocolate even though the well known only chocolate synthesising gene is broken! This proves that there must be a second gene perfoming the chocolate synthesis! Where's my Nobel Prize?! Oh, first I make an announcement and then invest a couple of months to find that other chocolate synthesising gene ... will be easy. 2 years later and after a couple of unsuccessfull attempts to find the new chocolate synthesis gene, you go back to the initial data set and discover that your organism there's a second transcript of the well known protein which makes a perfectly valid chocolate synthesis protein. Bummer, you just lost two years of your life time and a lot of credibility because you wanted to be certain and then jumped the gun. > For this reason I would find a combination of settings in MIRA that > let me get in the output all sequences as contigs or singlets, but not > as debris. Even if some reads land in the debris, you can still use them by creating a fasta file containing the original reads of the debris and append that to the MIRA result FASTA containing the contigs and singlets. Pretty trivial to do. Actually, the "fastaselect.tcl" script in the distribution will help you there. > I take this opportunity also to ask whether such approach can really > improve the assembly of a transcriptome or whether, in your opinion, > it did nothing but introduce errors. Hmmm, I think I made my point above. Sorry if it sounded a bit like a rant ... I just felt a bit like it. It was certainly not targeted at you :-) B.