I apologize for the late reply, however I wanted to thank you for your thorough discussion that contains lots of useful considerations! 2011/2/16 Bastien Chevreux <bach@xxxxxxxxxxxx>: > On Wednesday 16 February 2011 17:42:21 Michele Vidotto wrote: > >> In a recent article of transcriptomics, where the assembly was > >> performed with MIRA 3 > >> ("http://www.biomedcentral.com/1471-2164/11/635";), the authors say: > >> > >> "Due to the heuristic nature of the assembly process and > >> previous reports of redundancy (different contig > >> Belonging to the transcript sequences Same region) in sets of > >> transcriptome contigs assembled with different > >> methods [17], a second run of assembly Was Conducted > >> using the previously Obtained contigs and singlets as > >> input. " > > I'm not sure I totally agree with "the heuristic nature" that paragraph > reads. Oh well. > >> So inspired by this statement, I made a script to iterate successive > >> assemblies using MIRA, in which, at each cycle, I use as input > >> sequences, the contig + singlets produced in the previous cycle. > > The following statements are just oppinions of mine, not facts. However ... > I think it's a bad idea[tm]. In theory, MIRA will have assembled everything > it can (if not, it's a bug and not a feature), except perhaps in those cases > where trancripts with a very deep coverage (thousands, not hundreds), are > assembled. But this will happen only for a few transcripts, the vast > majority will not be affected. > > Now on to why I think this kind of iterational assembly is a bad idea: you > lose accuracy ... a lot. Consider a case where you perhaps have a ploidy > variants with one column having 50 T's and 50 A's and one C (the A and T > being real, let's a ssume the C is a sequencing error). Like this: > > ......A....... > > ......A....... > > ......A....... > > ......A....... > > etc. > > ......T....... > > ......T....... > > ......C....... > > ......T....... > > ......T....... > > etc. > > MIRA will put that in the end into three different contigs. Well, two > contigs and a singlet (respectively debris). > > Now, if you assemble in a second round those two contigs and the singlet, > you have a big problem: you have 1 sequence with T, another with A, another > with C. Even when assuming you took care to have the qualities present, the > C will still be a big contender in the question "what should be the > consensus?" But even if take the C out of the equation, there's still T and > A. So which one is it? MIRA will, as default, tell you it cannot really take > a good decision and put in a IUPAC code. Most probably a "W" (A or T) or > perhaps a "H" (A or C or T). > > Now comes another of my pet peeves: many people don't like IUPACs. Be it > because their post-processing programs cannot handle them, because it looks > untidy or unfinished or simply because they want to have "certainty". So > they set the options in MIRA to force it to take a decision ... which in the > case portrayed above is, well, not much better than rolling a dice to > determine whether it's A or T. > > So let's just assume MIRA chose a "T", perhaps because the quality for the T > consensus was a tad higher than for "A". Therefore, your "transcript" for > the cell has a "T". It looks like this: > > ......T....... > > Congratulations, well done. > > Really? > > Assume that the initial context looked like this: > > ......AAA....... > > ......AAA....... > > ......AAA....... > > ......AAA....... > > etc. > > ......TAA....... > > ......TAA....... > > ......CAA....... > > ......TAA....... > > ......TAA....... > > etc. > > You probably see me coming, but for those who don't, here's the story: the > ploidy in the cell makes it that in this example, two forms of a protein > transcript exist. One having a stop codon (TAA) at a certain point, one not. > You just chose the variant with the stop codont, and if you later find out > that this protein is then 15 amino acids long instead of 799 aa like similar > proteins in the databases you have, you will immediately say: that organism > has a deficiency, "protein X" is broken because of an early termination in > the gene! I made a discovery! This new strain of the well known organism > produces chocolate even though the well known only chocolate synthesising > gene is broken! This proves that there must be a second gene perfoming the > chocolate synthesis! Where's my Nobel Prize?! Oh, first I make an > announcement and then invest a couple of months to find that other chocolate > synthesising gene ... will be easy. > > 2 years later and after a couple of unsuccessfull attempts to find the new > chocolate synthesis gene, you go back to the initial data set and discover > that your organism there's a second transcript of the well known protein > which makes a perfectly valid chocolate synthesis protein. > > Bummer, you just lost two years of your life time and a lot of credibility > because you wanted to be certain and then jumped the gun. > >> For this reason I would find a combination of settings in MIRA that > >> let me get in the output all sequences as contigs or singlets, but not > >> as debris. > > Even if some reads land in the debris, you can still use them by creating a > fasta file containing the original reads of the debris and append that to > the MIRA result FASTA containing the contigs and singlets. Pretty trivial to > do. Actually, the "fastaselect.tcl" script in the distribution will help you > there. > >> I take this opportunity also to ask whether such approach can really > >> improve the assembly of a transcriptome or whether, in your opinion, > >> it did nothing but introduce errors. > > Hmmm, I think I made my point above. Sorry if it sounded a bit like a rant > ... I just felt a bit like it. It was certainly not targeted at you :-) > > B. -- Michele Vidotto (Ph.D. Student) Department of Biology Universita` degli Studi di Padova Via Ugo Bassi 58/B, 35131, Padova, Italy Phone: +39 049 827 6204 Fax: +39 049 827 6209 mailto: michele.vidotto@xxxxxxxxxxxxxxxxx -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html