[mira_talk] Re: force MIRA to produce only singletons and no debris

From: Michele Vidotto <michele.vidotto@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Fri, 25 Feb 2011 14:55:30 +0100
I apologize for the late reply, however I wanted to thank you for your
thorough discussion that contains lots of useful considerations!

2011/2/16 Bastien Chevreux <bach@xxxxxxxxxxxx>:
> On Wednesday 16 February 2011 17:42:21 Michele Vidotto wrote:
>
>> In a recent article of transcriptomics, where the assembly was
>
>> performed with MIRA 3
>
>> ("http://www.biomedcentral.com/1471-2164/11/635";), the authors say:
>
>>
>
>> "Due to the heuristic nature of the assembly process and
>
>> previous reports of redundancy (different contig
>
>> Belonging to the transcript sequences Same region) in sets of
>
>> transcriptome contigs assembled with different
>
>> methods [17], a second run of assembly Was Conducted
>
>> using the previously Obtained contigs and singlets as
>
>> input. "
>
> I'm not sure I totally agree with "the heuristic nature" that paragraph
> reads. Oh well.
>
>> So inspired by this statement, I made a script to iterate successive
>
>> assemblies using MIRA, in which, at each cycle, I use as input
>
>> sequences, the contig + singlets produced in the previous cycle.
>
> The following statements are just oppinions of mine, not facts. However ...
> I think it's a bad idea[tm]. In theory, MIRA will have assembled everything
> it can (if not, it's a bug and not a feature), except perhaps in those cases
> where trancripts with a very deep coverage (thousands, not hundreds), are
> assembled. But this will happen only for a few transcripts, the vast
> majority will not be affected.
>
> Now on to why I think this kind of iterational assembly is a bad idea: you
> lose accuracy ... a lot. Consider a case where you perhaps have a ploidy
> variants with one column having 50 T's and 50 A's and one C (the A and T
> being real, let's a ssume the C is a sequencing error). Like this:
>
> ......A.......
>
> ......A.......
>
> ......A.......
>
> ......A.......
>
> etc.
>
> ......T.......
>
> ......T.......
>
> ......C.......
>
> ......T.......
>
> ......T.......
>
> etc.
>
> MIRA will put that in the end into three different contigs. Well, two
> contigs and a singlet (respectively debris).
>
> Now, if you assemble in a second round those two contigs and the singlet,
> you have a big problem: you have 1 sequence with T, another with A, another
> with C. Even when assuming you took care to have the qualities present, the
> C will still be a big contender in the question "what should be the
> consensus?" But even if take the C out of the equation, there's still T and
> A. So which one is it? MIRA will, as default, tell you it cannot really take
> a good decision and put in a IUPAC code. Most probably a "W" (A or T) or
> perhaps a "H" (A or C or T).
>
> Now comes another of my pet peeves: many people don't like IUPACs. Be it
> because their post-processing programs cannot handle them, because it looks
> untidy or unfinished or simply because they want to have "certainty". So
> they set the options in MIRA to force it to take a decision ... which in the
> case portrayed above is, well, not much better than rolling a dice to
> determine whether it's A or T.
>
> So let's just assume MIRA chose a "T", perhaps because the quality for the T
> consensus was a tad higher than for "A". Therefore, your "transcript" for
> the cell has a "T". It looks like this:
>
> ......T.......
>
> Congratulations, well done.
>
> Really?
>
> Assume that the initial context looked like this:
>
> ......AAA.......
>
> ......AAA.......
>
> ......AAA.......
>
> ......AAA.......
>
> etc.
>
> ......TAA.......
>
> ......TAA.......
>
> ......CAA.......
>
> ......TAA.......
>
> ......TAA.......
>
> etc.
>
> You probably see me coming, but for those who don't, here's the story: the
> ploidy in the cell makes it that in this example, two forms of a protein
> transcript exist. One having a stop codon (TAA) at a certain point, one not.
> You just chose the variant with the stop codont, and if you later find out
> that this protein is then 15 amino acids long instead of 799 aa like similar
> proteins in the databases you have, you will immediately say: that organism
> has a deficiency, "protein X" is broken because of an early termination in
> the gene! I made a discovery! This new strain of the well known organism
> produces chocolate even though the well known only chocolate synthesising
> gene is broken! This proves that there must be a second gene perfoming the
> chocolate synthesis! Where's my Nobel Prize?! Oh, first I make an
> announcement and then invest a couple of months to find that other chocolate
> synthesising gene ... will be easy.
>
> 2 years later and after a couple of unsuccessfull attempts to find the new
> chocolate synthesis gene, you go back to the initial data set and discover
> that your organism there's a second transcript of the well known protein
> which makes a perfectly valid chocolate synthesis protein.
>
> Bummer, you just lost two years of your life time and a lot of credibility
> because you wanted to be certain and then jumped the gun.
>
>> For this reason I would find a combination of settings in MIRA that
>
>> let me get in the output all sequences as contigs or singlets, but not
>
>> as debris.
>
> Even if some reads land in the debris, you can still use them by creating a
> fasta file containing the original reads of the debris and append that to
> the MIRA result FASTA containing the contigs and singlets. Pretty trivial to
> do. Actually, the "fastaselect.tcl" script in the distribution will help you
> there.
>
>> I take this opportunity also to ask whether such approach can really
>
>> improve the assembly of a transcriptome or whether, in your opinion,
>
>> it did nothing but introduce errors.
>
> Hmmm, I think I made my point above. Sorry if it sounded a bit like a rant
> ... I just felt a bit like it. It was certainly not targeted at you :-)
>
> B.



-- 
Michele Vidotto
(Ph.D. Student)
Department of Biology
Universita` degli Studi di Padova
Via Ugo Bassi 58/B,
35131, Padova, Italy
Phone: +39 049 827 6204
Fax: +39 049 827 6209
mailto: michele.vidotto@xxxxxxxxxxxxxxxxx

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
References:
- [mira_talk] force MIRA to produce only singletons and no debris
  - From: Michele Vidotto
- [mira_talk] Re: force MIRA to produce only singletons and no debris
  - From: Bastien Chevreux
- [mira_talk] Re: force MIRA to produce only singletons and no debris
  - From: Michele Vidotto
- [mira_talk] Re: force MIRA to produce only singletons and no debris
  - From: Bastien Chevreux
[mira_talk] Re: force MIRA to produce only singletons and no debris

Other related posts: