[mira_talk] Re: force MIRA to produce only singletons and no debris

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 16 Feb 2011 21:19:18 +0100

On Wednesday 16 February 2011 17:42:21 Michele Vidotto wrote:
> In a recent article of transcriptomics, where the assembly was
> performed with MIRA 3
> ("http://www.biomedcentral.com/1471-2164/11/635";), the authors say:
> 
> "Due to the heuristic nature of the assembly process and
> previous reports of redundancy (different contig
> Belonging to the transcript sequences Same region) in sets of
> transcriptome contigs assembled with different
> methods [17], a second run of assembly Was Conducted
> using the previously Obtained contigs and singlets as
> input. "

I'm not sure I totally agree with "the heuristic nature" that paragraph reads. 
Oh well.

> So inspired by this statement, I made a script to iterate successive
> assemblies using MIRA, in which, at each cycle, I use as input
> sequences, the contig + singlets produced in the previous cycle.

The following statements are just oppinions of mine, not facts. However ... I 
think it's a bad idea[tm]. In theory, MIRA will have assembled everything it 
can (if not, it's a bug and not a feature), except perhaps in those cases 
where trancripts with a very deep coverage (thousands, not hundreds), are 
assembled. But this will happen only for a few transcripts, the vast majority 
will not be affected.

Now on to why I think this kind of iterational assembly is a bad idea: you 
lose accuracy ... a lot. Consider a case where you perhaps have a ploidy 
variants with one column having 50 T's and 50 A's and one C (the A and T being 
real, let's a ssume the C is a sequencing error). Like this:

......A.......
......A.......
......A.......
......A.......
etc.
......T.......
......T.......
......C.......
......T.......
......T.......
etc.

MIRA will put that in the end into three different contigs. Well, two contigs 
and a singlet (respectively debris).

Now, if you assemble in a second round those two contigs and the singlet, you 
have a big problem: you have 1 sequence with T, another with A, another with 
C. Even when assuming you took care to have the qualities present, the C will 
still be a big contender in the question "what should be the consensus?" But 
even if take the C out of the equation, there's still T and A. So which one is 
it? MIRA will, as default, tell you it cannot really take a good decision and 
put in a IUPAC code. Most probably a "W" (A or T) or perhaps a "H" (A or C or 
T).

Now comes another of my pet peeves: many people don't like IUPACs. Be it 
because their post-processing programs cannot handle them, because it looks 
untidy or unfinished or simply because they want to have "certainty". So they 
set the options in MIRA to force it to take a decision ... which in the case 
portrayed above is, well, not much better than rolling a dice to determine 
whether it's A or T.

So let's just assume MIRA chose a "T", perhaps because the quality for the T 
consensus was a tad higher than for "A". Therefore, your "transcript" for the 
cell has a "T". It looks like this:

......T.......

Congratulations, well done.

Really?

Assume that the initial context looked like this:

......AAA.......
......AAA.......
......AAA.......
......AAA.......
etc.
......TAA.......
......TAA.......
......CAA.......
......TAA.......
......TAA.......
etc.

You probably see me coming, but for those who don't, here's the story: the 
ploidy in the cell makes it that in this example, two forms of a protein 
transcript exist. One having a stop codon (TAA) at a certain point, one not. 
You just chose the variant with the stop codont, and if you later find out that 
this protein is then 15 amino acids long instead of 799 aa like similar 
proteins in the databases you have, you will immediately say: that organism 
has a deficiency, "protein X" is broken because of an early termination in the 
gene! I made a discovery! This new strain of the well known organism  produces 
chocolate even though the well known only chocolate synthesising gene is 
broken! This proves that there must be a second gene perfoming the chocolate 
synthesis! Where's my Nobel Prize?! Oh, first I make an announcement and then 
invest a couple of months to find that other chocolate synthesising gene ... 
will be easy.

2 years later and after a couple of unsuccessfull attempts to find the new 
chocolate synthesis gene, you go back to the initial data set and discover 
that your organism there's a second transcript of the well known protein which 
makes a perfectly valid chocolate synthesis protein.

Bummer, you just lost two years of your life time and a lot of credibility 
because you wanted to be certain and then jumped the gun.

> For this reason I would find a combination of settings in MIRA that
> let me get in the output all sequences as contigs or singlets, but not
> as debris.

Even if some reads land in the debris, you can still use them by creating a 
fasta file containing the original reads of the debris and append that to the 
MIRA result FASTA containing the contigs and singlets. Pretty trivial to do. 
Actually, the "fastaselect.tcl" script in the distribution will help you 
there.

> I take this opportunity also to ask whether such approach can really
> improve the assembly of a transcriptome or whether, in your opinion,
> it did nothing but introduce errors.

Hmmm, I think I made my point above. Sorry if it sounded a bit like a rant ... 
I just felt a bit like it. It was certainly not targeted at you :-)

B.

Other related posts: