[mira_talk] Re: RE Call for testing: MIRA 3.4rc2

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sat, 30 Jul 2011 22:27:44 +0200

On Friday 29 July 2011 12:39:00 Jorge.DUARTE@xxxxxxxxxxxx wrote:
> I've tested mira_3.4.rc3 and compared its results with mira_3.2.1.7_dev
> with the same options (see below) and the same dataset (normalized RNA on
> 8 genotypes of a plant)
> [...]
> These numbers seem quite high considering that I used the same options and
> the same dataset, and that the only change was the MIRA version...

Hello Jorge,

actually, the changes to the assembly engine after 3.2.1.7 were pretty 
substantial. Most of them were driven by my own RNASeq assemblies with 100bp 
Solexa reads, where I encountered a number of sequencing artefacts, both 
specific and non-specific to Illumina sequencing, so several routines have 
been rewritten and/or expanded to handle these artefacts correctly.

For example: one sometimes finds "genomic" reads in the data sets, which many 
people attribute to contamination with genomic sequence. While I agree that 
this is possible and sometimes likely, a couple of observations I made lead me 
to think that it's sometimes not contamination, but simply non-pre-processed, 
non-spliced mRNA that got caught by NGS. So I put some effort into keeping 
those reads from being assembled into 'real' mRNA contigs.

It is therefore not too surprising to me that some key numbers in the 
assemblies changed, though I cannot judge whether the numbers you see are "too 
large" or "OK".

> My first guess was that both versions give the same core contigs, and that
> they differ on shorter and less covered contigs, but on the contrary, when
> looking at it in more detail, the "specific contigs" from both
> assemblies were long and highly covered contigs (>1.2kb, >60 reads on
> average per contig),
> compared to the complete original data sets (850bp, 30 reads per contig
> on average).
> 
> Can you comment on this ? And maybe point me to other methods/metrics to
> look at in order to compare both assemblies ?

What I've seen other people often do (in papers & posters) is simply BLAST 
the contigs against the protein NR database and count the number of contigs 
which have "good" hits. I'm not sure whether this is the most accurate 
measurement of assembly quality, but it's not a bad one per se, and as I have 
no better alternative I'd also recommend doing that.
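
If it helps, here's a minimal sketch of that kind of counting in Python. 
The file names, the E-value cutoff and the use of blastx with tabular 
output (-outfmt 6) are only assumptions for illustration; adjust them to 
your own setup and local databases.

  #!/usr/bin/env python
  # Sketch: BLAST contigs against NR and count contigs with a "good" hit.
  # Assumes BLAST+ (blastx) is installed and a local 'nr' protein database
  # is available; file names and cutoffs below are placeholders.
  import subprocess

  def count_contigs_with_hits(contig_fasta, out_tsv, evalue="1e-10"):
      subprocess.check_call([
          "blastx",
          "-query", contig_fasta,
          "-db", "nr",
          "-evalue", evalue,
          "-outfmt", "6",          # tabular: qseqid sseqid pident ... bitscore
          "-max_target_seqs", "1",
          "-num_threads", "4",
          "-out", out_tsv,
      ])
      hits = set()
      with open(out_tsv) as fh:
          for line in fh:
              if line.strip():
                  hits.add(line.split("\t")[0])   # qseqid = contig name
      return len(hits)

  if __name__ == "__main__":
      for fasta in ("assembly_3.2.1.7.fasta", "assembly_3.4rc3.fasta"):
          n = count_contigs_with_hits(fasta, fasta + ".blast.tsv")
          print(fasta, "contigs with NR hit:", n)

Comparing the two counts (and perhaps the overlap of hit accessions) gives at 
least a rough, reference-based view of how the assemblies differ.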

On the other hand, it might be interesting to find out what happened to the 
reads of contigs which are not present in the "other" assembly. What I'd do 
in your place is pick a couple of contigs which are "good" in your eyes 
(long, with good coverage) and trace what happened to their reads in the 
other assembly. Of course, this is just "poking around" a bit and certainly 
not comprehensive, but it helps to get a feeling for whether to trust or 
distrust one assembly or the other.
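
A small sketch of how one could do that tracing, assuming both runs kept 
MIRA's info output so that a *_info_contigreadlist.txt file (lines of 
"contig read") exists for each assembly; the file names and the contig name 
below are placeholders for whatever you actually have on disk.

  #!/usr/bin/env python
  # Sketch: find where the reads of one "good" contig from assembly A ended
  # up in assembly B. Assumes both runs produced a *_info_contigreadlist.txt
  # file with "contig<whitespace>read" lines; adjust the paths.
  from collections import Counter

  def read_to_contig(contigreadlist):
      mapping = {}
      with open(contigreadlist) as fh:
          for line in fh:
              parts = line.split()
              if len(parts) >= 2:
                  mapping[parts[1]] = parts[0]   # read -> contig
      return mapping

  def contig_to_reads(contigreadlist, wanted_contig):
      reads = []
      with open(contigreadlist) as fh:
          for line in fh:
              parts = line.split()
              if len(parts) >= 2 and parts[0] == wanted_contig:
                  reads.append(parts[1])
      return reads

  if __name__ == "__main__":
      reads_a = contig_to_reads("run_3.4rc3_info_contigreadlist.txt",
                                "contig_of_interest")
      lookup_b = read_to_contig("run_3.2.1.7_info_contigreadlist.txt")
      where = Counter(lookup_b.get(r, "<not assembled>") for r in reads_a)
      for contig_b, n in where.most_common():
          print(contig_b, n)

If the reads of a long, well-covered contig end up scattered over many small 
contigs (or not assembled at all) in the other run, that's usually a good 
starting point for deciding which assembly to trust.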

I'd be happy to hear what you find out there.


Best,
  Bastien
