On Friday 29 July 2011 12:39:00 Jorge.DUARTE@xxxxxxxxxxxx wrote:
> I've tested mira_3.4.rc3 and compared its results with mira_3.2.1.7_dev
> with the same options (see below) and on the same dataset (normalised RNA
> from 8 genotypes of a plant)
> [...]
> These numbers seem quite high considering that I used the same options and
> the same dataset, and that the only change was the mira version...

Hello Jorge,

actually, the changes to the assembly engine after 3.2.1.7 were quite substantial. Most of them were driven by my doing RNASeq assemblies with 100bp Solexas, where I encountered a number of sequencing artefacts, both specific and unspecific to Illumina sequencing, so a number of routines have been rewritten and/or expanded to handle these artefacts correctly.

For example: one sometimes finds "genomic" reads in the data sets, which many people attribute to contamination with genomic sequence. While I agree that this is possible and sometimes likely, a couple of observations I made lead me to think that it's sometimes not contamination, but simply non-pre-processed, non-spliced mRNA caught by NGS. So I put some effort into keeping those things from being assembled into 'real' mRNA.

It is therefore not too surprising to me that some key numbers in the assemblies changed, though I cannot judge whether the numbers you see are "too large" or "OK".

> My first guess was that both versions give the same core contigs, and that
> they differ on shorter and less covered contigs, but on the contrary, when
> looking at it in more detail, the "specific contigs" from both assemblies
> were long and highly covered contigs (>1.2kb, >60 reads on average per
> contig), compared to the complete original data sets (850bp, 30 reads per
> contig on average).
>
> Can you comment on this? And maybe point me to other methods/metrics to
> look at in order to compare both assemblies?
What I've seen other people often do (in papers & posters) is simply to BLAST against the NR protein database and count the number of contigs which have "good" hits. I'm not sure whether this is the most accurate measure of assembly quality, but it's not a bad one per se, and as I have no other alternative, I'd also recommend doing that.

On the other hand, it might be interesting to find out what happened to the reads of contigs which are not present in the "other" assembly. What I'd do in your place would be to search for a couple of contigs which are "good" in your eyes (long and with good coverage) and see what happened to them in the other assembly. Of course, this is just "poking around" a bit and certainly not comprehensive, but it helps to get a feeling whether to trust or distrust one or the other assembly.

I'd be happy to hear what you find out there.

Best,
  Bastien
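(Editorial addition: the "BLAST against NR and count contigs with good hits" comparison above can be sketched in a few lines of Python. This is a minimal illustration, assuming BLAST+ tabular output with the default 12-column `-outfmt 6` layout; the e-value cutoff of 1e-10 and the toy data below are illustrative assumptions, not from the original mail.)

```python
def count_contigs_with_good_hits(blast_lines, evalue_cutoff=1e-10):
    """Count distinct query contigs with at least one hit passing the
    e-value cutoff, from BLAST+ tabular (-outfmt 6) lines.

    Column layout assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore (evalue is column 11).
    """
    good = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue <= evalue_cutoff:
            good.add(query)
    return len(good)

# Illustrative toy data (hypothetical contig/subject names):
sample = [
    "contig1\tsp|P12345\t98.0\t400\t8\t0\t1\t400\t1\t400\t1e-50\t500",
    "contig2\tsp|Q99999\t40.0\t100\t60\t2\t1\t100\t5\t104\t0.5\t30",
]
print(count_contigs_with_good_hits(sample))  # -> 1 (only contig1 passes)
```

Running the same counter on both assemblies' BLAST reports gives a single comparable number per assembly; what counts as a "good" hit (the cutoff) is of course a judgment call.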