[mira_talk] Re: Making the best assembly

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 7 Dec 2015 01:11:16 -0500

On 06 Dec 2015, at 23:00 , C Jenkins <cej.jenkins@xxxxxxxxx> wrote:

I have a largely undescribed species of a trematode parasite. It is similar
in life cycle to Schistosoma mansoni.
I have 454 and Illumina single-end reads from 4 different populations. I need
to first create a reference transcriptome.
The Illumina data is... rough. I first assembled it using Trinity, and found
only 531 contigs... which is orders of magnitude fewer than I expected.
So I used MIRA to do a 454 assembly, an Illumina assembly and a hybrid
assembly. Now I'm trying to figure out which, if any, is any good.
[…]

We are talking about EST/cDNA/RNASeq, right? Because for eukaryotic genomes,
MIRA is definitely not the right tool.

First things first: if you used MIRA 4.0.x, then give the current development
version a try. It’s light years ahead of 4.0.
Second: that table of yours … there’s something I do not understand: the
columns. E.g.: the 454 assembly has 33k reads but a coverage of 43k? Or: the
Illumina assembly really has only 72k reads as input?
Third: for the Illumina assembly, did you give MIRA “unprocessed” reads? This
is recommended.

What I normally do for RNASeq assemblies with Illumina: I take a very small
subset (100k reads or so) and assemble that to see whether there are
unexpected things, e.g. an unfiltered library with 80% rRNA or similar funny
surprises. Then a quick run with 1m reads and, if all seems OK, I generally
start the assembly with 10 to 15 million read pairs (20 to 30m reads), as this
is generally regarded as the sweet spot for transcriptome assemblies.
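
For drawing such a subset, here is a minimal sketch in Python. It is not part
of MIRA; the function names are made up for illustration, and it assumes plain,
uncompressed FASTQ with exactly 4 lines per record (no wrapped sequence lines).
It uses reservoir sampling, so the file is read once and only the subset is
held in memory.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            # End of file (or a truncated trailing record, which is dropped).
            return
        yield record


def subsample_fastq(in_path, out_path, n, seed=42):
    """Write a uniform random sample of up to n records; return the count."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    reservoir = []
    with open(in_path) as fin:
        for i, record in enumerate(read_fastq(fin)):
            if i < n:
                # Fill the reservoir with the first n records.
                reservoir.append(record)
            else:
                # Replace an existing entry with probability n / (i + 1).
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = record
    with open(out_path, "w") as fout:
        for record in reservoir:
            fout.writelines(record)
    return len(reservoir)
```

A call such as subsample_fastq("reads.fastq", "subset_100k.fastq", 100000)
would give the 100k test set, and the same call with 1000000 the 1m set (both
file names are placeholders). For paired-end data the two mates of a pair
would have to be sampled together, which this sketch does not handle.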

I haven’t tried 4.9.x on 454 data though, so I cannot predict its performance
there.

B.



