[mira_talk] de novo plant genome

  • From: Tom <twl8n@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 16 Dec 2009 14:51:30 -0500

Hi,

I'm a software guy making an assembly for the cowpea. Our group has
vector-trimmed, quality-trimmed, methyl-filtered gene space reads (GSRs)
from Sanger sequencing. MIRA does a good job assembling the 225,000
reads into approximately 61,000 contigs and 1500 singlets. The MIRA docs
are great, but as a software guy I'm a bit weak on stats and molecular
biology. We've got a 16-processor machine with 128GB of RAM, so I've got
nice hardware to play with.

With the first pass done, I'd like to improve/extend the contigs with
140,000 cowpea ESTs from HarvEST and with 51,000 BAC end sequences
(BES) from the Legume Information System.

Several things "worked", but I'm not sure what I should have done, and
I'm not sure how to evaluate the quality of the assembly.

Which is better, or more logical (I've tried both):

1) Throwing the GSRs and ESTs into one big file, then running MIRA as
"genome,denovo".

2) Two steps: (a) contig the GSRs, (b) then map the ESTs on using the
unpadded.fasta from (a) as the backbone.


My third question is basically what to do about repeats in the BES. When
I tried throwing the GSR contigs into a big fasta file with the BES,
MIRA complained about 1 megahub. I'm still adjusting nrr to see if I can
clear that up. The combined fasta file is simply:

cat gsr.fasta bes.fasta > mira_input.fasta
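
For reference, the same concatenation can be sketched in Python, with a check that plain cat does not do: counting the records and flagging sequence names that appear more than once across the inputs. This is only a minimal sketch; the function name combine_fasta is hypothetical, and it assumes the sequence name is the first whitespace-delimited token after ">" on each header line.

```python
from collections import Counter


def combine_fasta(inputs, output):
    """Concatenate FASTA files into `output`.

    Returns (record_count, duplicate_names). The name of a record is
    assumed to be the first whitespace-delimited token after '>'.
    """
    names = Counter()
    with open(output, "w") as out:
        for path in inputs:
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        names[fields[0] if fields else ""] += 1
                    out.write(line)
                    last = line
            # Guard against a file that lacks a trailing newline, which
            # would otherwise glue its last line to the next header.
            if not last.endswith("\n"):
                out.write("\n")
    duplicates = sorted(name for name, count in names.items() if count > 1)
    return sum(names.values()), duplicates


if __name__ == "__main__":
    # File names taken from the cat command above.
    total, dups = combine_fasta(["gsr.fasta", "bes.fasta"], "mira_input.fasta")
    print(f"{total} records written, {len(dups)} duplicated names")
```

Whether duplicated names actually matter for the assembler is a separate question; the check just makes them visible before the combined file is used.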

Thanks,
Tom
