Hi Rameez,
First, I would try the newest version of MIRA; hopefully the assembly
will improve.
I am not sure it is necessary to remove PCR duplicates. That helps
mainly in extreme cases, such as libraries enriched by DNA array
capture. In your case, Prinseq reports 1% duplicates, which is expected
given the coverage.
I would worry more about the genome coverage distribution, since your
bug seems extremely GC-rich. Check the alignments of the largest contigs
or make some coverage files. If the coverage swings erratically between
very high and very low, you cannot expect a perfect de novo assembly,
but you can try assembling in EST mode (you risk more misassemblies, so
do not naively do synteny analyses, but the genic part should be
assembled into longer contigs).
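To make the coverage check concrete, here is a rough Python sketch that
averages depth in fixed windows and reports how uneven the windows are.
It assumes a tab-separated per-base depth table with the columns contig,
1-based position and depth (the layout that `samtools depth -a` writes,
for example). The function names, the file name and the window size are
only illustrative, not anything MIRA itself provides.

```python
from collections import defaultdict
from statistics import mean, pstdev


def window_coverage(lines, window=1000):
    """Mean depth per fixed-size window.

    `lines` yields 'contig<TAB>position<TAB>depth' records with 1-based
    positions (assumed input layout). Positions missing from the input
    are treated as depth 0 within the observed span of each contig.
    """
    sums = defaultdict(int)
    last_pos = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, pos, depth = fields[0], int(fields[1]), int(fields[2])
        sums[(contig, (pos - 1) // window)] += depth
        last_pos[contig] = max(last_pos[contig], pos)

    means = {}
    for (contig, idx), total in sums.items():
        start = idx * window
        # The last window of a contig may be shorter than `window`.
        length = min(window, last_pos[contig] - start)
        means[(contig, idx)] = total / length
    return means


def coverage_spread(means):
    """Coefficient of variation (stdev / mean) of the window means."""
    values = list(means.values())
    avg = mean(values)
    return pstdev(values) / avg if avg else float("inf")


if __name__ == "__main__":
    import os

    path = "depth.tsv"  # hypothetical file name
    if os.path.exists(path):
        with open(path) as handle:
            per_window = window_coverage(handle, window=1000)
        print("windows:", len(per_window))
        print("spread (CV):", round(coverage_spread(per_window), 3))
```

A spread close to 0 means fairly even coverage; what counts as "too
uneven" is a judgment call, so it is best to look at the per-window
values of the largest contigs next to their GC content as well.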
Also, given the high GC%, the possibly very-low-coverage areas and the
higher rate of sequencing errors, try another assembly with the
complete dataset (disable MIRA's warning on high coverage).
Cheers,
Andrej
On 06/16/2015 10:36 AM, Rameez Mj wrote:
I have a bacterial whole-genome project going on with the Ion Torrent Proton 200 bp platform. I analysed my data with prinseq. As MIRA reported a high average coverage (129x), I removed exact duplicates with prinseq-lite and extracted 1800000 reads randomly using a Python script, "subsampler". MIRA assembles it into >3000 contigs (details are in the assembly log). The result of the data analysis using prinseq, the assembly log and the result info generated by MIRA are attached.
Could an experienced person kindly advise whether it is wise to continue with this data? How hard will it be for me to complete this project successfully? What are the best tools and methods I can use for this?
I know the question is raw, but I expect something from you. Thanking you in advance.