I have an interesting and difficult assembly that I'm attempting with Mira. I'm working with a bacterium that has a large number of nonribosomal peptide synthetases (NRPS) and polyketide synthases (PKS), and many domain and gene duplications have occurred during the course of its evolution. The bacterium has a GC content in excess of 70%.
One gene in this bacterium has a large number of domains, some of which are duplicated exactly within the gene (identical stretches of >500 bp). From the chemical structure of the compound made by this gene, I have a good idea of what the domain structure ought to be.
We have an extensive collection of data, both 454 and Illumina, for this bacterium. For Illumina, we have paired-end data of various lengths. I've been experimenting with different combinations of data to see if I can get a complete assembly of the gene of interest above.
Just recently, I started a 'normal' Mira run using 3.4rc2, and I enabled intermediate FASTA output at every pass. On the second pass, Mira generated my gene with the expected pattern of domains. However, on succeeding passes, it eliminated some of the repetitive sequences, and at the end of the run, I had lost about 30% of the expected domains.
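One way to quantify this kind of loss is to count how many exact copies of a known domain sequence appear in the contigs written at each pass. The following is a minimal Python sketch, assuming the per-pass FASTA output is available as plain text and that the sequence of a domain of interest is known. The function names are illustrative and are not part of Mira.

```python
# Count exact copies of a probe sequence (e.g. a duplicated domain) in each
# contig of a FASTA file. All names here are hypothetical helpers.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines):
    """Parse an iterable of FASTA lines into {contig_name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_hits(contig, probe):
    """Count exact matches of probe on either strand, overlaps included."""
    total = 0
    # A set avoids double counting when the probe is its own reverse complement.
    for pattern in {probe, revcomp(probe)}:
        start = contig.find(pattern)
        while start != -1:
            total += 1
            start = contig.find(pattern, start + 1)
    return total


def copies_per_contig(fasta_lines, probe):
    """Return {contig_name: number of exact probe copies}."""
    probe = probe.upper()
    return {name: count_hits(seq, probe)
            for name, seq in read_fasta(fasta_lines).items()}
```

Calling `copies_per_contig(open(path), domain_seq)` once per intermediate FASTA file (with `path` and `domain_seq` supplied by the user) would show at which pass the copy count of a given domain drops. Because it only finds exact matches, copies that differ by even one base are not counted.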
Has anyone else run into issues like these? How can I control the decision-making with regard to repeats? Is there any way of having Mira report a graph of the possible assemblies (like Allpaths)? (BTW, I don't have data that is suitable for Allpaths.)
Thanks. --Bob
Robert Bruccoleri
President, Audacious Energy, LLC and Congenomics, LLC, USA
bruc@xxxxxxx