Read distribution is a good bet.

I can't speak for Solexa data, but we've done a lot of 454 sequencing on various herpesviruses with a GC content around 70%. What we find is that coverage is good for the UL and US regions, which have a slightly lower GC content than the average. However, the RL and RS regions, which have a higher than average GC content (up to 80%), tend to be sparsely covered.

Here's my take on what's going on. We know from other work that getting PCRs to work in general on these viruses can be problematic. The RL/RS regions are particularly frustrating. Often extensive optimisation is needed for each and every target, and there is no universal set of conditions that can be applied. Since both sequencing methods employ PCR steps, I think there are just some regions that fail to amplify and therefore are not represented in the libraries.

So it may not have anything to do with MIRA at all - it's a bias in the sequencing techniques. Just have to wait for direct molecular sequencing to get around this one ;->

Shaun

From: Phillip San Miguel <pmiguel@xxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2011-05-17 10:16 AM
Subject: [mira_talk] High GC genomes and mira
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

On 5/16/2011 1:17 PM, Bastien Chevreux wrote:
> On May 16, 2011, at 15:16 , Phillip San Miguel wrote:
>
>> I tried MIRA V3.2.1.15 on a 70% GC bacterial genome (Deinococcus) at
>> around 100x coverage with solexa PE 101 base reads. My N50 contig
>> size was 4630 bases. That seems short to me, but it might be a result
>> of the 70% GC. So I decided to de novo assemble a 50% GC data set
>> from the same run.
>
> That's bad, really bad. You are the second report I get that
> apparently, MIRA has problems with high GC Solexa data sets. The first
> being a supersecret bug of a big company, I cannot get the data to see
> what's causing havoc. Would it be possible for me to have a look at
> that thing? No promises, but it might help.
>
> B.
>

Probably, just let me check with the owner of the sequences.

However, the short contig lengths may derive from something trivial: read distribution bias. An Eland/Gerald mapping of our Illumina Salmonella reads produces a reasonably even coverage depth across the genome. A similar mapping of our Illumina Deinococcus reads shows mostly 50-150x coverage, but also frequent regions with very low coverage (a few X coverage -- or zero).

--
Phillip
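
One way to check the read-distribution-bias idea raised in this thread is to compare local GC content with mapped coverage depth, window by window. The following is a minimal Python sketch, not anything taken from MIRA or from the posters' pipelines. The function names, the window size, and the depth cutoff are all illustrative assumptions, and it assumes you already have a per-base depth list for the reference (for example, parsed from the output of `samtools depth -a`).

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc_and_depth(reference, depth, window=1000):
    """Return (start, gc_fraction, mean_depth) for non-overlapping windows.

    reference: reference sequence as a string
    depth:     per-base coverage depth, same length as reference
    window:    window size in bases (illustrative default)
    """
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have the same length")
    rows = []
    for start in range(0, len(reference), window):
        chunk = reference[start:start + window]
        chunk_depth = depth[start:start + window]
        mean_depth = sum(chunk_depth) / len(chunk_depth)
        rows.append((start, gc_fraction(chunk), mean_depth))
    return rows


def low_coverage_windows(rows, min_depth=5.0):
    """Keep only windows whose mean depth is below min_depth (hypothetical cutoff)."""
    return [row for row in rows if row[2] < min_depth]


# Toy data: an AT-only stretch at 100x followed by a GC-only stretch at 2x.
toy_reference = "ATATATATAT" + "GCGCGCGCGC"
toy_depth = [100] * 10 + [2] * 10

toy_rows = windowed_gc_and_depth(toy_reference, toy_depth, window=10)
for start, gc, mean_depth in low_coverage_windows(toy_rows):
    print(f"window at {start}: GC={gc:.2f}, mean depth={mean_depth:.1f}")
```

If the flagged windows turn out to be consistently the GC-richest ones, that would be consistent with a bias in library preparation or amplification rather than a problem in the assembler.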