Dear Bastien,

Thanks for the suggestions -- I'll let you and the list know the outcome if I get the opportunity to test them. PacBio wasn't an option for this at the time. It might be in the future.

Cheers,
Bob

On 04/19/2014 06:45 AM, Bastien Chevreux wrote:
> On 19 Apr 2014, at 1:53, Robert Bruccoleri <bruc@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> I'm working on a bacterium which has a lot of apparently repeated sequences. The file in the "_d_info" directory whose name ends in _readrepeats.lst is 340 MB long. The repeat ratio tabulation in the log file shows that there are a lot of reads containing sequences repeated 100's of times relative to most of the genome.
>
> 100’s of times? You’re having some interesting bacteria there. If I had to guess: a non lab strain with at least 3 MB and lots of transposons / insertion elements. One or multiple phages may also be a possibility there.
>
>> […] My goal is to figure out what's repeated. […] but it's clear the contigs are chimeras. […] I haven't tried to optimize this manifest -- it gives me results that I can interpret, but I wonder if there's a better solution out there.
>
> We’re speaking about Illumina data, right? Then I would approach this from another angle. Use the read list in the debris file to extract the reads with repeats from your whole set (maybe just take the reads marked MNRr, venturing into HAF7 and HAF6 if you really want). Then treat these reads as if they were coming from transcripts and assemble them in EST mode.
>
> Note: just having been reminded of this in another thread on the list … switch off digital normalisation and the masking of nasty repeats for this special case.
>
> With that approach the normal security measures of MIRA for assuring data quality are kept working, and repeats having differences only in single bases are still kept apart, probably reducing the number of chimeras.
>
> As a bonus, you will be able to deduce the approximate number of copies for each repetitive element in the genome by putting the contig coverage numbers of the “transcript assembly” in relation to the average genome coverage you will have gotten from the whole genome assembly.
>
> WARNING: this assumes the DNA of your organism has been extracted in stationary phase. If this was done in exponential phase you have a couple of other problems.
>
> Hope that helps,
> Bastien
>
> PS: oh, and I suppose I do not need to point you at PacBio, do I?
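
For anyone wanting to try the read-extraction and copy-number steps described above, here is a minimal Python sketch. It is not MIRA's own tooling, and it rests on assumptions that should be checked against the real files: that the read list is whitespace-separated with the read name in the first column and the tag (MNRr, HAF7, ...) in the second, and that the reads are in plain four-line FASTQ. The function names are made up for illustration.

```python
def repeat_read_names(list_lines, wanted_tags=("MNRr",)):
    """Collect names of reads carrying one of the wanted tags.

    ASSUMPTION: each line is whitespace-separated, read name first,
    tag second. Verify this against the actual read list file.
    """
    names = set()
    for line in list_lines:
        if line.startswith("#"):
            continue  # skip comment/header lines, if any
        fields = line.split()
        if len(fields) >= 2 and fields[1] in wanted_tags:
            names.add(fields[0])
    return names


def filter_fastq(fastq_lines, names):
    """Yield four-line FASTQ records whose read name is in `names`.

    ASSUMPTION: unwrapped FASTQ, exactly four lines per record.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            parts = record[0][1:].split()  # drop '@', ignore any comment
            if parts and parts[0] in names:
                yield record
            record = []


def estimated_copies(contig_coverage, genome_coverage):
    """Repeat contig coverage relative to average genome coverage."""
    if genome_coverage <= 0:
        raise ValueError("genome_coverage must be positive")
    return contig_coverage / genome_coverage
```

The first two functions would be fed open file handles (the read list and the full FASTQ), and the selected records written out as input for the EST-mode assembly. The ratio from `estimated_copies` is only a rough guide, and is subject to the same stationary-phase caveat.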