I'm working on a bacterium that has a lot of apparently repeated sequences. The file in the "_d_info" directory whose name ends in _readrepeats.lst is 340 MB. The repeat ratio tabulation in the log file shows that many reads contain sequences repeated hundreds of times relative to most of the genome.
My goal is to figure out what's repeated. Going through all the reads in the _readrepeats.lst file by hand isn't practical, so I want to assemble the data in that file instead. In preparation, I have written a script that removes all duplicated sequences in _readrepeats.lst, as well as all sequences that are subsequences of any other. This reduces the size of the problem many-fold: the final sequence file is now 40 MB.
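For what it's worth, the redundancy filter described above can be sketched in a few lines of Python. This is a minimal, naive version (the function name and the O(n^2) substring scan are my own; a real script over a 340 MB file would want a suffix- or k-mer-based index, and a proper FASTA parser):

```python
def remove_redundant(seqs):
    """Drop exact duplicates, then drop any sequence that is a
    subsequence (substring) of a longer surviving sequence."""
    # Deduplicate, then process longest-first so each candidate only
    # needs to be checked against sequences at least as long as itself.
    unique = sorted(set(seqs), key=len, reverse=True)
    kept = []
    for s in unique:
        # s is redundant if it already occurs inside a kept sequence
        if not any(s in k for k in kept):
            kept.append(s)
    return kept
```

For example, given reads ["ACGT", "ACGT", "CG", "TTTT"], the duplicate "ACGT" and the contained "CG" are dropped, leaving two sequences.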
The question is, what parameters should I use for assembling these nasty reads?
Here's the manifest file I've tried (backslashes have been removed for clarity's sake):
project = repeats
job = denovo,genome,accurate
parameters = COMMON_SETTINGS
    -SK:bph=31:mmhr=15
    -HS:mnr=no:ldn=yes:fenn=0.0001:fexn=100:fer=101:fehr=102:fecr=103
    -GE:not=1
    -NW:cnfs=no:cmrnl=no:cac=warn
    TEXT_SETTINGS
    -CL:pec=no
    -AS:epoq=no:mrpc=1

readgroup
data = fa::repeats.fasta
technology = text
strain = Illumina

but it's clear the contigs are chimeras. Blasting and reviewing the resulting output is very feasible -- there are only 350 contigs and singlets to look at. The final contig FASTA file is 214 kb.
I haven't tried to optimize this manifest -- it gives me results that I can interpret, but I wonder if there's a better solution out there.
If anyone has tried doing this, I'd appreciate hearing about your experience.
Thanks. --Bob
begin:vcard
fn:Robert Bruccoleri
n:Bruccoleri;Robert
org:Audacious Energy, LLC and Congenomics, LLC
adr:;;;;;;USA
email;internet:bruc@xxxxxxx
title:President
version:2.1
end:vcard