[mira_talk] Assembling the readrepeats

  • From: Robert Bruccoleri <bruc@xxxxxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Fri, 18 Apr 2014 19:53:46 -0400

I'm working on a bacterium that has a lot of apparently repeated sequences. The file in the "_d_info" directory whose name ends in _readrepeats.lst is 340 MB long. The repeat ratio tabulation in the log file shows that there are a lot of reads containing sequences repeated hundreds of times relative to most of the genome.


My goal is to figure out what's repeated. It's not practical to go through all the reads in the _readrepeats.lst file, so I want to assemble the data in that file instead. In preparation, I have written a script that removes all duplicated sequences in _readrepeats.lst as well as all sequences that are subsequences of any other sequence. This script shrinks the problem many fold: the final sequence file is now 40 MB.
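For anyone who wants to try the same reduction, here is a minimal Python sketch of the idea; it is not the script I actually used. The names read_fasta and remove_contained are made up for illustration, the input is assumed to be plain FASTA already extracted from _readrepeats.lst, and reverse complements are not considered. The containment check is a naive all-against-kept comparison, so it would probably need a smarter index to cope with hundreds of MB of reads.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def remove_contained(seqs):
    """Drop exact duplicates and any sequence that is a substring of another.

    Sequences are visited longest first, so a sequence only has to be
    compared against the (longer or equal-length) sequences already kept.
    """
    unique = sorted(set(seqs), key=lambda s: (-len(s), s))
    kept = []
    for s in unique:
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    demo = [">r1", "ACGT", ">r2", "acgt", ">r3", "CG", ">r4", "TTTT", ">r5", "ACG", "TA"]
    reads = [seq for _, seq in read_fasta(demo)]
    for i, seq in enumerate(remove_contained(reads), 1):
        print(f">seq{i}\n{seq}")
```

Sorting longest first means a read contained in a dropped read is still caught, because the dropped read is itself contained in something that was kept.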

The question is, what parameters should I use for assembling these nasty reads?

Here's the manifest file I've tried (backslashes have been removed for clarity's sake):

project = repeats

job = denovo,genome,accurate

parameters = COMMON_SETTINGS
             -SK:bph=31:mmhr=15
             -HS:mnr=no:ldn=yes:fenn=0.0001:fexn=100:fer=101:fehr=102:fecr=103
             -GE:not=1
             -NW:cnfs=no:cmrnl=no:cac=warn
             TEXT_SETTINGS
             -CL:pec=no
             -AS:epoq=no:mrpc=1

readgroup
data = fa::repeats.fasta
technology = text
strain = Illumina

but it's clear that the contigs are chimeras. Blasting and reviewing the resulting output is very feasible: there are only 350 contigs and singlets to look at. The final contig FASTA file is 214 kb.

I haven't tried to optimize this manifest -- it gives me results that I can interpret, but I wonder if there's a better solution out there.

If anyone has tried doing this, I'd appreciate hearing about your experience.

Thanks. --Bob



