[mira_talk] Re: Assembling the readrepeats

  • From: Robert Bruccoleri <bruc@xxxxxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 24 Apr 2014 19:22:09 -0400

Dear Bastien,
Thanks for the suggestions -- I'll let you and the list know the outcome if I get the opportunity to test them. PacBio wasn't an option for this at the time. It might be in the future.

    Cheers,
    Bob

On 04/19/2014 06:45 AM, Bastien Chevreux wrote:
On 19 Apr 2014, at 1:53 , Robert Bruccoleri <bruc@xxxxxxxxxxxxxxxxxxxxx> wrote:
I'm working on a bacterium which has a lot of apparently repeated sequences. The file in 
the "_d_info" directory whose name ends in _readrepeats.lst is 340 MB long. The 
repeat ratio tabulation in the log file shows that there are a lot of reads containing 
sequences repeated 100's of times relative to most of the genome.

100’s of times? You have some interesting bacteria there. If I had to 
guess: a non-lab strain with a genome of at least 3 Mb and lots of transposons / 
insertion elements. One or more phages may also be a possibility.

[…]
My goal is to figure out what's repeated.
[…]
but it's clear the contigs are chimeras.
[…]
I haven't tried to optimize this manifest -- it gives me results that I can 
interpret, but I wonder if there's a better solution out there.

We’re talking about Illumina data, right? Then I would approach this from 
another angle.

Use the read list in the debris file to extract the reads with repeats from 
your whole set (maybe just take the reads marked MNRr, venturing into HAF7 and 
HAF6 if you really want). Then treat these reads as if they were coming from 
transcripts and assemble them in EST mode. Note, as I was just reminded in 
another thread on the list: switch off digital normalisation and the 
masking of nasty repeats for this special case.
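
The extraction step could be sketched roughly as below. This is only an illustration under stated assumptions, not MIRA's own tooling: it assumes the read list is whitespace-separated text with the read name in the first column and a tag such as MNRr or HAF7 in a later column, and that the reads are in plain four-line FASTQ. The function names `repeat_read_names` and `filter_fastq` are hypothetical; check the actual column layout of your list file before relying on it.

```python
from typing import Iterable, Iterator, Set, Tuple

def repeat_read_names(list_lines: Iterable[str],
                      wanted_tags: Iterable[str] = ("MNRr",)) -> Set[str]:
    """Collect names of reads carrying one of the wanted repeat tags.

    Assumption: each line is whitespace-separated, with the read name
    first and the tag somewhere in the remaining columns.
    """
    wanted = set(wanted_tags)
    names: Set[str] = set()
    for line in list_lines:
        if line.startswith("#"):
            continue  # skip comment lines, if the file has any
        fields = line.split()
        if len(fields) >= 2 and wanted.intersection(fields[1:]):
            names.add(fields[0])
    return names

def filter_fastq(fastq_lines: Iterable[str],
                 names: Set[str]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield four-line FASTQ records whose read name is in `names`."""
    it = (ln.rstrip("\n") for ln in fastq_lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, ""), next(it, ""), next(it, "")
        parts = header[1:].split()  # drop leading '@', ignore the description
        if parts and parts[0] in names:
            yield (header, seq, plus, qual)
```

To widen the selection as suggested, pass `wanted_tags=("MNRr", "HAF7", "HAF6")`. The yielded records can be written out line by line to a new FASTQ file, which then serves as input for the EST-mode assembly.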

With that approach, MIRA's normal safeguards for data quality keep working, 
and repeats that differ only in single bases are still kept apart, probably 
reducing the number of chimeras. As a bonus, you will be able to deduce the 
approximate number of copies of each repetitive element in the genome by 
relating the contig coverage numbers of the “transcript assembly” to the 
average genome coverage from the whole genome assembly. WARNING: this assumes 
the DNA of your organism was extracted in stationary phase. If it was 
extracted in exponential phase, you have a couple of other problems.
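
The copy-number estimate described above is a simple ratio, which a minimal sketch makes explicit. The function name `estimate_copy_numbers`, the contig names, and the coverage values are all made up for illustration; the real numbers would come from the two assemblies.

```python
from typing import Dict

def estimate_copy_numbers(contig_coverage: Dict[str, float],
                          genome_avg_coverage: float) -> Dict[str, float]:
    """Approximate copies per genome for each repeat contig.

    copies ~= coverage of the contig in the "transcript assembly"
              / average coverage of the whole genome assembly
    """
    if genome_avg_coverage <= 0:
        raise ValueError("average genome coverage must be positive")
    return {name: cov / genome_avg_coverage
            for name, cov in contig_coverage.items()}

# Hypothetical example values, not taken from any real assembly.
example = estimate_copy_numbers({"rep_c1": 300.0, "rep_c2": 45.0}, 30.0)
print(example)  # {'rep_c1': 10.0, 'rep_c2': 1.5}
```

As the warning above says, the ratio is only meaningful if coverage is roughly uniform across the genome, which is the stationary-phase assumption.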

Hope that helps,
   Bastien

PS: oh, and I suppose I do not need to point you at PacBio, do I?





