[mira_talk] Re: small contigs in first pass

  • From: "Reith, Michael" <Michael.Reith@xxxxxxxxxxxxxx>
  • To: <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 14 Jan 2010 20:27:10 -0500

Thanks Bastien, a very useful suggestion.  I also noticed in looking through 
the contigreadlist that many of my debris contigs were composed of the 2 
complementary reads from a mate pair (my Illumina reads are 3 kb mate pairs).  
So it looks like a fair number (4-5%) of the sequencing templates were very 
short fragments (<100 bp), which seems rather suspicious.  I'm weeding out all 
of these now.  I'll see how the mira run goes and if it still needs help, I'll 
use your approach to further refine the useful reads.  Thanks,

Mike
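
A rough sketch of the check described above, for anyone wanting to repeat it: 
it looks for contigs made of exactly two reads that are the two mates of one 
template. It assumes (this is not from the MIRA documentation) that the 
contig/read pairs have already been parsed out of contigreadlist and that mates 
are named "<template>/1" and "<template>/2"; the function name 
mate_pair_debris is made up for illustration. Adjust the pattern to your own 
read naming scheme.

```python
import re
from collections import defaultdict

# ASSUMPTION: mates are named "<template>/1" and "<template>/2".
MATE_SUFFIX = re.compile(r"^(.*)/([12])$")


def mate_pair_debris(contig_read_pairs):
    """Return templates whose two mates alone form a 2-read contig.

    contig_read_pairs: iterable of (contig_name, read_name) tuples,
    e.g. parsed from the contigreadlist info file.
    """
    reads_by_contig = defaultdict(list)
    for contig, read in contig_read_pairs:
        reads_by_contig[contig].append(read)

    suspicious = set()
    for reads in reads_by_contig.values():
        if len(reads) != 2:
            continue  # only 2-read contigs are of interest here
        first = MATE_SUFFIX.match(reads[0])
        second = MATE_SUFFIX.match(reads[1])
        if (first and second
                and first.group(1) == second.group(1)     # same template
                and first.group(2) != second.group(2)):   # opposite mates
            suspicious.add(first.group(1))
    return suspicious
```

The returned template names can then be used to drop both mates from the 
input before restarting the assembly.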

-----Original Message-----
From: mira_talk-bounce@xxxxxxxxxxxxx on behalf of Bastien Chevreux
Sent: Thu 1/14/2010 2:43 PM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: small contigs in first pass
 
On Mittwoch 13 Januar 2010 Reith, Michael wrote:
> I'm doing an assembly of a lower eukaryote (genome ~35 Mb) using 454
>  sequences and 76 bp Illumina reads (~1.2M & 20M sequences, respectively). 
>  Mira is just in the first pass through the data, but has been writing the
>  contigs of the *_out_pass1.caf for more than a day now.  The first 3500 or
>  so contigs look to be useful (>500 bp, something approaching the expected
>  coverage), but since that point the vast majority of the contigs are short
>  with low coverage and recently, they're mostly 2 Illumina reads.  I'm now
>  past contig 60000 and there still appears to be a long way to go (>1.4M
>  unused reads...= 700,000 2 read contigs?).  I'm wondering if there's a
>  command line switch I can use to avoid the generation of these small,
>  probably useless contigs during the mira run (I know they can be filtered
>  out afterward).  Or should I just use a half or a quarter of the Illumina
>  reads in doing the assembly?  Any help or advice would be appreciated.

Hi Mike,

these "contig debris" are a problem at the moment and absolutely typical for 
Illumina. Setting up contig structures for assembly of a new contig in MIRA is 
a pretty costly operation and these small things are really ruining 
the day. I have that on my list of TODOs, but it'll take a while still.

There's one thing I can reassure you about: you won't get 700k contigs. At 
some point, there will only be singlets left and MIRA then kicks them out 
pretty fast.

What I would propose in the meantime is this: once the first pass is done, 
stop MIRA. Then, in the log directory, search for the info files "contigstats" 
and "contigreadlist". In contigstats you can determine which contigs of the 
first pass you want to keep; in contigreadlist you get the reads these contigs 
are composed of. Use these to make a list of reads you want to keep, then use 
the "fastaselect.tcl" script of the MIRA package to create a new input file 
from the original input and the list of reads you want to keep.

Should take you an hour or so of work. Then restart the assembly with that. 
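
The list-building step above could be sketched roughly like this in Python. 
This is only an illustration under assumptions, not part of MIRA: it assumes 
both info files are tab-separated text with "#" comment lines, that 
contigstats has the contig name, length and number of reads in columns 0, 1 
and 3, and that contigreadlist has one "contig <tab> read" pair per line. 
Check the headers of your own files and adjust the column indices; the 
function names and thresholds are made up.

```python
def select_contigs(contigstats_lines, min_length=500, min_reads=3,
                   name_col=0, length_col=1, reads_col=3):
    """Return the names of contigs worth keeping.

    ASSUMPTION: tab-separated columns; the default column indices are a
    guess and must be checked against the header of your contigstats file.
    """
    keep = set()
    for line in contigstats_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        try:
            length = int(fields[length_col])
            n_reads = int(fields[reads_col])
        except (IndexError, ValueError):
            continue  # skip lines that do not fit the assumed layout
        if length >= min_length and n_reads >= min_reads:
            keep.add(fields[name_col])
    return keep


def reads_for_contigs(contigreadlist_lines, keep):
    """Return read names belonging to the kept contigs, in file order.

    ASSUMPTION: each data line is "contig_name<TAB>read_name".
    """
    reads, seen = [], set()
    for line in contigreadlist_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        contig, read = fields[0], fields[1]
        if contig in keep and read not in seen:
            seen.add(read)
            reads.append(read)
    return reads
```

Both functions take any iterable of lines, so open file handles can be passed 
in directly. Writing the returned read names one per line gives the list of 
reads to keep, which is then handed to fastaselect.tcl together with the 
original input.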

Regards,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
