[mira_talk] Re: Confirming if mate pair information from paired-end Sanger data is being considered during assembly process

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 6 Mar 2012 00:40:35 +0100

On Mar 5, 2012, at 20:53 , Nathan McNulty wrote:
> I'm currently working on assembling a microbial genome (~7.1 Mbp) using a 
> combination of paired-end Sanger data and 454 (FLX+, non-PE) data.
> [...]
> Generated 363171 unique template ids for 384224 valid reads.
> Done merging XML data, matched 333883 reads.
> [...]
> However, I can't find any other sections in the log file that describe the 
> incorporation of the mate pair information.

Correct, that is the only place to look when one wants to know whether the 
paired-end information was loaded correctly.

> I was really hoping someone could tell me how I might confirm that the mate 
> pair info is being taken into consideration in the actual assembly process.  
> Is there somewhere in the log file I should look?  In one of the other MIRA 
> outputs?

The normal behaviour of MIRA is that when paired-end info is available, it 
will be used. You can ascertain this at two different places:
1) in the parameter dump at the start of the assembly process, make sure that 
-GE:uti=yes (it could be "no", meaning that template info would not be used 
even if present and loaded)
2) in the contig building process, look out for lines containing 't' or 's'. 
E.g., if you see lines like

   [435] +++++a++++tt+++s++++a++++st++++++++  2078    4 / 34 / 3

then the paired-end info is being used (the easiest way to find them is 
probably to grep the log file for this kind of line).
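
If grep is not at hand, the two checks above can be sketched in a few lines 
of Python. This is only a rough sketch: the shape of the contig-building 
line is guessed from the single example above, the function name scan_log 
is made up for illustration, and since the exact wording of the parameter 
dump is not shown here, the script simply collects every line that mentions 
"uti" as a standalone token so that you can read the setting yourself.

```python
import re
import sys

# Assumed shape of a contig-building line, inferred from the one example
# above: "[435] +++++a++++tt+++s++++  2078 ...". Real logs may differ.
PROGRESS_LINE = re.compile(r"^\s*\[\d+\]\s+([+a-z]+)\s")

# "uti" as a standalone token, e.g. in "-GE:uti=yes" (but not in "execution").
UTI_TOKEN = re.compile(r"\buti\b")


def scan_log(lines):
    """Return (uti_lines, template_lines) found in an iterable of log lines."""
    uti_lines, template_lines = [], []
    for line in lines:
        line = line.rstrip("\n")
        if UTI_TOKEN.search(line):
            uti_lines.append(line)
        match = PROGRESS_LINE.match(line)
        # Keep only progress lines whose flag string contains 't' or 's'.
        if match and ("t" in match.group(1) or "s" in match.group(1)):
            template_lines.append(line)
    return uti_lines, template_lines


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py <path to the MIRA log file>
    with open(sys.argv[1], errors="replace") as handle:
        uti, tmpl = scan_log(handle)
    print("Lines mentioning 'uti':")
    for entry in uti:
        print("  " + entry)
    print(f"Contig-building lines with 't' or 's': {len(tmpl)}")
```

A count of zero in the last line of output would suggest that no such lines 
were found, at least under the assumed line format.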

> However, I find that the results for #1 and #2 look pretty much the same.  
> It's hard to believe that the mate pair information wouldn't significantly 
> improve my assembly

Depends entirely on your organism and the size of the paired-end library. E.g., 
I know a nice little 4 Mb bug which, without any paired-end data at all, 
assembles nicely into ~15 contigs just by using older 454 FLX and Solexa 36bp 
reads. Of the 14 gaps, 11 or 12 are rRNAs with a repeat size of ~6kb, so even 
the "standard" 3kb libraries would not really help in assembly (though perhaps 
in scaffolding).

At the other end of the scale are beasts (which can even be small) with lots 
and lots of 100% identical repeats ... there one is lost without paired-end 
data.

B.

