[mira_talk] Re: Where are my scaffolds?

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 17 Aug 2009 00:09:56 +0200

On Friday 14 August 2009 Marcin Swiatek wrote:
> I seem to have difficulties getting good results from Mira. Or perhaps
> 'expected' would be a better word. Here is my story: I am trying to
> assemble the genome of a strain of a Lactobacillus bacteria. It is a
> naughty little microbe [...]

Hi Marcin,

welcome to the club. Lactobacillus has been a nightmare project for me too,
especially as I had no paired-end data at the time.

> I got decently looking results, but there is one thing I
> don't understand: where all these paired ends went? They are in the input
> files I think, I saw these reads in the generated traceinfo file...

My first guess would be: in the contigs, where they belong.

> However, while both Celera and Newbler produced contigs *and* scaffolds, in
> Mira's output I find contigs only.

In the beginning, the users I know liked to combine MIRA with dedicated
scaffolders (BAMBUS, own scripts etc.), so I never really felt the urge to
implement a scaffolder of my own. This has changed considerably, as requests
for a scaffolder have noticeably increased in the past year. I think I'll have
to cave in at some point: not for the 3.0 version which I'm finalising at the
moment, but it's now pretty high on the TODO.

In the meantime: some time ago I asked a few people who I know use the
AMOS scaffolder to write a short HOWTO for data coming from MIRA, but I
haven't heard back from any of them so far.

> Contigs computed by Mira (using
> 'accurate') are quite similar in number and size distribution to what I get
> with other assemblers, but I see no scaffolds and no evidence of use of
> paired end data.
> [...]
> Now the questions. Firstly, how do I tell if paired ends were indeed used
> or not. Secondly, if they weren't, how do I go about putting them to use.

MIRA uses them without making too much noise about it. One way for you to 
check: in the output, there's a line saying

   Generated XXX unique template ids for YYY valid reads.

If XXX is smaller than YYY, then MIRA has assigned read-pairs to templates and
uses that information later on in the assembly. You can also see this in the
progress display during contig construction:

[1321] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  
[1381] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++s+  
[1440] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  
[1500] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  
[1560] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  
[1620] ++++++++++++++++++++++++++++++++++++++++++++++++++t+t+t+++++  

A "+" shows a read assembled without problems, "s" means a read has been
rejected at a given contig position because of a template size violation, and
"t" means it was rejected because of a template direction violation. So you'll
see the template usage only when there's a (temporary) problem during
construction; all other reads are assembled without further notice.
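
If you'd rather not scan the log by eye, both checks can be scripted. What
follows is only a minimal sketch in Python, not part of MIRA: it assumes the
summary line is worded exactly as quoted above and that every progress line
has the shape "[number] status-characters" as in the excerpt. The function
name summarize_log is made up for illustration.

```python
import re

# Assumed wording of the summary line, copied from the message above.
TEMPLATE_LINE = re.compile(
    r"Generated (\d+) unique template ids for (\d+) valid reads"
)
# Assumed shape of a progress line: "[offset]" followed by status characters.
PROGRESS_LINE = re.compile(r"^\[(\d+)\]\s+(\S+)\s*$")


def summarize_log(lines):
    """Collect template/read counts and tally the per-read status characters."""
    summary = {"templates": None, "reads": None, "pairs_used": None, "status": {}}
    for line in lines:
        m = TEMPLATE_LINE.search(line)
        if m:
            summary["templates"] = int(m.group(1))
            summary["reads"] = int(m.group(2))
            # Fewer templates than reads means some reads share a template.
            summary["pairs_used"] = summary["templates"] < summary["reads"]
            continue
        m = PROGRESS_LINE.match(line)
        if m:
            # "+" = assembled, "s" = size violation, "t" = direction violation.
            for ch in m.group(2):
                summary["status"][ch] = summary["status"].get(ch, 0) + 1
    return summary
```

Since the function accepts any iterable of lines, you can pass it an open
file handle on the saved MIRA output. A real log may contain status
characters other than the three explained above; they are simply counted
under their own key.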

> And if they were, why don't I see scaffolds (or longer contigs with little
> gaps in them).

Because there's no scaffolder. As I wrote, it's on the TODO. In the meantime,
I'd suggest using the one from AMOS, as I've heard it works quite well (I have
never used it myself, though).

> I will have another query, but I think I will try that one by one.

No problem.


Best,
  Bastien

