[mira_talk] Re: Mapping on a large and repeated genome

  • From: Magalie.LEVEUGLE@xxxxxxxxxxxx
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 19 Oct 2011 17:40:23 +0200

Hi Bastien,

I was afraid you would answer something like that :D

Thank you very much for your fast answer. I will consider other options 
for my mapping, but will stay aware of your next developments; maybe 
your 'radar' will 'blip' soon on this issue!


Cheers,

Magalie



From:   Bastien Chevreux <bach@xxxxxxxxxxxx>
To:     mira_talk@xxxxxxxxxxxxx
Date:   18/10/2011 22:15
Subject:        [mira_talk] Re: Mapping on a large and repeated genome
Sent by:        mira_talk-bounce@xxxxxxxxxxxxx



On Oct 18, 2011, at 18:23, Magalie.LEVEUGLE@xxxxxxxxxxxx wrote:
I am trying to use mira 3.4 to map 1.5 million 454 Titanium reads on a 
large (2 Gb) and mostly repeat-containing plant genome. 

I know that mira is not currently optimised for this type of genome, ...

It indeed is not.

{...}
So I suspect the delay is caused by those "a"s and by the large numbers 
separated by '/'; I've noticed a few lines with a very large number in 
the first position too: 

[410891] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 300239183       38304 / 9405835 / 0 

The problem is not the 'a's per se, but rather the pretty large numbers 
beside the line (which are timing information).

As each trial takes at least a few days, I would like to know what I 
could change in my parameters now, or whether I should try the 
development version mentioned in the other thread? 

I can send my log file if it helps... 

No, it would not ... the above lines told me everything I needed to know.

MIRA is indeed currently not really suited for this use case. The problem 
lies in the rather simplistic data structures with which contigs are 
represented internally. They're pretty good for de-novo assembly, where 
contigs reach a couple of hundred kilobases, but they start to fail in the 
megabase range, and certainly so when contigs go well above 10 megabases.

Actually, I have to rephrase that: they fail big time as soon as 
insertions or deletions need to be done, which is certainly the case when 
mapping 454 data. For Illumina data one does not really feel the problem, 
as there are not too many indels.
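
To make that concrete: with a flat array-of-columns representation (a 
generic sketch, not MIRA's actual internals), every insertion has to 
shift all downstream columns, so a single indel costs time proportional 
to the contig length:

import time

def time_insertions(contig_length, n_indels=100):
    # A contig stored as one flat list with one entry per consensus
    # column. Inserting a gap column at the front is the worst case:
    # list.insert(0, ...) shifts every downstream element, so each
    # indel costs O(contig_length).
    contig = ["A"] * contig_length
    start = time.perf_counter()
    for _ in range(n_indels):
        contig.insert(0, "*")   # '*' standing in for a gap column
    return time.perf_counter() - start

for length in (100_000, 1_000_000, 10_000_000):
    print(f"{length:>10,} columns: {time_insertions(length):.3f} s "
          f"for 100 insertions")

The 10-megabase case comes out roughly a hundred times slower per indel 
than the 100-kilobase case, which is exactly the scaling wall described 
above; a gap buffer or rope structure would make such edits cheap.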

This is a weakness that has bitten me in the past few months, especially 
when mapping IonTorrent data ... or doing de-novo assembly of 454 and 
Illumina hybrids. So I have that on my radar (and it's quite high on the 
priority list).

Back to your problem: sorry, there is absolutely nothing you can do from 
the parameter side. The only work-around I can propose at the moment is 
to sub-divide your reference genome into chunks of 10 to 20 megabases. 
Not really an option, I know.
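
For completeness, splitting a FASTA reference into fixed-size chunks 
could look something like the sketch below (the file names and the 
10-megabase chunk size are illustrative assumptions, not anything MIRA 
prescribes):

CHUNK_SIZE = 10_000_000  # 10 megabases per chunk (assumed value)

def read_fasta(path):
    # Yield (header, sequence) pairs from a FASTA file.
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def write_chunks(path, out_prefix, chunk_size=CHUNK_SIZE):
    # Write each input sequence as one or more FASTA files of at most
    # chunk_size bases, wrapped at 60 characters per line.
    n = 0
    for header, seq in read_fasta(path):
        for start in range(0, len(seq), chunk_size):
            n += 1
            piece = seq[start:start + chunk_size]
            with open(f"{out_prefix}_{n:04d}.fasta", "w") as out:
                out.write(f">{header}_chunk{start // chunk_size + 1}\n")
                for i in range(0, len(piece), 60):
                    out.write(piece[i:i + 60] + "\n")

write_chunks("reference.fasta", "reference_chunk")  # hypothetical names

One caveat with this approach: reads spanning a chunk boundary can no 
longer map, so in practice the chunks would need to overlap by at least 
a read length and the results would have to be de-duplicated afterwards, 
which is part of why it is "not really an option".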

Best,
  Bastien
