[mira_talk] Mapping on a large and repeated genome

  • From: Magalie.LEVEUGLE@xxxxxxxxxxxx
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 18 Oct 2011 18:23:34 +0200

Hi,

I am trying to use mira 3.4 to map 1.5 million 454 Titanium reads onto a
large (2 Gb), mostly repetitive plant genome.

I know that mira is not currently optimised for this type of genome, but
I've tried a few combinations of parameters. The mapping finally seems to
run, but it is taking forever, so I would like to know how I could improve
my settings.

At first, I tried mapping on the "normal" (unmasked) genome, but memory
usage became very high (around 330 GB), and after 5 days the mapping was
still stuck in the first contig. As it was not possible to keep this going,
I modified my command line to include the --highlyrepetitive option as
shown below, but mira stopped because of megahubs. I changed the nrr
parameter from 10 to 5 and reduced mhpr to 100, without more success.

mira -project=m04 -job=mapping,genome,accurate,454 --notraceinfo 
--highlyrepetitive -GE:not=12 -SB:lsd=yes:bsn=m_v1:bft=fasta:bbq=30 
-SK:pr=95:mhpr=100:not=12:nrr=5 454_SETTINGS -LR:ft=fasta -AL:mo=20:mrs=95 
   1>> m04.log

I have not increased the mmhr parameter for now, but tried another approach
to avoid the large portion of repetitive sequences.
As my reads are theoretically targeted at low-frequency sequences, I
decided to do the mapping on a masked version of my genome. This time it
worked, but after 6 days of computing, it is only starting to map the
second chromosome.

Here is my command line:

mira -project=m04_wgm -job=mapping,genome,accurate,454 --notraceinfo 
-GE:not=12 -SB:lsd=yes:bsn=m_v1:bft=fasta:bbq=30  -SK:pr=95:not=12 
454_SETTINGS -LR:ft=fasta -AL:mo=20:mrs=95    1>> m04_wgm.log


I searched the archive of the mailing list and found a thread about a
similar problem ("where is my assembly at?"), so I looked at my log file
and found this:


[522402] +++a++a++a+aa++a+++a+a++a+aa+++++a+a++aa++++aa+++++++++++a++ 
300282317       4 / 1345519 / 29
[522444] a+++++a+a++++++++a+++++++++aaa+a+++++a+a++aaa+a+aa+++aaa++++ 
300282320       3 / 844126 / 30
[522485] a++++a+aa+aa++++++a+++++++aa+++++++++a+++++aaaa+++a+++a+++++ 
300282325       8 / 714580 / 35
[522529] ++aa+++a++a+++++++aa+++a+++++++a+aa+a++++++++++++a++++aa++++ 
300282332       4 / 1190728 / 32
[522575] a++a++a+aa++a+++aa+a+++a+++a+++++a+a+aa+a+++++a+++++++a+++++ 
300282338       4 / 630878 / 37
[522617] +a+aa+a++++aaa++a++a++aaaa++a+a++aa++a++a++a++++++a+a+++a+++ 
300282342       3 / 522819 / 28
[522654] ++++++++++++++++a++a+++a+++aa+++a+aa++++++++aaa+a++aa+++a+++ 
300282344       3 / 218685 / 28

So I suspect the delay is caused by those "a" characters and by the large
numbers in the slash-separated fields. I've also noticed a few lines with a
very large number in the first position:

[410891] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
300239183       38304 / 9405835 / 0
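To spot such lines without reading the log by eye, here is a minimal Python
sketch that extracts these progress records. It only assumes the layout
visible in the excerpts above (a bracketed number, a run of "+" and "a"
characters, one number, then three slash-separated numbers). The function
names and dictionary keys are hypothetical, and no meaning is assumed for
any of the fields.

```python
import re

# One progress record, as it appears in the excerpts above. \s+ also matches
# a newline, so records wrapped over two lines are handled as well.
RECORD = re.compile(
    r"\[(\d+)\]\s+([+a]+)\s+(\d+)\s+(\d+)\s*/\s*(\d+)\s*/\s*(\d+)"
)

def parse_progress(text):
    """Return one dict per progress record found in the log text."""
    records = []
    for m in RECORD.finditer(text):
        marks = m.group(2)
        records.append({
            "bracket": int(m.group(1)),       # number inside [...]
            "a_count": marks.count("a"),      # how many "a" characters
            "plus_count": marks.count("+"),   # how many "+" characters
            "counter": int(m.group(3)),       # number before the slashes
            # The three slash-separated numbers, kept unnamed on purpose.
            "values": tuple(int(m.group(i)) for i in (4, 5, 6)),
        })
    return records

def slowest(records, n=5):
    """Return the n records with the largest middle slash-separated value."""
    return sorted(records, key=lambda r: r["values"][1], reverse=True)[:n]
```

For example, slowest(parse_progress(open("m04_wgm.log").read())) would list
the five records with the largest middle value.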

As each trial takes at least a few days, I would like to know what I could
change in my parameters now, or whether I should try the development
version that was given in the other thread.

I can send my log file if it helps.

Thank you very much for reading,

Magalie

--
Magalie Leveugle, PhD
Research Scientist
Bioinformatics Team - Upstream Genomics Group
BIOGEMMA
Site de La Garenne
CS 90126
63720 CHAPPES, FRANCE
Tel : 04 73 67 88 57
