[mira_talk] Re: assembly parameters and more
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Thu, 12 Mar 2009 21:29:11 +0100
On Thursday 12 March 2009 Davide Sassera wrote:
> I'm sorry but I'm not sure what file you mean. Is it the EdIt.log? that
> file is empty. or is it a file in the log folder? I'm pretty sure you
> already explained this in an older message but I really cannot find it,
> sorry.
When you start mira with
mira ...someoptions... >&log_assembly.txt
then "log_assembly.txt" is the output log.
> Ok, I tried exactly as you write but it stops because of 0.3% megahubs,
> so I guess I'm in deep trouble?
> I'm currently trying with mnr on, but I was not sure about this, maybe I
> should have set the allowed hubs to 0.4%?
> Now it's in the preassembly step and it is already swapping 1Gig
Well, stop that assembly too; it won't be really useful. Not at all, if my
guess is correct.
You are in deep trouble. While I cannot give you a definitive number, megahubs
occur when a read has thousands of possible overlaps with other reads. While
this can happen for some eukaryotic projects, I've never seen a prokaryote
where this could be even remotely possible. Every time this happened, it was
because the data pre-processing had a glitch.
Two possibilities:
1) you have an almost impossible organism
2) the sequences you have are contaminated or not correctly clipped
Btw, my money is on the second. But should it be the first and we can get it
tackled, reserve yourself a place in Nature or Science :-)
Now, what can you do?
You have to find out what is causing so much trouble and which sequences are
highly repetitive in your data set. Well, I have also had this kind of problem
at one time or another, so there are a few things MIRA can help you with.
When using -SK:mnr (which you did), two files are written into the log
directory:
mira_int_skimmarknastyrepeats_hist_preassembly.0.lst
mira_int_skimmarknastyrepeats_nastyseq_preassembly.0.lst
Please compress the "hist" file and send it to me. It just contains histogram
numbers I'd like to have a look at, and I may make a graph or two from them to
show you for comparison.
The "nastyseq" file you will have to have a look at yourself and try to find
out what makes your data nasty. It's a key-value file with the name of the
sequence as "key" and the nasty sequence as "value". "Nasty" in this case means
that the sequences are made of k-mers belonging to the top 10% of all k-mers in
the project.
It looks a bit like this:
sequence1 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
sequence2 CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
sequence3 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
etc.
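
To get a first impression of what dominates such a file, one option is to count
the most frequent k-mers across all nasty sequences, treating a k-mer and its
reverse complement as the same thing. The following is only a minimal sketch and
not something MIRA provides: the function names, the k-mer length and the
assumption that each line is "name, whitespace, sequence" are all illustrative.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes plain ACGT input).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_nastyseq(lines):
    """Yield (name, sequence) pairs from whitespace-separated key-value lines.

    Assumption: the first column is the read name, the second the sequence.
    Lines with fewer than two columns are skipped.
    """
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield parts[0], parts[1]


def top_kmers(records, k=12, n=5):
    """Count canonical k-mers over all records and return the n most common.

    A k-mer and its reverse complement are counted under the same key
    (whichever of the two sorts first). k=12 is an arbitrary choice.
    """
    counts = Counter()
    for _name, seq in records:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts.most_common(n)
```

In the example above, sequence2 is the reverse complement of the repeat seen in
sequence1 and sequence3, so all three would end up contributing to the same
handful of k-mers.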
You will need to search some databases with the "nasty" sequences. You might
find vector sequences, adaptor sequences or even human sequences (don't laugh,
this type of contamination happens quite easily with data from new sequencing
technologies). After a while you will get a feeling for what constitutes the
largest part of your problem and can start to think of taking countermeasures
like filtering, clipping, masking etc. ... which I will describe once you've
found out :-)
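
Most database search tools expect FASTA input, so the nasty sequences need to be
converted first. A minimal sketch of such a conversion is shown here, again
assuming one "name, whitespace, sequence" pair per line. The function name and
the line width are made up for illustration, and identical sequences are written
only once to keep the number of searches down.

```python
def nastyseq_to_fasta(lines, width=60):
    """Convert 'name sequence' lines into FASTA text.

    Assumptions: first column is the read name, second is the sequence.
    Exact duplicate sequences are emitted only once (first name wins).
    """
    seen = set()
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        name, seq = parts[0], parts[1]
        if seq in seen:
            continue
        seen.add(seq)
        out.append(">" + name)
        # Wrap the sequence at a fixed width, as is customary for FASTA.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(entry + "\n" for entry in out)
```

It could be used roughly like this, where the path to the log directory is a
placeholder you have to adapt:

```python
# with open("path/to/log/mira_int_skimmarknastyrepeats_nastyseq_preassembly.0.lst") as fin:
#     fasta_text = nastyseq_to_fasta(fin)
```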
Regards,
Bastien
--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html