[mira_talk] Re: assembly parameters and more

On Thursday 12 March 2009 Davide Sassera wrote:
> I'm sorry but I'm not sure what file you mean. Is it the EdIt.log? that
> file is empty. or is it a file in the log folder? I'm pretty sure you
> already explained this in an older message but I really cannot find it,
> sorry.

When you start mira with
                mira ...someoptions... >&log_assembly.txt
then "log_assembly.txt" is the output log.

> Ok, I tried exactly as you write but it stops because of 0.3% megahubs,
> so I guess I'm in deep trouble?
> I'm currently trying with mnr on, but I was not sure about this, maybe I
> should have set the allowed hubs to 0.4%?
> Now it's in the preassembly step and it is already swapping 1Gig

Well, stop that assembly too; it won't be really useful. Not at all, in fact, 
if my guess is correct.

You are in deep trouble. While I cannot give you a definitive number, megahubs 
occur when a read has thousands of possible overlaps with other reads. While 
this can happen for some eukaryotic projects, I've never seen a prokaryote 
where this could be even remotely possible. Every time this happened, it was 
because the data pre-processing had a glitch.

Two possibilities:
1) you have an almost impossible organism
2) the sequences you have are contaminated or not correctly clipped

Btw, my money is on the second. But should it be the first and we can get it 
tackled, reserve yourself a place in Nature or Science :-)

Now, what can you do?

You have to find out what is causing so much trouble and which sequences are 
highly repetitive in your data set. Well, I have also had this kind of problem 
at one time or another, so there are a few things where MIRA helps you.

When using -SK:mnr (which you did), two files are written into the log 
directory:

     mira_int_skimmarknastyrepeats_hist_preassembly.0.lst
     mira_int_skimmarknastyrepeats_nastyseq_preassembly.0.lst

Please pack up the "hist" file and send it to me. It just contains histogram 
numbers I'd like to have a look at, and possibly I'll make a graph or two from 
them to show you for comparison.

The "nastyseq" file you will have to look at yourself to try to find out what 
makes your data nasty. It's a key-value file with the name of the sequence as 
"key" and the nasty sequence as "value". "Nasty" in this case means that the 
sequences are made of k-mers belonging to the top 10% of all k-mers in the 
project.

It looks a bit like this:

sequence1     GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
sequence2     CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
sequence3     GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
etc.
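
To get a first overview of such a file, something like the following Python 
sketch might help. It is not part of MIRA. It assumes each line holds a read 
name, whitespace, and then the sequence, as in the excerpt above, and it 
assumes the nasty stretches are simple tandem repeats, which may well not hold 
for your data. The function names are made up for illustration. For each 
sequence it looks for the shortest repeat unit, folds rotations and the reverse 
complement into one canonical form, and counts how often each unit occurs.

```python
from collections import Counter

# Complement table for plain DNA; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_nastyseq(lines):
    """Yield (read_name, sequence) from whitespace-separated lines.

    Assumed layout: name first, sequence second; extra fields are ignored.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[0], fields[1].upper()


def smallest_period(seq, max_period=50):
    """Return the shortest unit whose repetition gives seq, else None."""
    for p in range(1, min(max_period, len(seq) // 2) + 1):
        if all(seq[i] == seq[i - p] for i in range(p, len(seq))):
            return seq[:p]
    return None


def canonical_unit(unit):
    """Map all rotations of a unit and of its reverse complement to one form."""
    revcomp = unit.translate(_COMPLEMENT)[::-1]
    rotations = [u[i:] + u[:i] for u in (unit, revcomp) for i in range(len(u))]
    return min(rotations)


def tally_repeat_units(lines):
    """Count canonical repeat units over all sequences in the file."""
    counts = Counter()
    for _name, seq in read_nastyseq(lines):
        unit = smallest_period(seq)
        counts[canonical_unit(unit) if unit else "(no simple repeat)"] += 1
    return counts
```

Calling tally_repeat_units(open(filename)) and printing the result of 
.most_common(20) should show whether a handful of repeat units dominate. With 
this canonical form, the three example lines above would fall into the same 
group, because the second one is the reverse complement of the others.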

You will need to search some databases with the "nasty" sequences. You might 
find vector sequences, adaptor sequences or even human sequences (don't laugh, 
this type of contamination happens quite easily with data from new sequencing 
technologies). After a while you will get a feeling for what constitutes the 
largest part of your problem and can start to think of taking countermeasures 
like filtering, clipping, masking etc. ... which I will describe once you've 
found out :-)
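
Most database search tools take FASTA as input, so a small conversion step is 
likely needed first. The Python sketch below is one way to do it, under the 
same assumption about the file layout (name, whitespace, sequence). The 
function name is hypothetical, and dropping duplicate sequences is only my 
suggestion to keep the number of searches down.

```python
def nastyseq_to_fasta(lines, width=60):
    """Convert 'name sequence' lines to FASTA text, one record per
    distinct sequence (the first read name seen for it is kept)."""
    seen = set()
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, seq = fields[0], fields[1]
        if seq in seen:
            continue  # identical sequence already written
        seen.add(seq)
        out.append(">" + name)
        # Wrap the sequence at a fixed width, as is customary for FASTA.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""
```

Writing the returned string to a file gives something you can feed to the 
search tool of your choice.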

Regards,
  Bastien


-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
