[mira_talk] Re: 454 cleaning

  • From: Stephen LeGrande <stlegrande@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 22 Nov 2010 13:56:14 +0000

On Sat, 2010-11-20 at 16:04 +0100, Bastien Chevreux wrote:
> 
> Then I suppose that not all vectors were found and marked by SSAHA2. I've 
> noticed that SSAHA2 is not always finding everything and I got reports from 
> other people stating the same. Which is bothersome.


I also noticed that SSAHA2 is prone to randomly overlooking vector
sequences that are specified in the vector file, especially when you
are looking for short adapter/vector stretches.

I am dealing with 454 reads that have a 9 nt long adapter part at their
5' end (4 bases come from the Titanium A sequencing adapter and 5 bases
from a labelling tag).

When investigating the SSAHA2 output files I noticed that the success
rate of reporting this motif was about 60 per cent.
(For more details, see my posting in this newsgroup from 18th September
2010.)
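
To put a number like that 60 per cent in context, one needs the count of
reads that actually carry the adapter at their 5' end. A minimal sketch
of that count, assuming a plain FASTA file; the function name
count_prefix_reads is hypothetical, and the adapter string is whatever
your 9 nt motif happens to be:

```shell
# count_prefix_reads ADAPTER FASTA   (hypothetical helper name)
# Prints "<reads starting with ADAPTER> <total reads>".
# Sequence lines of one record are joined, so wrapped FASTA works too.
count_prefix_reads() {
    awk -v ad="$1" '
        function tally() {
            if (name != "") {
                total++
                # index() == 1 means the adapter sits at the very 5'"'"' end
                if (index(toupper(seq), toupper(ad)) == 1) hits++
            }
        }
        /^>/ { tally(); name = $0; seq = ""; next }
             { seq = seq $0 }
        END  { tally(); print hits + 0, total + 0 }
    ' "$2"
}
```

Only exact matches are counted here, so reads with a sequencing error
inside the adapter are missed; the result is a lower bound.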

From a discussion with the present SSAHA2 developer, Hannes Ponstingl,
I learned the following:

(1) SSAHA2 is not designed for identifying sequences shorter than 15 
nucleotides. 

(2) SSAHA2 uses (hard-wired) heuristics to speed up mapping of
sequencing reads against genomic reference sequences. These heuristics
prevent some very short matches from being detected when there are
other, better matches.

(3) Fortunately, there is another alignment program called SMALT
(being developed by the same author), which is more suitable for
identifying short vector sequences in sequencing reads:

        http://www.sanger.ac.uk/resources/software/smalt/  

(4) SMALT has several output format options, including SSAHA2.
However, the "ssaha" format output of SMALT is slightly different from
the native SSAHA2 output.
Luckily, the latest MIRA versions (3.2.1rc2 and 3.3.4) can handle the
SMALT output files as well.
(Compliments to Bastien for the quick response!)
 

Some hints for using SMALT in MIRA:

Vector identification with SMALT is a two-step procedure.

In the first step, an index (idx) of search words has to be built from
the vector and adapter sequences (smalt index). In the second step, the
sequencing reads are mapped against that index (smalt map):

        smalt index -k 7 -s 1 idx vector.fasta

        smalt map -f ssaha -d -1 -m 7 idx seqs.fasta  > seqs.ssaha_out

(This assumes that your vector data is in /your/path/vector.fasta
and your sequence reads are in /your/path/seqs.fasta.)

The SMALT output file (seqs.ssaha_out) can now be used along with
the -CL:msvs=yes option in MIRA.
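
For repeated use, the two SMALT steps can be wrapped in a small shell
function. This is only a sketch: the function name smalt_screen and the
variable names are made up for illustration, the file names are the
placeholders used above, and the smalt options are copied unchanged from
the two command lines above.

```shell
# Placeholder file names; override them in the environment as needed.
VECTORS=${VECTORS:-vector.fasta}
READS=${READS:-seqs.fasta}
IDX=${IDX:-idx}
OUT=${OUT:-seqs.ssaha_out}

# smalt_screen (hypothetical name): run both steps, stop on the first error.
smalt_screen() {
    # Step 1: build the word index from the vector/adapter sequences.
    smalt index -k 7 -s 1 "$IDX" "$VECTORS" || return 1
    # Step 2: map the reads against that index. Write to a temporary
    # file first so that a failed run leaves no truncated result behind.
    smalt map -f ssaha -d -1 -m 7 "$IDX" "$READS" > "$OUT.tmp" || {
        rm -f "$OUT.tmp"
        return 1
    }
    mv "$OUT.tmp" "$OUT"
}
```

Calling smalt_screen in the directory holding the two FASTA files then
leaves the result in seqs.ssaha_out, or no file at all if SMALT failed.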


IMPORTANT: The present (0.4.1) version of SMALT has a bug that often
causes a segmentation fault when the "-f ssaha" (SSAHA output format)
option is set.

Get the latest (bug fixed) binaries from here:

   ftp://ftp.sanger.ac.uk/pub/hp3/smalt-0.4.1.1.tgz

(Or wait until the next version, 0.4.2, is publicly available.)
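
If you script the screening, a guard against the buggy release may be
worth having. The sketch below only compares two dotted version strings;
the function name version_ge is made up, and how you obtain the version
of your installed SMALT is left open here, since I have not checked what
the binary prints.

```shell
# version_ge A B (hypothetical helper): succeed if dotted version A is
# greater than or equal to B. Missing fields count as 0, so 0.4.1 is
# treated as 0.4.1.0. Only numeric fields are supported.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}
```

With the installed version in a variable, say $installed, a check such
as  version_ge "$installed" 0.4.1.1 || echo "SMALT too old" >&2  would
flag the 0.4.1 release before "-f ssaha" is used.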


I made a test run with MIRA V3.2.1rc2 using 454 reads with SMALT
clipping. Everything ran smoothly. The assembly of 516000 reads took
7 hours and 15 minutes on a Linux cluster node with 72 GB of RAM.


Cheers

Stephen



-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
