On Sat, 2010-11-20 at 16:04 +0100, Bastien Chevreux wrote:
>
> Then I suppose that not all vectors were found and marked by SSAHA2. I've
> noticed that SSAHA2 is not always finding everything and I got reports from
> other people stating the same. Which is bothersome.

I have also noticed that SSAHA2 is prone to randomly overlooking vector
sequences that are specified in the vector file, especially when you are
looking for short adapter/vector stretches.

I am dealing with 454 reads that have a 9 nt long adapter part at their 5'
end. (Of these, 4 bases come from the Titanium A sequencing adapter and 5
bases from a labelling tag.) When investigating the SSAHA2 output files I
noticed that the success rate of reporting this motif was about 60 per cent.
(For more details see my posting in this newsgroup from 18th September 2010.)

From a discussion with the present SSAHA2 developer, Hannes Ponstingl, I
learned the following:

(1) SSAHA2 is not designed for identifying sequences shorter than 15
    nucleotides.

(2) SSAHA2 uses (hard-wired) heuristics to speed up the mapping of
    sequencing reads against genomic reference sequences. These heuristics
    prevent some very short matches from being detected when there are
    other, better matches.

(3) Fortunately, there is another aligner called SMALT (being developed by
    the same author), which is more suitable for identifying short vector
    sequences in the sequencing reads:
    http://www.sanger.ac.uk/resources/software/smalt/

(4) SMALT has several output format options, including SSAHA2. However, the
    "ssaha" format output file of SMALT is slightly different from the
    native SSAHA2 file. Luckily, the latest MIRA versions (3.2.1rc2 and
    3.3.4) can handle the SMALT output files as well. (Compliments to
    Bastien for the quick response!)

Some hints for using SMALT in MIRA:

Vector identification with SMALT is a two-step procedure.
In the first step an index (idx) of search words has to be built from the
vector/adapter sequences (smalt index); in the second step the sequencing
reads are mapped against that index (smalt map):

  smalt index -k 7 -s 1 idx vector.fasta
  smalt map -f ssaha -d -1 -m 7 idx seqs.fasta > seqs.ssaha_out

(This assumes that you have your vector data in /your/path/vector.fasta and
your sequence reads in /your/path/seqs.fasta.)

The output file of SMALT (seqs.ssaha_out) can now be used along with the
-CL:msvs=yes option in MIRA.

IMPORTANT: The present (0.4.1) version of SMALT has a bug that often causes
a segmentation fault when the "-f ssaha" (SSAHA output format) option is
set. Get the latest (bug-fixed) binaries from here:
ftp://ftp.sanger.ac.uk/pub/hp3/smalt-0.4.1.1.tgz
(Or wait until the next version, 0.4.2, is publicly available.)

I made a test run with MIRA V3.2.1rc2 using 454 reads with SMALT clipping.
Everything ran smoothly. The assembly of 516000 reads took 7 hours and 15
minutes on a Linux cluster node with 72 GB of RAM.

Cheers,
Stephen
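
P.S. For those who screen several read sets, the two SMALT steps can be
wrapped in a small shell function. What follows is only a minimal sketch,
not a tested pipeline: the names run_smalt_clip and SMALT_BIN are
hypothetical and introduced here for illustration, and the smalt options are
copied unchanged from the two commands in this mail. It assumes a POSIX
shell and that the smalt binary is on your PATH (or that SMALT_BIN points
to it).

```shell
#!/bin/sh
# Sketch only: run_smalt_clip and SMALT_BIN are hypothetical names.
# The smalt options are the ones quoted in this mail, nothing more.
SMALT_BIN="${SMALT_BIN:-smalt}"

run_smalt_clip() {
    vectors="$1"      # FASTA file with vector ends / adapters
    reads="$2"        # FASTA file with the sequencing reads
    out="$3"          # SSAHA-format output file to hand to MIRA
    idx="${4:-idx}"   # name stem for the SMALT index files

    # Stop early if an input file cannot be read.
    for f in "$vectors" "$reads"; do
        [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    done

    # Step 1: build the word index (word length 7, step size 1).
    "$SMALT_BIN" index -k 7 -s 1 "$idx" "$vectors" || return 1

    # Step 2: map the reads against the index, writing "ssaha" format.
    "$SMALT_BIN" map -f ssaha -d -1 -m 7 "$idx" "$reads" > "$out" || return 1
}
```

After sourcing the file, a call would look like
"run_smalt_clip /your/path/vector.fasta /your/path/seqs.fasta seqs.ssaha_out";
the resulting file is then used with -CL:msvs=yes in MIRA as described above.
The function returns a non-zero status if an input file is missing or if
either smalt step fails, so a segmentation fault in the map step does not
go unnoticed.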