Thanks, Laurent,

It is hard to tell if SMALT is doing any better with those settings.

Done merging SSAHA2 vector screen data.
Clipping or tagging poly A/T stretches at ends of reads ... done.
======================================
Pool statistics:
Backbones: 0    Backbone rails: 0

                Sanger  454     PacBio  Solexa  SOLiD
                ----------------------------------------
Total reads     0       906300  0       0       0
Reads wo qual   0       0       0       0       0
Used reads      0       885545  0       0       0
Avg tot rlen    0       536     0       0       0
Avg rlen used   0       333     0       0       0

With strain     0       0       0       0       0
W/o clips       0       1175    0       0       0
===================================

Compared with

Done merging SSAHA2 vector screen data.
Clipping or tagging poly A/T stretches at ends of reads ... done.
==============================================
Pool statistics:
Backbones: 0    Backbone rails: 0

                Sanger  454     PacBio  Solexa  SOLiD
                ----------------------------------------
Total reads     0       906300  0       0       0
Reads wo qual   0       0       0       0       0
Used reads      0       878780  0       0       0
Avg tot rlen    0       536     0       0       0
Avg rlen used   0       327     0       0       0

With strain     0       0       0       0       0
W/o clips       0       1175    0       0       0
==================================

In any case, the vector is still producing chimera-looking contigs, judging by:

cat SRR054580_Asha_assembly/SRR054580_Asha_d_results/SRR054580_Asha_out.unpadded.fasta | seqs_filter_by_len -s 100 | grep AAGCAGTGGTATCAACGCAGAGTACGGGGG | wc -l
3498

The adapter (the grep pattern above) is embedded in this chimera:

CAACTCCAACGCATGAATGCCCTCAAGCAGTGGTATCAACGCAGAGTACGGGGGGTGGGTTCATGAGACATGGAACCCTA

I am wondering if MIRA is reconstructing noisy adapters that the aligners
aren't able to find, almost doing too good a job by assembling even the
faintest adapter sequence signal.

In any case, I think I will try crossmatch.

Sincerely yours,
Robin

On Mon, Nov 22, 2010 at 6:56 AM, Stephen LeGrande <stlegrande@xxxxxxxxx> wrote:

> On Sat, 2010-11-20 at 16:04 +0100, Bastien Chevreux wrote:
> >
> > Then I suppose that not all vectors were found and marked by SSAHA2.
> > I've noticed that SSAHA2 is not always finding everything, and I got
> > reports from other people stating the same. Which is bothersome.
>
> I also noticed that SSAHA2 is prone to randomly overlook vector sequences
> that are specified in the vector file, especially if you are looking for
> short adapter/vector stretches.
>
> I am dealing with 454 reads that have a 9 nt long adapter part on their
> 5' end (4 of those bases come from the Titanium A sequencing adapter and
> 5 bases from a labelling tag).
>
> When investigating the SSAHA2 output files I noticed that the success
> rate of reporting this motif was about 60 per cent.
> (For more details see my posting in this newsgroup from 18th September
> 2010).
>
> From a discussion with the present SSAHA2 developer, Hannes Ponstingl, I
> learned the following:
>
> (1) SSAHA2 is not designed for identifying sequences shorter than 15
> nucleotides.
>
> (2) SSAHA2 uses (hard-wired) heuristics to speed up mapping of sequencing
> reads against genomic reference sequences. These heuristics prevent some
> very short matches from being detected when there are other, better
> matches.
>
> (3) Fortunately, there is another aligner program called SMALT (being
> developed by the same author), which is more suitable for identifying
> short vector sequences in the sequencing reads:
>
> http://www.sanger.ac.uk/resources/software/smalt/
>
> (4) SMALT has several output format options, including SSAHA2.
> However, the "ssaha" format output file of SMALT is slightly different
> from the native SSAHA2 file.
> Luckily, the latest MIRA versions (3.2.1rc2 and 3.3.4) can handle the
> SMALT output files as well.
> (Compliments to Bastien for the quick response!)
>
>
> Some hints for using SMALT in MIRA:
>
> Identifying vectors with SMALT is a two-step procedure.
>
> During the first step, an index (idx) of search words has to be built
> from the vector sequences (smalt index).
> Then the sequence reads are mapped against that index of vector ends and
> adapters (smalt map):
>
> smalt index -k 7 -s 1 idx vector.fasta
>
> smalt map -f ssaha -d -1 -m 7 idx seqs.fasta > seqs.ssaha_out
>
> (This assumes that your vector data are in /your/path/vector.fasta
> and your sequence reads in /your/path/seqs.fasta)
>
> The output file of SMALT (seqs.ssaha_out) can now be used along with
> the -CL:msvs=yes option in MIRA.
>
>
> IMPORTANT: The present (0.4.1) version of SMALT has a bug that often
> causes a segmentation fault when the "-f ssaha" (SSAHA output format)
> option is set.
>
> Get the latest (bug fixed) binaries from here:
>
> ftp://ftp.sanger.ac.uk/pub/hp3/smalt-0.4.1.1.tgz
>
> (Or wait until the next version (0.4.2) is publicly available.)
>
>
> I made a test run with MIRA V3.2.1rc2 using 454 reads with SMALT clipping.
> Everything ran smoothly. The assembly of 516000 reads took 7 hours and
> 15 minutes on a Linux cluster node with 72 GB of RAM.
>
>
> Cheers
>
> Stephen
>
>
>
> --
> You have received this mail because you are subscribed to the mira_talk
> mailing list. For information on how to subscribe or unsubscribe, please
> visit http://www.chevreux.org/mira_mailinglists.html
>
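
P.S. A note on the adapter count at the top of this message: grep works line by line and on one strand only, so if the unpadded FASTA has wrapped sequence lines, or if some contigs carry the adapter as its reverse complement, the number could be an underestimate. Below is a minimal Python sketch of a count that joins wrapped lines and checks both strands. The function names are made up for illustration, and min_len=100 is only my assumption about what "seqs_filter_by_len -s 100" does (keep sequences of at least 100 bases).

```python
# Adapter pattern taken from the grep command earlier in this message.
ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACGGGGG"

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Wrapped sequence lines are joined, so a motif that spans a line
    break is still found.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_adapter_hits(lines, adapter=ADAPTER, min_len=100):
    """Count sequences of at least min_len bases that contain the adapter
    on either strand. min_len is an assumed stand-in for the length filter
    in the original shell pipeline."""
    forward = adapter.upper()
    reverse = reverse_complement(forward)
    hits = 0
    for _header, seq in read_fasta(lines):
        seq = seq.upper()
        if len(seq) >= min_len and (forward in seq or reverse in seq):
            hits += 1
    return hits
```

To use it, pass an open file handle for the unpadded FASTA, e.g. count_adapter_hits(open(path)). It counts contigs rather than matching lines, so the result is not expected to be identical to the grep | wc -l figure.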