[mira_talk] Re: 454 cleaning

  • From: Robin Kramer <kodream@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 30 Nov 2010 09:25:45 -0700

Thanks, Laurent,

It is hard to tell if SMALT is doing any better with those settings.

Done merging SSAHA2 vector screen data.


Clipping or tagging poly A/T stretches at ends of reads ... done.

======================================
Pool statistics:
Backbones: 0    Backbone rails: 0

                Sanger  454     PacBio  Solexa  SOLiD
                ----------------------------------------
Total reads     0       906300  0       0       0
Reads wo qual   0       0       0       0       0
Used reads      0       885545  0       0       0
Avg tot rlen    0       536     0       0       0
Avg rlen used   0       333     0       0       0

With strain     0       0       0       0       0
W/o clips       0       1175    0       0       0
===================================

Compared with

Done merging SSAHA2 vector screen data.


Clipping or tagging poly A/T stretches at ends of reads ... done.

==============================================
Pool statistics:
Backbones: 0    Backbone rails: 0

                Sanger  454     PacBio  Solexa  SOLiD
                ----------------------------------------
Total reads     0       906300  0       0       0
Reads wo qual   0       0       0       0       0
Used reads      0       878780  0       0       0
Avg tot rlen    0       536     0       0       0
Avg rlen used   0       327     0       0       0

With strain     0       0       0       0       0
W/o clips       0       1175    0       0       0
==================================

In any case, the vector is still producing chimera-looking contigs,
judging by this count:

   cat SRR054580_Asha_assembly/SRR054580_Asha_d_results/SRR054580_Asha_out.unpadded.fasta \
     | seqs_filter_by_len -s 100 \
     | grep AAGCAGTGGTATCAACGCAGAGTACGGGGG | wc -l
   3498

The adapter (AAGCAGTGGTATCAACGCAGAGTACGGGGG) sits in the middle of this
chimera:
CAACTCCAACGCATGAATGCCCTCAAGCAGTGGTATCAACGCAGAGTACGGG
GGGTGGGTTCATGAGACATGGAACCCTA
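
A caveat on the grep count: grep works line by line, so on a line-wrapped FASTA file it misses adapters that straddle a line break (as in the chimera shown here), and it only sees the forward orientation. Below is a minimal sketch of a wrap-safe count, assuming awk, rev and tr are available. The function name count_adapter_hits is made up for illustration, and the minimum-length argument is only an assumed stand-in for what "seqs_filter_by_len -s 100" does.

```shell
#!/bin/sh
# Hypothetical helper: count FASTA records of at least MINLEN bases that
# contain ADAPTER or its reverse complement.
# usage: count_adapter_hits FASTA ADAPTER [MINLEN]
count_adapter_hits() {
    fasta=$1
    adapter=$2
    minlen=${3:-100}
    # Reverse complement of the adapter (uppercase A, C, G, T only).
    rc=$(printf '%s\n' "$adapter" | rev | tr 'ACGT' 'TGCA')
    awk -v fwd="$adapter" -v rc="$rc" -v minlen="$minlen" '
        # Score the record collected so far, then reset the buffer.
        function tally() {
            if (length(seq) >= minlen && (index(seq, fwd) || index(seq, rc))) hits++
            seq = ""
        }
        /^>/ { tally(); next }       # header line: previous record is complete
        { seq = seq toupper($0) }    # join wrapped sequence lines
        END { tally(); print hits + 0 }
    ' "$fasta"
}
```

Usage would look like this (it counts contigs, not matching lines, so the number need not equal the grep result):

   count_adapter_hits SRR054580_Asha_assembly/SRR054580_Asha_d_results/SRR054580_Asha_out.unpadded.fasta AAGCAGTGGTATCAACGCAGAGTACGGGGG 100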

I am wondering whether MIRA is reconstructing noisy adapters that the
aligners cannot find, doing almost too good a job of assembling even the
faintest adapter signal.

In any case, I think I will try crossmatch.

Sincerely yours,

Robin

On Mon, Nov 22, 2010 at 6:56 AM, Stephen LeGrande <stlegrande@xxxxxxxxx> wrote:

> On Sat, 2010-11-20 at 16:04 +0100, Bastien Chevreux wrote:
> >
> > Then I suppose that not all vectors were found and marked by SSAHA2. I've
> > noticed that SSAHA2 is not always finding everything and I got reports
> > from other people stating the same. Which is bothersome.
>
>
> I also noticed that SSAHA2 is prone to randomly overlooking vector
> sequences that are specified in the vector file,
> especially if you are looking for short adapter/vector stretches.
>
> I am dealing with 454 reads that have a 9 nt long adapter part on their
> 5' end.
> (Of these, 4 bases come from the Titanium A sequencing adapter and
> 5 bases from a labelling tag.)
>
> When investigating the SSAHA2 output files I noticed that this motif
> was reported in only about 60 per cent of cases.
> (For more details see my posting to this list from 18th September
> 2010.)
>
> From a discussion with the present SSAHA2 developer, Hannes Ponstingl, I
> learned the following:
>
> (1) SSAHA2 is not designed for identifying sequences shorter than 15
> nucleotides.
>
> (2) SSAHA2 uses (hard-wired) heuristics to speed up mapping of
> sequencing reads against genomic reference sequences. These heuristics
> prevent some very short matches from being detected when there are
> other, better matches.
>
> (3) Fortunately, there is another aligner program called SMALT
> (being developed by the same author),
> which is better suited to identifying short vector sequences in the
> sequencing reads:
>
>        http://www.sanger.ac.uk/resources/software/smalt/
>
> (4) SMALT has several output format options, including SSAHA2.
> However, the "ssaha" format output file of SMALT differs slightly
> from the native SSAHA2 file.
> Luckily, the latest MIRA versions (3.2.1rc2 and 3.3.4) can handle the
> SMALT output files as well.
> (Compliments to Bastien for the quick response!)
>
>
> Some hints for using SMALT in MIRA:
>
> Identifying vector with SMALT is a two-step procedure.
>
> In the first step an index (idx) of search words has to be built from
> the vector/adapter sequences (smalt index).
> Then the sequencing reads are mapped against that index (smalt map):
>
>        smalt index -k 7 -s 1 idx vector.fasta
>
>        smalt map -f ssaha -d -1 -m 7 idx seqs.fasta  > seqs.ssaha_out
>
> (This assumes that you have your vector data in /your/path/vector.fasta
> and your sequence reads in /your/path/seqs.fasta.)
>
> The output file of SMALT (seqs.ssaha_out) can now be used along with
> the -CL:msvs=yes option in MIRA.
>
>
> IMPORTANT: The present (0.4.1) version of SMALT has a bug that often
> causes a segmentation fault when the "-f ssaha" (SSAHA output format)
> option is set.
>
> Get the latest (bug fixed) binaries from here:
>
>   ftp://ftp.sanger.ac.uk/pub/hp3/smalt-0.4.1.1.tgz
>
> (Or wait until the next version (0.4.2) is publicly available.)
>
>
> I made a test run of MIRA V3.2.1rc2 using 454 reads with SMALT clipping.
> Everything ran smoothly. The assembly of 516000 reads took 7 hours and
> 15 minutes on a Linux cluster node with 72 GB of RAM.
>
>
> Cheers
>
> Stephen
