[mira_talk] Re: PacBio CCS questions

From: Chris Hoefler <hoeflerb@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Fri, 16 Aug 2013 18:58:06 -0500

> Bonus question: are PB adaptor sequences listed somewhere on the net? The
> only place I found some are in the metadata XML files, and they told me
>    ATCTCTCTCttttcctcctcctccgttgttgttgttGAGAGAGAT
>
> Are there others?

I think that is the only adapter in use at the moment.

Do either of these help?
https://s3.amazonaws.com/files.pacb.com/pdf/Guide_Pacific_Biosciences_Template_Preparation_and_Sequencing.pdf
http://www.smrtcommunity.com/servlet/servlet.FileDownload?file=00P7000000HYU49EAH

These are the only official documents from PacBio that I could find about
their adapter sequences and barcodes.
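
For reads where the adapter was apparently missed, one rough way to look for
it is an approximate search for the sequence quoted above, on both strands.
The following is only a minimal sketch in plain Python, not anything taken
from the PacBio software: the function names and the 25% error allowance are
my own assumptions, chosen because raw reads are too noisy for exact
matching.

```python
# Adapter sequence as quoted from the metadata XML files in this thread.
ADAPTER = "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT"


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def best_hit(pattern, text):
    """Best approximate occurrence of pattern anywhere in text.

    Semi-global edit distance: the match may start and end at any
    position in text. Returns (edit_distance, end_position_in_text).
    """
    pattern = pattern.upper()
    text = text.upper()
    m = len(pattern)
    prev = list(range(m + 1))   # column for the empty text prefix
    best = (m, 0)
    for j, text_base in enumerate(text, start=1):
        cur = [0]               # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == text_base else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitution
                           prev[i] + 1,         # extra base in the read
                           cur[i - 1] + 1))     # adapter base missing
        if cur[m] < best[0]:
            best = (cur[m], j)
        prev = cur
    return best


def find_adapter(read, max_error_rate=0.25):
    """Return (distance, end, strand) of the best adapter hit, or None.

    max_error_rate is a hypothetical tolerance, not a PacBio value.
    """
    limit = int(max_error_rate * len(ADAPTER))
    hits = []
    for strand, pattern in (("+", ADAPTER), ("-", revcomp(ADAPTER))):
        dist, end = best_hit(pattern, read)
        if dist <= limit:
            hits.append((dist, end, strand))
    return min(hits) if hits else None
```

A hit in the middle of a read that was reported as a single subread would be
a candidate for an unrecognised adapter. The search is quadratic in pure
Python, so for >30k reads a proper aligner would be the better choice.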

> Background: I'm working on the read improvement routines atm and I think
> that in the 49 PB reads I took as initial test set (out of >30k from the
> E.coli Nature paper), already two reads show such an inversion where there
> should be none … ergo it's a sequencing artefact and 4% of reads like this
> will wreak havoc with most assembly algorithms. I hate situations like
> these.

How long are these chimeras? The worst offenders can probably be removed by
filtering read lengths and quality scores. But apparently these artifacts
do appear in longer reads at a non-negligible level as a result of the way
the libraries are constructed.
http://www.microbiomejournal.com/content/1/1/10
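
As a rough illustration of that kind of filter, here is a minimal sketch.
The thresholds (minimum length 500, minimum mean quality 10) and the
function name are placeholders I made up for the example, not recommended
values, and reads are assumed to be (name, sequence, qualities) tuples
with one numeric quality value per base.

```python
def filter_reads(reads, min_length=500, min_mean_quality=10.0):
    """Keep reads that pass a length and a mean-quality threshold.

    reads: iterable of (name, sequence, qualities) tuples, where
    qualities is a list of per-base quality values.
    Both thresholds are hypothetical defaults for illustration.
    """
    kept = []
    for name, seq, quals in reads:
        if len(seq) < min_length:
            continue                      # too short
        if not quals or sum(quals) / len(quals) < min_mean_quality:
            continue                      # overall quality too low
        kept.append((name, seq, quals))
    return kept
```

This only catches the reads that stand out on length or overall quality, so
it would not help with the chimeras in longer, otherwise good reads.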

The PacBioToCA paper puts the number at ~2.5%. HGAP gets rid of these
during the preassembly step by looking at the quality of the error
correction. If there is a chimeric seed read, the short reads won't align
across the junction of the inversion, resulting in a "coverage gap" in the
preassembler alignment. These gaps are identified by a low consensus
quality in the middle of the read. A filtering script splits the read at
this low quality region and trims the ends back to the high quality region.
That way you don't have to get rid of the read entirely and can still make
use of the non-inverted portions.
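
The split-and-trim idea can be sketched as follows. This is not the HGAP
filtering script itself, just a minimal illustration of the principle; the
function name and the thresholds (quality 20, minimum piece length 500) are
assumptions of mine.

```python
def split_at_low_quality(seq, quals, min_qual=20, min_length=500):
    """Split a corrected read into its high-quality stretches.

    seq:   consensus sequence of the corrected read
    quals: per-base consensus quality values, same length as seq
    Returns a list of (start, end, subsequence) with end exclusive.
    Stretches shorter than min_length are dropped.
    Thresholds are hypothetical, not the values HGAP uses.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and qualities differ in length")
    segments = []
    start = None                          # start of current good stretch
    for i, q in enumerate(quals):
        if q >= min_qual:
            if start is None:
                start = i
        elif start is not None:
            # a low-quality base ends the current stretch
            if i - start >= min_length:
                segments.append((start, i, seq[start:i]))
            start = None
    if start is not None and len(seq) - start >= min_length:
        segments.append((start, len(seq), seq[start:]))
    return segments
```

Note that a single low-quality base splits the read in this version. In
practice you would probably want to smooth the quality values over a window
first, so that only a real coverage gap triggers a split.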

On Fri, Aug 16, 2013 at 2:50 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

> On Aug 15, 2013, at 1:34 , Matthew D. Pagel <pagel@xxxxxxx> wrote:
> > Is there a quick-and-dirty algorithm out there for identifying
> inversions from
> > one subread to the next within a single PB read
>
> I'd have a more pressing, but similar question at the moment: is there a
> way of easily identifying reads which form such a FR structure but where the
> PB algorithms apparently did not recognise an adapter?
>
> Background: I'm working on the read improvement routines atm and I think
> that in the 49 PB reads I took as initial test set (out of >30k from the
> E.coli Nature paper), already two reads show such an inversion where there
> should be none … ergo it's a sequencing artefact and 4% of reads like this
> will wreak havoc with most assembly algorithms. I hate situations like
> these.
>
> Bonus question: are PB adaptor sequences listed somewhere on the net? The
> only place I found some are in the metadata XML files, and they told me
>    ATCTCTCTCttttcctcctcctccgttgttgttgttGAGAGAGAT
>
> Are there others?
>
> B.

-- 
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128
