[mira_talk] Re: Total consensus length is double ?

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 12 Jul 2010 20:40:24 +0200

On Montag 12 Juli 2010 Jan van Haarst wrote:
> I have converted the datasets to SFF using sff-dump from NCBI.
> Like this :
> sff-dump -A SRR001028 -D ena/SRR001028/
> That is using the sff-dump that is in
> http://www.ncbi.nlm.nih.gov/Traces/sra/static/sra_toolkit-1.0.0-b7-glibc.6-
> x86_64.tar.gz

Ah ... then I know what happened. Paired-end at the SRA is not equal to 
paired-end in the SFF. Read below.

> I'll process the assembly with convert_project, and see whether I can get
> numbers that look more realistic.

This will not help ... you will still end up at about twice the expected size 
of the genome. Here's why:

the SFFs you created via sff-dump are not identical in information as the ones 
created by Roche/454. For unpaired reads it's no big deal, but for paired-end 
it's a bit of a catastrophe.

The sequencing technology of Roche creates "paired-end reads" which are, in 
fact, sequenced and stored as one read, just separated by a linker. In theory, 
in practice it's even more troublesome. But the theoretical PE read from 454 
looks like this:

>nameseq1
... 5'-3'seq1 ... linker ... 5'-3'seq2 ...

Now, Roche/454 was too lazy (or they did it on purpose) to provide software to 
actually split this into something more amenable for existing assembly 
programs at the time, this would have been something like this:

>nameseq1_f
... 5'-3'seq1 ...
>nameseq1_r
... 5'-3'seq2 ...

The second "problem" with 454 PE data is, that the paired sequences are in 
5'-3', 5'-3' reading direction, whereas Sanger PE reads are all in 5'-3', 
3'-5' order.

So, when programming the PE functionality to sff_extract, I added a whole 
bunch of extra-checks to catch faulty paired-end data ... and to get the 
direction of the paired-end data right for MIRA by reversing one of the two 
sequences. The result form sff_extract is this:

>nameseq1_f
... 5'-3'seq1 ...
>nameseq1_r
... 3'-5'seq2 ...

The data at the SRA is also cleaned from the linker. However, they do not 
reverse one of the paired-end mates, they store:

>nameseq1_f
... 5'-3'seq1 ...
>nameseq1_r
... 5'-3'seq2 ...

To re-create an equivalent SFF to what Roche/454 dumps out from the machine, 
you would need to re-create the 

  5'-3'seq -- linker -- 3'-5'seq

for all paired reads, the make the SFF, then run sff_extract.

Which is obviously a bit of overkill, because working with the FASTQs from the 
SRA is also entirely feasible with MIRA, though I don't trust the SRA 
splitting of paired-end data as much as I trust the one from sff_extract.

> And I'm of course open for any suggestions.

It's a matter of getting some incantation right. Not documented yet, but I'll 
explain. Let me download the data from the SRA and just try out one or two 
things, then I'll get back to you.

And then I need to document the SRA direction problem ... even more fun when I 
think that Solexa has both types of libraries (5-3,3-5 and 5-3,5-3). Joy.

B.

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: