[mira_talk] Re: Request for comment: sff_extract defaults

Hi:

We have created a new project for the new sff_extract. It is called seq_crumbs.

https://github.com/JoseBlanca/seq_crumbs

The general idea is to collect small utilities that process sequence files through Unix pipes. The idea came from the original Unix pipes and from Biopieces. The main differences from Biopieces are that here we pass text sequence files between the binaries and that we want to make installation and setup as easy as possible.
The project depends quite a lot on Biopython.
The new sff_extract does not handle mate pairs itself, but you can redirect its stdout to another binary named split_matepairs to get the same result. This is only preliminary work (don't expect perfection), but it would be great if some of you could test it. We are open to bug reports and suggestions. Our intention is to provide a useful little tool, as the original sff_extract was. We're still open to changing the interfaces and to including more utilities. We're looking forward to hearing from you.
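
To illustrate the pipe-friendly design described above, here is a minimal sketch of a filter that reads FASTQ text on stdin and writes FASTQ text on stdout. It is not part of seq_crumbs; the names read_fastq and filter_by_length and the min_len parameter are made up for the illustration, and it assumes plain four-line FASTQ records.

```python
import sys


def read_fastq(lines):
    """Yield (name, seq, qual) from an iterable of FASTQ lines.

    Assumes four lines per record (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").rstrip("\n")
        yield header.strip()[1:], seq, qual


def filter_by_length(lines, min_len=50):
    """Yield FASTQ-formatted records whose sequence has at least min_len bases."""
    for name, seq, qual in read_fastq(lines):
        if len(seq) >= min_len:
            yield "@%s\n%s\n+\n%s\n" % (name, seq, qual)


if __name__ == "__main__":
    # Reads from stdin and writes to stdout, so it can sit in the middle of a pipe.
    sys.stdout.writelines(filter_by_length(sys.stdin))
```

Because the script only touches stdin and stdout, it could be placed between any two tools that speak FASTQ text.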

Best regards,

Jose Blanca


On 16/06/12 15:16, Martin Mokrejs wrote:
Bastien Chevreux wrote:

Collecting feedback so far, I'm seeing a number of things in the responses.

First the points which look like having a strong consensus:
1. FASTQ is set. Yes, it would be Sanger style FASTQ.
2. not flipping reads in paired-data also seems to find approval.
3. default clips: fewer pros than cons, which would also be my choice. Largely 
due to the fact that the last Roche SFFs I've seen *still* used only one clip 
type and no distinction between quality and adaptor is possible there. Ion does 
it right, so we could put on the wish-list a switch to only clip adaptors and 
keep low qual in lowercase.
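
As a reminder of what "Sanger style" in point 1 means in practice: each Phred score is written as the ASCII character with code score + 33. The sketch below shows that encoding; the function names are invented for the example and this is not the sff_extract code.

```python
def phred_to_sanger(quals):
    """Encode integer Phred scores as a Sanger-style FASTQ quality string."""
    # Sanger FASTQ uses an ASCII offset of 33; scores are clamped to 0..93
    # so that every character stays in the printable range '!'..'~'.
    return "".join(chr(max(0, min(q, 93)) + 33) for q in quals)


def format_fastq(name, seq, quals):
    """Return one four-line FASTQ record as a string."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return "@%s\n%s\n+\n%s\n" % (name, seq, phred_to_sanger(quals))
```

For example, the scores 0, 40 and 93 become the characters "!", "I" and "~".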

Definitely, I am all for this optional switch. It is easy with Biopython to 
fill in the adapter clip points, as we discussed several times on this list 
and maybe on the Biopython list as well. It works fine. And, because one cannot 
slice the SFF objects without losing flow info, one cannot remove the annotated 
adapter from the SFF object/file. One is forced to export into fasta+qual or 
fastq if the goal is to have files with low-qual regions in lowercase *without 
adapters/MIDs* and high-qual regions devoid of adapters/MIDs in uppercase.
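
A minimal sketch of the export described above: cut away everything outside the adapter clip, then lowercase what remains outside the quality clip. The function name and the use of plain Python half-open (start, end) spans are assumptions made for this example; clip points read from a real SFF header follow their own convention and would need converting first.

```python
def trim_adapters_mask_quality(seq, quals, adapter_span, quality_span):
    """Drop bases outside adapter_span; lowercase kept bases outside quality_span.

    Both spans are hypothetical 0-based, half-open (start, end) index pairs.
    Returns the new sequence string and the matching slice of quals.
    """
    a_start, a_end = adapter_span
    q_start, q_end = quality_span
    if not 0 <= a_start <= a_end <= len(seq):
        raise ValueError("adapter span outside the read")
    out = []
    for pos in range(a_start, a_end):
        base = seq[pos]
        # Uppercase marks high quality, lowercase marks low quality.
        out.append(base.upper() if q_start <= pos < q_end else base.lower())
    return "".join(out), list(quals[a_start:a_end])
```

With the read "TCAGAACCGGTTACGT", an adapter span of (4, 14) and a quality span of (6, 12), the result is "aaCCGGTTac".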

sff_extract2 will probably need another command-line switch to say whether 
MIDs/adapters on the left end are to be removed as well (or not). I think 
mira will have to do it internally after separating samples by their MIDs, 
unless sff_extract2 splits the data into multiple files (cf. 7 below?). So by 
default I think MIDs and adapters will have to be retained on the left end, 
along with the sequencing key.

Other points:
6. MID support: I won't implement that, Jose thinks it's better in a separate 
tool.

Finding MIDs is actually a task that overlaps with adapter searches; they have 
to go hand in hand in one sweep. Exact searches for MIDs on the left end miss 
about 2% of MIDs due to sequencing errors. It is a daunting task to get it 
somehow right. It is better to stick to sfffile if the user has it available. 
It must be doing some extra magic to recover more MIDs.
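
To show why exact matching loses reads, here is a rough sketch that tolerates a limited number of substitutions when looking for a MID right after the sequencing key. It is not what sfffile does; the function names, the "TCAG" key default and the two example tags are assumptions for illustration only. It also ignores insertions and deletions, so it would still miss some reads.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_mid(read, mids, key="TCAG", max_mismatches=1):
    """Return the name of the only MID matching after the key, or None.

    mids maps a sample name to its tag sequence. The key itself is not
    checked; it is only used to know where the tag should start.
    """
    start = len(key)
    hits = []
    for name, tag in mids.items():
        window = read[start:start + len(tag)].upper()
        if len(window) == len(tag) and hamming(window, tag) <= max_mismatches:
            hits.append(name)
    # Ambiguous reads (several tags within the threshold) are left unassigned.
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    example_mids = {"tagA": "ACGAGTGCGT", "tagB": "ACGCTCGACA"}  # illustrative tags
    print(assign_mid("TCAGACGAGTGCGAGGGG", example_mids))
```

Setting max_mismatches=0 reproduces the exact search, which leaves a read with a single substitution in its tag unassigned.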

A separate tool is really much more likely to appear.

7. splitting single reads / pairs to different files: good idea. However I 
won't have time for this in the near future.
8. direct support of SFF in MIRA: also good idea but will not materialise 
anytime soon, there are too many other burning things which need my immediate 
attention.

I think you could use some portions of NCBI sra-toolkit libs provided there are 
no license issues.

Martin



--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
