[mira_talk] Re: Request for comment: sff_extract defaults
- From: Jose Blanca <jblanca@xxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Wed, 04 Jul 2012 12:51:07 +0200
Hi:
We have created a new project for the new sff_extract. It is called
seq_crumbs.
https://github.com/JoseBlanca/seq_crumbs
The general idea is to include in there small utilities capable of
processing sequence files using Unix pipes. The idea came from the
original Unix pipes and from Biopieces. The main difference from
Biopieces is that in this case we pass text sequence files between the
binaries and that we want to ease installation and setup as much as
possible.
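To make the pipe idea concrete, here is a minimal sketch of a filter that reads text FASTQ from one handle and writes the surviving records to another. It is not code from seq_crumbs: the names `read_fastq` and `filter_by_length` are hypothetical, and it assumes plain four-line FASTQ records with no line wrapping.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (name, seq, qual) tuples from a four-line-per-record FASTQ stream."""
    lines = iter(handle)  # works for file objects and for plain lists of lines
    while True:
        block = list(islice(lines, 4))
        if not block:
            return
        if len(block) < 4:
            raise ValueError("truncated FASTQ record")
        name, seq, _plus, qual = (line.rstrip("\n") for line in block)
        yield name, seq, qual


def filter_by_length(in_handle, out_handle, min_length=1):
    """Copy records with at least min_length bases; return how many were kept."""
    kept = 0
    for name, seq, qual in read_fastq(in_handle):
        if len(seq) >= min_length:
            out_handle.write(f"{name}\n{seq}\n+\n{qual}\n")
            kept += 1
    return kept
```

Called as `filter_by_length(sys.stdin, sys.stdout, min_length=50)`, a script like this could sit between two other tools in a pipe.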
The project depends quite a lot on Biopython.
The new sff_extract does not have matepairs included, but you can
redirect its stdout to another binary named split_matepairs to get the
same result.
This is only preliminary work (don't expect perfection), but it would be
great if some of you could test it. We are open to bug reports and
suggestions. Our intention is to provide a useful little tool, as the
original sff_extract was.
We're still open to changing interfaces and to including more utilities.
We're looking forward to hearing from you.
Best regards,
Jose Blanca
On 16/06/12 15:16, Martin Mokrejs wrote:
Bastien Chevreux wrote:
Collecting feedback so far, I'm seeing a number of things in the responses.
First, the points that seem to have a strong consensus:
1. FASTQ is set. Yes, it would be Sanger style FASTQ.
2. not flipping reads in paired-data also seems to find approval.
3. default clips: fewer pros than cons, which would also be my choice. Largely
due to the fact that the last Roche SFFs I've seen *still* used only one clip
type and no distinction between quality and adaptor is possible there. Ion does
it right, so we could put on the wish-list a switch to only clip adaptors and
keep low qual in lowercase.
Definitely, I am all for this optional switch. It is easy with biopython to
fill in the adapter clip points as we discussed several times on the list here
and maybe biopython as well. It works fine. And, because one cannot slice the
SFF objects without losing flow info, one cannot remove the annotated adapter
from the SFF object/file. One is forced to export into fasta+qual or fastq if
the goal is to have files with low-qual regions in lowercase *without
adapters/MIDs* and high-qual regions devoid of adapters/MIDs in uppercase.
sff_extract2 will probably need another command-line switch to know whether
MIDs/adapters also on the left end are to be removed as well (or not). I think
mira will have to do it internally after separating samples by their MIDs
unless sff_extract2 splits the data into multiple files (c.f. 7 below?). So by
default I think MIDs and adapters will have to be retained on the left end,
along with the sequencing key.
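A minimal sketch of the "clip adapters only, keep low quality in lowercase" idea, in plain Python. The function name `clip_adapters_mask_quality` is hypothetical and this is not how sff_extract or Biopython implement it. It assumes SFF-style clip points: 1-based, inclusive, with 0 meaning "not set".

```python
def clip_adapters_mask_quality(seq, qual_left=0, qual_right=0,
                               adapter_left=0, adapter_right=0):
    """Remove adapter-clipped ends and lowercase the low-quality ends.

    Clip points are assumed to be 1-based and inclusive; 0 means unset.
    """
    n = len(seq)
    # Convert to 0-based, half-open Python slice bounds.
    a_start = adapter_left - 1 if adapter_left else 0
    a_end = adapter_right if adapter_right else n
    q_start = qual_left - 1 if qual_left else 0
    q_end = qual_right if qual_right else n
    # Keep the quality window inside the adapter-free window.
    q_start = min(max(q_start, a_start), a_end)
    q_end = max(min(q_end, a_end), q_start)
    return (seq[a_start:q_start].lower()
            + seq[q_start:q_end].upper()
            + seq[q_end:a_end].lower())


# Bases 1-4 and 17-20 fall outside the adapter clips and are removed;
# bases 5-6 and 15-16 fall outside the quality clips and are lowercased.
masked = clip_adapters_mask_quality("TCAGACGTACGTAAGGCCTT",
                                    qual_left=7, qual_right=14,
                                    adapter_left=5, adapter_right=16)
print(masked)  # acGTACGTAAgg
```

Leaving `adapter_left` at 0 keeps the sequencing key, MID and left adapter in place, which matches the default suggested above.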
Other points:
6. MID support: I won't implement that, Jose thinks it's better in a separate
tool.
Finding MIDs actually overlaps with adapter searching; the two have to go
hand in hand in one sweep. Exact searches for MIDs on the left end miss
about 2% of MIDs due to sequencing errors. It is a daunting task to get it
somewhat right. It is better to stick to sfffile if the user has it available.
It must be doing some extra magic to recover more MIDs.
A separate tool is really much more likely to appear.
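To illustrate why exact matching loses reads, here is a rough sketch of MID assignment that tolerates a few substitutions. It is only an assumption of how such a tool might start, not what sfffile does: the name `assign_mid` and the tag names are hypothetical, and it ignores insertions and deletions, which a real tool would have to handle for 454 data.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_mid(read, mids, key="TCAG", max_mismatches=1):
    """Return the name of the best-matching MID, or None if none or ambiguous.

    mids maps a tag name to its sequence. The MID is expected right after
    the sequencing key at the left end of the read.
    """
    read = read.upper()
    offset = len(key) if read.startswith(key) else 0
    hits = []
    for name, tag in mids.items():
        window = read[offset:offset + len(tag)]
        if len(window) < len(tag):
            continue  # read too short to contain this tag
        distance = hamming(window, tag)
        if distance <= max_mismatches:
            hits.append((distance, name))
    if not hits:
        return None
    hits.sort()
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None  # two tags fit equally well: do not guess
    return hits[0][1]


# Hypothetical tag set, for illustration only.
tags = {"tagA": "ACGAGTGCGT", "tagB": "ACGCTCGACA"}
read_with_error = "TCAG" + "ACGAGTCCGT" + "GGGGTTTT"  # one substitution in tagA
print(assign_mid(read_with_error, tags))                    # tagA
print(assign_mid(read_with_error, tags, max_mismatches=0))  # None
```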
7. splitting single reads / pairs to different files: good idea. However I
won't have time for this in the near future.
8. direct support of SFF in MIRA: also good idea but will not materialise
anytime soon, there are too many other burning things which need my immediate
attention.
I think you could use some portions of NCBI sra-toolkit libs provided there are
no license issues.
Martin
--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)
--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html