[mira_talk] Re: new sff_extract

  • From: Jose Blanca <jblanca@xxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 31 Oct 2012 11:04:18 +0100

On 31/10/12 10:48, Peter Cock wrote:
On Wed, Oct 31, 2012 at 9:39 AM, Jose Blanca <jblanca@xxxxxx> wrote:
Hi:

Some time ago we discussed on this list the future of sff_extract. We started
working on it and we have a version that we think is working.
The sff_extract functionality has been split in two, sff_extract and
split_matepairs, which can be linked together with a pipe. We haven't done
extensive testing, so if you use them, please let us know.
These utilities are bundled with some other little tools that we have
developed for our day-to-day work. They are all written in Python and they
use Biopython.
You can take a look at the development site:

https://github.com/JoseBlanca/seq_crumbs

Or our site:

http://bioinf.comav.upv.es/seq_crumbs/

Of course we'd love to have some feedback.
Best regards,

Hi Jose,

That looks very interesting - I'll forward this to the Biopython
list.

Great, I'm also on the Biopython list.

For those not aware of this, the Biopython SFF code was
based on Jose's original work for sff_extract - then reworked
as part of the Biopython parsing framework, made Python 3
compatible etc.

Jose - Is there anything you found missing in the Biopython
SFF code? For example a public API to get at the low-level
information from an SFF file rather than as Biopython objects?

Not really, because we only write the fastq.
Maybe we should talk about the Biopython API on biopython-devel, but we have had some minor gripes with the API (on small details):

- When a sequence with no description is read from a file, a "no description" string is set as the description. That's a problem when you write the file back. We have worked around that by setting the description in that case to be the same as the id, although in my opinion it would be better to have the option to set the description to None.
- It's not possible to modify the seq of a SeqRecord if the SeqRecord has per_letter_annotations, even if the new sequence has the correct length.
- The fastq indexers break down with some paired-end files because they have repeated ids. We have worked around that by modifying the indexers to work with the whole title lines.
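To illustrate the last point, here is a minimal sketch of indexing a FASTQ file by the whole title line rather than by the id alone. It is not the seq_crumbs or Biopython code: the function names (index_fastq_by_title, fetch_record) and the sample reads are hypothetical, and it assumes plain four-line FASTQ records with no line wrapping.

```python
import io


def index_fastq_by_title(handle):
    """Map each full title line (without the leading '@') to a byte offset.

    Assumes four-line FASTQ records and a handle opened in binary mode.
    """
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break  # end of file
        if not title.startswith(b"@"):
            raise ValueError("Expected a title line at offset %d" % offset)
        # Skip the sequence, '+' and quality lines of this record.
        for _ in range(3):
            handle.readline()
        key = title[1:].rstrip().decode("ascii")
        if key in index:
            raise ValueError("Duplicate title line: %s" % key)
        index[key] = offset
    return index


def fetch_record(handle, index, title):
    """Return (title, sequence, quality) for the record with this title."""
    handle.seek(index[title])
    lines = [handle.readline().rstrip().decode("ascii") for _ in range(4)]
    return lines[0][1:], lines[1], lines[3]


# Hypothetical paired-end reads: same id ("read1"), different descriptions.
data = (
    b"@read1 1:N:0\nACGT\n+\nIIII\n"
    b"@read1 2:N:0\nTTGA\n+\nHHHH\n"
)
handle = io.BytesIO(data)
index = index_fastq_by_title(handle)
second = fetch_record(handle, index, "read1 2:N:0")
```

Keying on the id alone would map both reads to "read1" and collide. The full title line keeps the two mates apart as long as their descriptions differ.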

I think that's it. In general Biopython is great, and I'm looking forward to having the new SearchIO and GFF stuff integrated in it.
Best regards,

Jose Blanca

Thanks,

Peter



--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)
