[mira_talk] Re: SSPACE TAB library information from MIRA caf or maf

  • From: Andreas Leimbach <andreas.leimbach@xxxxxxxxxxxxxxxx>
  • To: Peter Cock <p.j.a.cock@xxxxxxxxxxxxxx>, Robert Willows <robert.willows@xxxxxxxxx>
  • Date: Fri, 14 Feb 2014 11:44:34 +0100

Hi Peter and Robert,

thanks a lot for sharing your scripts! They both look great.

Getting the exact start/stop positions from CIGAR is really a drag, I don't have a better solution than Robert in just using TLEN.

I haven't submerged myself into scaffolding. In my view it doesn't make sense for Illumina paired-end, which I work with. I don't think it gives me any advantage on my bacterial permanent drafts. After all they're assembled with MIRA ;-)!

Cheers,
Andreas

P.S.: I wrote a Perl script to infer Illumina paired-end insert sizes and basic stats from a BAM/SAM file. I just uploaded it to GitHub:
https://github.com/aleimba/bac-genomics-scripts/blob/master/sam_insert-size/sam_insert-size.pl

Until I realized there's a fantastic Picard tool, CollectInsertSizeMetrics.jar, that does just that (and which I'd prefer):
http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics

So Peter, maybe there's some useful code in there and you might scavenge it, if you're still interested in updating your python script.

--
Andreas Leimbach
Universität Münster
Institut für Hygiene
Mendelstr. 7
D-48149 Münster
Germany

Tel.: +49 (0)551 39 33843
E-Mail: andreas.leimbach@xxxxxxxxxxxxxxxx

On 14.2.14 10:46, Peter Cock wrote:
On Fri, Feb 14, 2014 at 1:51 AM, Robert Willows
<robert.willows@xxxxxxxxx> wrote:
On 14 Feb 2014, at 12:33 pm, Robert Willows <robert.willows@xxxxxxxxx> wrote:

Hi Peter,

I've attached a shell script for making the TAB files from the miraconvert
produced sam file which has been depadded (instructions for depadding
are given in the script).

Thanks for sharing this.

Note 'samtools depad' can produce SAM output too, although I
prefer to produce and store a BAM file and turn it into SAM on
demand for parsing by piping 'samtools view' to stdin of a script.

The script tries to make 5 libraries and tells you
which ones have paired reads in them.

Your shell scripting is better than mine - there are some steps that I
don't quite get, but I can follow most of the logic.

I had started writing this in Python yesterday (still an unfinished
work in progress), but suspect for the assemblies I'm working on
the Illumina pairs are too short to help much with scaffolding:

https://github.com/peterjc/picobio/blob/master/sambam/sam_to_sspace_tab.py

One of the trickier parts is getting the end position of the mapping
(you have to parse the CIGAR string, which means handling both
lines for a read pair together - thus it is easier to work with a name
sorted file). Your code seems to instead take the simplification of
using len(SEQ), which is usually a reasonable approximation.

Producing one tab file for each read group was on my TO DO list,
although I may give up scaffolding for this particular project...

Also, I just found a problem in the sam file produced in the version 4.0
of miraconvert when making the sam file from the maf. So you need
to use miraconvert from version 4.0rc4 (release candidate #4) to
make the sam file which makes the sam file correctly. I'll post a new
thread about this.

Oops. My mistake 4.0 and 4.0rc4 sam files are both OK.
Sorry.

A false alarm - that's good :)

Peter


--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: