Hi Peter and Robert, thanks a lot for sharing your scripts! They both look great.Getting the exact start/stop positions from CIGAR is really a drag, I don't have a better solution than Robert in just using TLEN.
I haven't submerged myself into scaffolding. In my view it doesn't make sense for Illumina paired-end, which I work with. I don't think it gives me any advantage on my bacterial permanent drafts. After all they're assembled with MIRA ;-)!
Cheers, AndreasP.S.: I wrote a Perl script to infer Illumina paired-end insert sizes and basic stats from a BAM/SAM file. I just uploaded it to GitHub:
https://github.com/aleimba/bac-genomics-scripts/blob/master/sam_insert-size/sam_insert-size.plUntil I realized there's a fantastic Picard tool, CollectInsertSizeMetrics.jar, that does just that (and which I'd prefer):
http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetricsSo Peter, maybe there's some useful code in there and you might scavenge it, if you're still interested in updating your python script.
-- Andreas Leimbach Universität Münster Institut für Hygiene Mendelstr. 7 D-48149 Münster Germany Tel.: +49 (0)551 39 33843 E-Mail: andreas.leimbach@xxxxxxxxxxxxxxxx On 14.2.14 10:46, Peter Cock wrote:
On Fri, Feb 14, 2014 at 1:51 AM, Robert Willows <robert.willows@xxxxxxxxx> wrote:On 14 Feb 2014, at 12:33 pm, Robert Willows <robert.willows@xxxxxxxxx> wrote:Hi Peter, I've attached a shell script for making the TAB files from the miraconvert produced sam file which has been depadded (instructions for depadding are given in the script).Thanks for sharing this. Note 'samtools depad' can produce SAM output too, although I prefer to produce and store a BAM file and turn it into SAM on demand for parsing by piping 'samtools view' to stdin of a script.The script tries to make 5 libraries and tells you which ones have paired reads in them.Your shell scripting is better than mine - there are some steps that I don't quite get, but I can follow most of the logic. I had started writing this in Python yesterday (still an unfinished work in progress), but suspect for the assemblies I'm working on the Illumina pairs are too short to help much with scaffolding: https://github.com/peterjc/picobio/blob/master/sambam/sam_to_sspace_tab.py One of the trickier parts is getting the end position of the mapping (you have to parse the CIGAR string, which means handling both lines for a read pair together - thus it is easier to work with a name sorted file). Your code seems to instead take the simplification of using len(SEQ), which is usually a reasonable approximation. Producing one tab file for each read group was on my TO DO list, although I may give up scaffolding for this particular project...Also, I just found a problem in the sam file produced in the version 4.0 of miraconvert when making the sam file from the maf. So you need to use miraconvert from version 4.0rc4 (release candidate #4) to make the sam file which makes the sam file correctly. I'll post a new thread about this.Oops. My mistake 4.0 and 4.0rc4 sam files are both OK. Sorry.A false alarm - that's good :) Peter
-- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html