[mira_talk] Re: Getting coverage by each technology per each contig

From: Martin MOKREJŠ <mmokrejs@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Wed, 21 May 2014 21:18:32 +0200

Ehm, version with correct example output attached.

Martin MOKREJŠ wrote:
> Hi Bastien,
>   thank your for encouraging me to dive into this myself. ;) Was quite 
> simple. Attached is what
> I stitched in a half an hour, it works for my purpose although some bits 
> should be moved to
> separate functions. Anyway, if you want you can include it in the mira 
> bundle, unless you are going
> to write the "whole" thing yourself (which would be a good idea).
>   The numpy dependency is not ideal, for doing using median() and average() 
> it is not pretty. But
> in overall, it works.
> 
>   Would *_contigstats.txt contain one more column with the sequencing 
> technology abbreviated, things
> would have been even easier.
> 
> Martin
> 
> Bastien Chevreux wrote:
>> On 20 May 2014, at 12:27 , Martin MOKREJŠ <mmokrejs@xxxxxxxxx> wrote:
>>>  did anybody try to write some script to calculate how many reads were used 
>>> for each resulting contig? Or even better, getting coverage by each 
>>> technology for a given contig? I think it would be helpful to have these in 
>>> the resulting FASTA files. I think it was already mentioned on this list 
>>> but I just cannot find it.
>>
>> I do not remember seeing this on the list. I hope that this isn’t a sign of 
>> my memory starting to fail me … :-)
>>
>>>  What I am really after is to split the resulting contigs into those 
>>> specific to one or another technology. Looks mira_convert cannot do this 
>>> right away but I hope to collect the contig names first and then ask just 
>>> for them. So, in two executions could do the job I think.
>>
>> There is indeed nothing you could use out of the box from the MIRA package.
>>
>> The easiest solution I can come up with atm is if you parse the contigreads 
>> file in the info directory and categorise reads using a hopefully common 
>> name identifier per technology. That would enable you to create lists of 
>> contigs which are formed by one technology only (or predominantly by one 
>> technology if you want).
>>
>> If that does not work for you, then you would need to parse the MAF file. 
>> The documentation for it in the MIRA docs is only for MAF v1, but MAF v2 has 
>> not changed much, it just added the concept of readgroups and a couple of 
>> other, quite minor changes. Just ask if you need help with that.

#! /usr/bin/env python

"""Calculate some basic statistics of contigs separated by sequencing 
technology. Currently it supports
only Illumina vs. 454 discrimination. Addition of other technologies is simple.


The program reads these two fiels created by mira (4.0.2 version tested):

  '$prefix'_contigreadlist.txt
  '$prefix'_contigstats.txt


The program creates 3 output files: 
  '$prefix'_454_contigreadlist.txt
  '$prefix'_Illumina_contigreadlist.txt
  '$prefix'_mixed_contigreadlist.txt


Finally, it writes on STDOUT something like the following:

454-specific contigs:         42451;  median and average reads per contig:  2,  
 6.39;  median and average contig length: 606, 675.06
Illumina-specific contigs:   102629;  median and average reads per contig:  8,  
54.40;  median and average contig length: 192, 380.18
Covered by both technologies: 75785;  median and average reads per contig: 15, 
113.50;  median and average contig length: 522, 684.61
All contigs in the assembly: 220865;  median and average reads per contig:  6,  
47.37;  median and average contig length: 319, 433.58


Usage:
Change your working directory to the *_d_info directory created by mira, and 
execute this utility.

mira_contigs_stats.py --prefix=Myproject_454_plus_Illumina_info 
--prefixes-454='GOCZ8YT01 GOCZ8YT02' --prefixes-illumina='81N0DABXX'
"""

myversion = 20140521

import os
from numpy import median, average
from optparse import OptionParser

myparser = OptionParser(version="%s version %s, written by Martin Mokrejs" % 
('%prog', myversion))
myparser.add_option("--prefixes-454", action="store", type="string", 
dest="prefixes_454", default='',
    help="Space separated prefixes or readnames originating from Roche 
454-technology.")
myparser.add_option("--prefixes-illumina", action="store", type="string", 
dest="prefixes_illumina", default='',
    help="Space separated prefixes of Illumina-specific reads.")
myparser.add_option("--debug", action="store", type="int", dest="debug", 
default=0,
    help="Set debug to some value")
myparser.add_option("--prefix", action="store", type="string", 
dest="mira_info_prefix", default='',
    help="Filename prefix used by mira for *_d_info/ directory contents.")

(myoptions, myargs) = myparser.parse_args()


def split_argument_values(somestring):
    "Split the string with values separated by spaces into a list."

    if not somestring:
        return []
    elif ' ' in somestring.strip():
        return somestring.strip().split()
    else:
        return [ somestring ]


if __name__ == "__main__":
    if not myoptions.mira_info_prefix:
        raise RuntimeError, "Please provide --prefix so that $prefix + 
_contigreadlist.txt can be merged into a resulting string."

    if not os.path.exists(myoptions.mira_info_prefix + '_contigreadlist.txt'):
        raise RuntimeError, "Cannot find file %s" % myoptions.mira_info_prefix 
+ '_contigreadlist.txt'
    _contigreadlist = open(myoptions.mira_info_prefix + '_contigreadlist.txt')

    myoptions.prefixes_454 = split_argument_values(myoptions.prefixes_454)
    myoptions.prefixes_illumina = 
split_argument_values(myoptions.prefixes_illumina)

    _illumina_contigs = {}
    _454_contigs = {}
    _all_contigs = {}

    while 1:
        _line = _contigreadlist.readline()
        if not _line:
            break

        if _line[0] != '#':
            _contigname, _readname = _line.rstrip().split()
            if filter(lambda _prefix: _readname.startswith(_prefix), 
myoptions.prefixes_illumina):
                if _contigname in _illumina_contigs:
                    _illumina_contigs[_contigname] += 1
                else:
                    _illumina_contigs[_contigname] = 1
            elif filter(lambda _prefix: _readname.startswith(_prefix), 
myoptions.prefixes_454):
                if _contigname in _454_contigs:
                    _454_contigs[_contigname] += 1
                else:
                    _454_contigs[_contigname] = 1
            else:
                # add here some other technology
                pass

            try:
                _all_contigs[_contigname] += 1
            except:
                _all_contigs[_contigname] = 1

    _contigstats = open(myoptions.mira_info_prefix + '_contigstats.txt')
    _contigsizes = {}
    while 1:
        _line = _contigstats.readline()
        if not _line:
            break

        if _line[0] != '#':
            _contigname, _contigsize = _line.split()[:2]
            _contigsizes[_contigname] = int(_contigsize)



    _outfile = open(myoptions.mira_info_prefix + '_454_contigreadlist.txt', 'w')
    map(lambda x: _outfile.write(x + '\n'), _illumina_contigs)
    _outfile.close()

    _outfile = open(myoptions.mira_info_prefix + 
'_Illumina_contigreadlist.txt', 'w')
    map(lambda x: _outfile.write(x + '\n'), _454_contigs)
    _outfile.close()

    _454 = set(_454_contigs.keys())
    _illumina = set(_illumina_contigs.keys())

    _mixed = _454.intersection(_illumina)
    _mixed_contigs = {}
    _handle = map(lambda x: _mixed_contigs.__setitem__(x, _454_contigs[x] + 
_illumina_contigs[x]), _mixed)

    _outfile = open(myoptions.mira_info_prefix + '_mixed_contigreadlist.txt', 
'w')
    map(lambda x: _outfile.write(x + '\n'), _mixed_contigs)
    _outfile.close()

    _454_specific = _454 - _illumina
    _illumina_specific = _illumina - _454

    print "454-specific contigs: %d; median and average reads per contig: %d, 
%.2f; median and average contig length: %d, %.2f" % (len(_454_specific), 
median(_454_contigs.values()), average(_454_contigs.values()), 
median([_contigsizes[x] for x in _454_contigs.keys()]), 
average([_contigsizes[x] for x in _454_contigs.keys()]))
    print "Illumina-specific contigs: %d; median and average reads per contig: 
%d, %.2f; median and average contig length: %d, %.2f" % 
(len(_illumina_specific), median(_illumina_contigs.values()), 
average(_illumina_contigs.values()), median([_contigsizes[x] for x in 
_illumina_contigs.keys()]), average([_contigsizes[x] for x in 
_illumina_contigs.keys()]))
    print "Covered by both technologies: %d; median and average reads per 
contig: %d, %.2f; median and average contig length: %d, %.2f" % (len(_mixed), 
median(_mixed_contigs.values()), average(_mixed_contigs.values()), 
median([_contigsizes[x] for x in _mixed_contigs.keys()]), 
average([_contigsizes[x] for x in _mixed_contigs.keys()]))
    print "All contigs in the assembly: %d; median and average reads per 
contig: %d, %.2f; median and average contig length: %d, %.2f" % 
(len(_all_contigs), median(_all_contigs.values()), 
average(_all_contigs.values()), median(_contigsizes.values()), 
average(_contigsizes.values()))

References:
- [mira_talk] Getting coverage by each technology per each contig
  - From: Martin MOKREJŠ
- [mira_talk] Re: Getting coverage by each technology per each contig
  - From: Bastien Chevreux
- [mira_talk] Re: Getting coverage by each technology per each contig
  - From: Martin MOKREJŠ

[mira_talk] Re: Getting coverage by each technology per each contig

Other related posts: