[mira_talk] Fwd: Re: Filtering (size or quality) for Ion Torrent data

From: Benjamin Leopold <benjamin.leopold@xxxxxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Fri, 03 Aug 2012 10:37:12 +0200

Hi Nick,

We have had an identical problem here with torrent resulting file size.
As we often run on our local cluster which has a time limit, the
assemblies get stopped when that ends.

I've created a quick perl script that pulls out a random subset of
sequences from the fastq file.  You can select the end percent to reduce
the coverage to, e.g. 80%, etc.  The script is attached.

Complete agreement that selecting on quality will be a significant
improvement, but that one is still pending.

There are several other tools that can relative qualities of a fastq
file, such as Trim Galore.  I haven't tested it out yet, but that's also
on my ToDo list.
    http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

-Benjamin-

--------------------------------
Benjamin Leopold
BioInformatiker
benjamin.leopold@xxxxxxxxxxxxxxx
Poliklinik für Parodontologie
Universitätsklinikum Münster
Waldeyerstraße 30, 48149 Münster
T +49(0)251-83-49882
--------------------------------

On 08/03/2012 09:41 AM, Nicholas Heng wrote:
> 
> Hi Bastien and other MIRA-using Folk,
> 
> I have a dataset from an Ion Torrent 318 run that's about 480 Mb in size 
> (average readlength 177 bp).  My underpowered 8 GB bioinformatics machine 
> spent 7 solid days churning away at the data using MIRA 3.4 to no avail.
> 
> Can any of you next-gen sequencing gurus please suggest a program that would 
> allow me, one with absolutely no programming skill, to filter the dataset 
> into say <150 bp and >150 bp subsets such that it's suitable for MIRA to 
> handle?  Alternatively, a program that would sort by sequence quality... but 
> this may be harder.
> 
> The bacterial genome I'm sequencing is de novo (no reference) and it's about 
> 2 - 2.5 Mb in size.  There is a 454 run being done but I'd like subsets of 
> Ion Torrent for a hybrid assembly with MIRA.  And no, we can't afford a more 
> accurate Illumina or SOLiD run at the present time.
> 
> Any help is greatly appreciated.
> 
> Cheers,
> Nick.
> 
> 
> ================================
> Nicholas (Nick) C.K. Heng, Ph.D.
> Department of Oral Sciences
> Faculty of Dentistry
> University of Otago
> P.O. Box 647
> Dunedin 9054
> NEW ZEALAND.
> Ph: +643 4799254
> Fx: +643 4797078
> ================================
>

#!/usr/bin/perl 
#===============================================================================
#         FILE: reduce_seqfile_coverage.pl
#        USAGE: ./reduce_seqfile_coverage.pl -f sequence_filename [-n integer | 
-p percent] 
#                           [-o output file name] [-fmt file format (e.g. 
fastq)]
#
#  DESCRIPTION: picks out random set (default 1/4) of the read results from the 
fastq file,
#               then prints them out to a new file
#
#         TODO: Mod script to calc average qual of each seq and choose fraction 
of highest quals. (Using Bio::Seq::Quality ?)
#         TODO: Add option for use with paired end seq files.  (Currently only 
single-end.)
#         TODO: Allow more sequence filetypes; currently only fasta & fastq
#
# REQUIREMENTS: BioPerl, Pod, Getopt, File
#       AUTHOR: B.Leopold (cometsong)
# ORGANIZATION: Universitaetklinikum Muenster
#      CREATED: 2011-10-18
#===============================================================================

use strict;
use warnings;
use Bio::SeqIO;
use Pod::Usage;
use Getopt::Long;
use File::Basename;

# sets autoflush to "true" for STDOUT and the progress output will not be 
buffered
$| = 1;

# check for cmd line arguments
if (scalar @ARGV == 0) { pod2usage(-verbose => 1);}

# vars used in script
my ($base_file_name, $dirname); # hold pieces of sequence file name
my $outfile;                    # hold concatenated prefix and basename
my ($inSeq, $outSeq);           # seqio filename refs
my $seq_count = 0;              # count of sequences as processed, selected, 
written to outfile
my $total_seqs;                 # total num of seqs in file
my $one_percent_seqs;           # 1% of total_seqs for comparison

# set default vals for options
my $Opts = {
    'filename'    => undef,
    'out_prefix'  => 'reduced',
    'denominator' => '4',
    'percent'     => '0',     # default=0 for percent, as having a value will 
take precedence over num_of_seqs
    'format'      => 'fastq'
    };

# go get 'em
GetOptions($Opts,
    'filename|s=s',
    'out_prefix|o=s',
    'percent|p=f',
    'denominator|d=i',
    'format|fmt|f=s'
    ) || pod2usage(-verbose => 1);

# if percent not passed as arg, calculate it from passed (or default) 
denominator
$Opts->{'percent'} = $Opts->{'percent'} ? $Opts->{'percent'} : 
1/$Opts->{'denominator'};

# split dirname and basename of incoming seq filename
$base_file_name = basename($Opts->{'filename'});
$dirname        = dirname($Opts->{'filename'});
# concatenate "out_prefix.basefile_name"
$outfile = "$Opts->{'out_prefix'}.$base_file_name";


# Log to STDOUT the beginning of the run.
print "Random selection of a percentage of the sequence file. \n";

# print out all opt values
print  join("\n\t",
  "\tin file    :  ". $Opts->{'filename'},
#   "out prefix :  ". $Opts->{'out_prefix'},
    "out file   :  ". $outfile,
#   "fraction   :  1/". $Opts->{'denominator'},
    "percent    :  ". $Opts->{'percent'},
    "file format:  ". $Opts->{'format'}
    );
print "-> Beginning Time: ". localtime(time) ."\n";

# opening specified sequence files
$inSeq  = Bio::SeqIO->new( '-file'   => "<$Opts->{'filename'}"  # incoming seq 
file
                         , '-format' =>   $Opts->{'format'} );
$outSeq = Bio::SeqIO->new( '-file'   => ">$outfile"             # output seq 
file
                         , '-format' =>   $Opts->{'format'} );

# calc total seqs for progress printout (for debug!)
# (varies by seqfile format; right now only setup for fastq and fasta formats)
if ( $Opts->{'format'} eq 'fasta' ) {
    $total_seqs = scalar grep( "^>", 
file_contents_to_array($Opts->{'filename'}) );
}
elsif ( $Opts->{'format'} eq 'fastq' ) {
    $total_seqs = scalar grep( "^@[^@]", 
file_contents_to_array($Opts->{'filename'}) );
}
else { $total_seqs = 1000; } # default value.
$one_percent_seqs = roundnum( $total_seqs / 100 );
print "Total Seqs:  ", $total_seqs, "\n";
#(for debug!) print "One percent: ", $one_percent_seqs, "\n";


print "Progress: ";
# read next_seq, check (calculated or passed) percent, write_seq if within 
bounds
if ( $Opts->{'percent'} > 0.0 ) {
    while( my $seq = $inSeq->next_seq ) {
        $seq_count++;
        if ( int(rand(101)) <= $Opts->{'percent'}) {
            $outSeq->write_seq($seq);
        }
        print "." if ( ( $seq_count % $one_percent_seqs ) == 0 ) ;
    }
    print "\n";
}
print "-> Ending Time: ". localtime(time) ."\n";


# =item roundnum()
# 
# This function is passed a scalar with numeric value, then
# returns rounded off value based on first decimal value.  
# If passed scalar is in scientific notation (e.g. 4.34344-05)
# then it returns 0.
# =cut
sub roundnum {
    my ( $num ) = shift;
    my ( $int, $decimal ) = split(/\./,$num);
    my ( $round );

    if ( $decimal =~ /-/ ) {    # num is sci notation!
        $round = 0;
    } else {                    # check first decimal place < 5
        $round = substr($decimal, 0, 1) < 5 ? $int : $int + 1 ;
    }
    return $round;
} 

# =item file_contents_to_array()
# Read contents of file (passed filename) into an array of chomped lines
#   e.g. my @linearr = file_contents_to_array($filename);
# =cut
sub file_contents_to_array {
    my ( $filename ) = @_;
    my $FILE = fileopen( "<$filename" );
    my @lines;

    while ( my $line = <$FILE> ) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# =item fileopen()
# open file by name, return filehandle scalar
#   e.g. my $filehandle = fileopen(">".$filename);
# =cut
sub fileopen {
    open my $fh, "@_"
        or die "$0 : failed to open file '@_' : $!\n";
    return $fh;
}


=pod

=head1 NAME

reduce_seqfile_coverage.pl

=head1 SYNOPSIS

B<reduce_seqfile_coverage.pl> B<-s sequence_filename> 
                        [-d integer | -p decimal or integer] 
                        [-o output file name prefix] 
                        [-f file format (e.g. fastq)]

    -s      filename of the sequencing file to pick the fraction from
    -d      denominator of the fraction for sequence subset size ( i.e. '-n 
4'=1/4, 25=1/25, etc ), default: 4
    -p      percentage of the sequence file to keep (This having a value higher 
than '0' takes precedence over any passed denominator!)
    -o      prefix for output filename, default: fractional
    -f      format of sequence input and output sequence files,  default: fastq

    IMPORTANT: Do NOT use on Paired End sequence files!  This script randomly 
selects from the sequence data!

Follow-Ups:
- [mira_talk] Re: Fwd: Re: Filtering (size or quality) for Ion Torrent data
  - From: Shankar Manoharan

[mira_talk] Fwd: Re: Filtering (size or quality) for Ion Torrent data

Other related posts: