Hi Nick, We have had an identical problem here with torrent resulting file size. As we often run on our local cluster which has a time limit, the assemblies get stopped when that ends. I've created a quick perl script that pulls out a random subset of sequences from the fastq file. You can select the end percent to reduce the coverage to, e.g. 80%, etc. The script is attached. Complete agreement that selecting on quality will be a significant improvement, but that one is still pending. There are several other tools that can relative qualities of a fastq file, such as Trim Galore. I haven't tested it out yet, but that's also on my ToDo list. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ -Benjamin- -------------------------------- Benjamin Leopold BioInformatiker benjamin.leopold@xxxxxxxxxxxxxxx Poliklinik für Parodontologie Universitätsklinikum Münster Waldeyerstraße 30, 48149 Münster T +49(0)251-83-49882 -------------------------------- On 08/03/2012 09:41 AM, Nicholas Heng wrote: > > Hi Bastien and other MIRA-using Folk, > > I have a dataset from an Ion Torrent 318 run that's about 480 Mb in size > (average readlength 177 bp). My underpowered 8 GB bioinformatics machine > spent 7 solid days churning away at the data using MIRA 3.4 to no avail. > > Can any of you next-gen sequencing gurus please suggest a program that would > allow me, one with absolutely no programming skill, to filter the dataset > into say <150 bp and >150 bp subsets such that it's suitable for MIRA to > handle? Alternatively, a program that would sort by sequence quality... but > this may be harder. > > The bacterial genome I'm sequencing is de novo (no reference) and it's about > 2 - 2.5 Mb in size. There is a 454 run being done but I'd like subsets of > Ion Torrent for a hybrid assembly with MIRA. And no, we can't afford a more > accurate Illumina or SOLiD run at the present time. > > Any help is greatly appreciated. > > Cheers, > Nick. > > > ================================ > Nicholas (Nick) C.K. Heng, Ph.D. > Department of Oral Sciences > Faculty of Dentistry > University of Otago > P.O. Box 647 > Dunedin 9054 > NEW ZEALAND. > Ph: +643 4799254 > Fx: +643 4797078 > ================================ >
#!/usr/bin/perl #=============================================================================== # FILE: reduce_seqfile_coverage.pl # USAGE: ./reduce_seqfile_coverage.pl -f sequence_filename [-n integer | -p percent] # [-o output file name] [-fmt file format (e.g. fastq)] # # DESCRIPTION: picks out random set (default 1/4) of the read results from the fastq file, # then prints them out to a new file # # TODO: Mod script to calc average qual of each seq and choose fraction of highest quals. (Using Bio::Seq::Quality ?) # TODO: Add option for use with paired end seq files. (Currently only single-end.) # TODO: Allow more sequence filetypes; currently only fasta & fastq # # REQUIREMENTS: BioPerl, Pod, Getopt, File # AUTHOR: B.Leopold (cometsong) # ORGANIZATION: Universitaetklinikum Muenster # CREATED: 2011-10-18 #=============================================================================== use strict; use warnings; use Bio::SeqIO; use Pod::Usage; use Getopt::Long; use File::Basename; # sets autoflush to "true" for STDOUT and the progress output will not be buffered $| = 1; # check for cmd line arguments if (scalar @ARGV == 0) { pod2usage(-verbose => 1);} # vars used in script my ($base_file_name, $dirname); # hold pieces of sequence file name my $outfile; # hold concatenated prefix and basename my ($inSeq, $outSeq); # seqio filename refs my $seq_count = 0; # count of sequences as processed, selected, written to outfile my $total_seqs; # total num of seqs in file my $one_percent_seqs; # 1% of total_seqs for comparison # set default vals for options my $Opts = { 'filename' => undef, 'out_prefix' => 'reduced', 'denominator' => '4', 'percent' => '0', # default=0 for percent, as having a value will take precedence over num_of_seqs 'format' => 'fastq' }; # go get 'em GetOptions($Opts, 'filename|s=s', 'out_prefix|o=s', 'percent|p=f', 'denominator|d=i', 'format|fmt|f=s' ) || pod2usage(-verbose => 1); # if percent not passed as arg, calculate it from passed (or default) denominator $Opts->{'percent'} = $Opts->{'percent'} ? $Opts->{'percent'} : 1/$Opts->{'denominator'}; # split dirname and basename of incoming seq filename $base_file_name = basename($Opts->{'filename'}); $dirname = dirname($Opts->{'filename'}); # concatenate "out_prefix.basefile_name" $outfile = "$Opts->{'out_prefix'}.$base_file_name"; # Log to STDOUT the beginning of the run. print "Random selection of a percentage of the sequence file. \n"; # print out all opt values print join("\n\t", "\tin file : ". $Opts->{'filename'}, # "out prefix : ". $Opts->{'out_prefix'}, "out file : ". $outfile, # "fraction : 1/". $Opts->{'denominator'}, "percent : ". $Opts->{'percent'}, "file format: ". $Opts->{'format'} ); print "-> Beginning Time: ". localtime(time) ."\n"; # opening specified sequence files $inSeq = Bio::SeqIO->new( '-file' => "<$Opts->{'filename'}" # incoming seq file , '-format' => $Opts->{'format'} ); $outSeq = Bio::SeqIO->new( '-file' => ">$outfile" # output seq file , '-format' => $Opts->{'format'} ); # calc total seqs for progress printout (for debug!) # (varies by seqfile format; right now only setup for fastq and fasta formats) if ( $Opts->{'format'} eq 'fasta' ) { $total_seqs = scalar grep( "^>", file_contents_to_array($Opts->{'filename'}) ); } elsif ( $Opts->{'format'} eq 'fastq' ) { $total_seqs = scalar grep( "^@[^@]", file_contents_to_array($Opts->{'filename'}) ); } else { $total_seqs = 1000; } # default value. $one_percent_seqs = roundnum( $total_seqs / 100 ); print "Total Seqs: ", $total_seqs, "\n"; #(for debug!) print "One percent: ", $one_percent_seqs, "\n"; print "Progress: "; # read next_seq, check (calculated or passed) percent, write_seq if within bounds if ( $Opts->{'percent'} > 0.0 ) { while( my $seq = $inSeq->next_seq ) { $seq_count++; if ( int(rand(101)) <= $Opts->{'percent'}) { $outSeq->write_seq($seq); } print "." if ( ( $seq_count % $one_percent_seqs ) == 0 ) ; } print "\n"; } print "-> Ending Time: ". localtime(time) ."\n"; # =item roundnum() # # This function is passed a scalar with numeric value, then # returns rounded off value based on first decimal value. # If passed scalar is in scientific notation (e.g. 4.34344-05) # then it returns 0. # =cut sub roundnum { my ( $num ) = shift; my ( $int, $decimal ) = split(/\./,$num); my ( $round ); if ( $decimal =~ /-/ ) { # num is sci notation! $round = 0; } else { # check first decimal place < 5 $round = substr($decimal, 0, 1) < 5 ? $int : $int + 1 ; } return $round; } # =item file_contents_to_array() # Read contents of file (passed filename) into an array of chomped lines # e.g. my @linearr = file_contents_to_array($filename); # =cut sub file_contents_to_array { my ( $filename ) = @_; my $FILE = fileopen( "<$filename" ); my @lines; while ( my $line = <$FILE> ) { chomp $line; push @lines, $line; } return @lines; } # =item fileopen() # open file by name, return filehandle scalar # e.g. my $filehandle = fileopen(">".$filename); # =cut sub fileopen { open my $fh, "@_" or die "$0 : failed to open file '@_' : $!\n"; return $fh; } =pod =head1 NAME reduce_seqfile_coverage.pl =head1 SYNOPSIS B<reduce_seqfile_coverage.pl> B<-s sequence_filename> [-d integer | -p decimal or integer] [-o output file name prefix] [-f file format (e.g. fastq)] -s filename of the sequencing file to pick the fraction from -d denominator of the fraction for sequence subset size ( i.e. '-n 4'=1/4, 25=1/25, etc ), default: 4 -p percentage of the sequence file to keep (This having a value higher than '0' takes precedence over any passed denominator!) -o prefix for output filename, default: fractional -f format of sequence input and output sequence files, default: fastq IMPORTANT: Do NOT use on Paired End sequence files! This script randomly selects from the sequence data!