[mira_talk] Re: mira denovo assembly of solexa reads

  • From: Lionel Guy <guy.lionel@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 17 Jan 2011 15:51:09 +0100

I would take between an 8th and a 5th of your data. You could then either map 
the rest of your data to the de novo assembly to increase the quality, or do 
multiple assemblies with multiple subsets and compare the resulting assemblies.

To subset (provided that you have fastq files, one line per sequence), you can:
- use 'head -n X file1.fastq > file1_filt.fastq', where X is the number of 
reads you want * 4 (don't forget to do that on both files).
- use a script to do that. Attached is the perl script I use to do that. One 
possible command would be
perl fastqSampler.pl -q file1.fastq -l 8 > filt1_filt.fastq

Cheers,

Lionel

#!/usr/bin/perl -w
use strict;
use Getopt::Std;

=head1 SYNOPSIS

fastqSampler.pl - Subsample a fastq file

=head1 USAGE

fastqSampler.pl [-l loop] [-n numbers] [-f format] -q fastq_file [-b]

=head1 INPUT

=head2 -l loop

Numeric. Size of the loop (i.e, take x samples every loop samples). 2 by 
default.

=head2 -n number

One or several integers separated by commas (e.g. 4,5,9). In combination with 
the previous, gives: every loop samples, take n1,n2,n3. 1 by default.

=head2 -q fastq_file

A Solexa fastq file (i.e. quality offset starts at 64)

=head2 -b

Use Bioperl? (slow)

=head1 OUTPUT

Fasta and fasta qual files

=head1 AUTHOR

Lionel Guy (lionel.guy@xxxxxxxxx)

=head1 DATE

Wed Dec  9 09:33:47 CET 2009

=cut

# arguments
our $opt_l = 2;
our $opt_n = 1;
our $opt_q;
our $opt_b;
getopts('q:l:n:b');
usage() unless $opt_q;
my @n = split(/,/, $opt_n);
my %n;
# check that none of the numbers is greater than n, put it into a hash
print STDERR "Subsetting sequences " . join(", ", @n) . " every $opt_l\n";
foreach (@n){
    die "$_ is bigger than $opt_l\n" if ($_ > $opt_l);
    $n{$_} = 1;
}

# counters
my $count = 0;
my $n = 0;

# non-bioperl system
if (!$opt_b){
    # change record separator
    local $/ = '@';
    # open files
    open IN, $opt_q || die;
    # discard initial line
    <IN>;
    my $i = 0;
    while (<IN>){
        chomp;
        $i++;
        $count++;
        print STDERR "\r" . ($count/1000) . " k sequences read" 
            unless ($count % 1000);
        if ($n{$i}){
            $n++;
            #print "$n: $i -> $count\n";
            print "@" . $_;
        }
        # reset loop
        $i = 0 if $i >= $opt_l;
    }
    print STDERR "\r$count sequences read, $n written\n";
}
##############
# legacy Bioperl loop
##############
else {
    use Bio::SeqIO;
    # open files
    my $fastq_in = Bio::SeqIO->new(-file => $opt_q,
                                   -format => 'fastq-illumina');
    my $fastq_out = Bio::SeqIO->new(-fh => \*STDOUT,
                                    -format => 'fastq-illumina');
    # loop
    while (my $seq = $fastq_in->next_seq){
        $count++;
        print STDERR "\r" . ($count/1000) . " k sequences read" 
            unless ($count % 1000);
        # calculate residue of entire division
        my $residue = $count % $opt_l;
        if ($n{$residue}) {
            $n++;
            $fastq_out->write_seq($seq);
        }
    }
    print STDERR "\r$count sequences read, $n written\n";
}


sub usage{
    system("perldoc $0");
    exit;
}

On 17 Jan 2011, at 15:21 , Saima Zubair wrote:

> miramem gave me the following comments that shows huge coverage. So how can I 
> subsample my date to limit to coverage?
> 
> The contigs will have an average coverage of ~ 399.9 (+/- 10%)
> Estimates may be way off for pathological cases.
> 
> RAM estimates:
>            reads+contigs (unavoidable): 20.5 GiB
>                 large tables (tunable): 669. MiB
>                                         ---------
>                           total (peak): 21.2 GiB
> 
>             add if using -CL:pvlc=yes : 8.0 GiB
> 
> From: Lionel Guy <guy.lionel@xxxxxxxxx>
> To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
> Sent: Mon, January 17, 2011 3:09:07 PM
> Subject: [mira_talk] Re: mira denovo assembly of solexa reads
> 
> So I guess your expected coverage is probably extremely high... If your run 
> miramem (included in mira distribution), what does it tell you? 
> 
> If it is over 70X, you may try to subsample your data, to aim at 50-70x 
> coverage and rerun mira.
> 
> HTH, 
> 
> Lionel
> 
> On 17 janv. 2011, at 14:42, Saima Zubair <saim_zub@xxxxxxxxx> wrote:
> 
>> My expected genome size is about 2.1 to 2.2 MB. The version I am using is 
>> 3.2.0
>> 
>> From: Lionel Guy <guy.lionel@xxxxxxxxx>
>> To: mira_talk@xxxxxxxxxxxxx
>> Sent: Mon, January 17, 2011 2:04:26 PM
>> Subject: [mira_talk] Re: mira denovo assembly of solexa reads
>> 
>> Hi Saima,
>> 
>> What's your genome size (expected)? If you get very, very high coverage, you 
>> could have sky-high number of skims... (although 2^31 seems pretty big). 
>> What version of mira are you using?
>> 
>> Cheers,
>> Lionel
>> 
>> On 17 Jan 2011, at 13:52 , Saima Zubair wrote:
>> 
>> > Hi,
>> > 
>> > I am doing the denovo assembly of solexa reads of 100 bps long. My data is 
>> > in the form of two mate_pair files, each file is of 3 GB. My assembly runs 
>> > strangely for more than a week and then following error occurs with 
>> > controlled program stop;
>> > 
>> > 
>> > Internal logic/programming/debugging error (*sigh* this should not have 
>> > happened).
>> > 
>> > "More than 2^31 skim hits would be taken ... this was not foreseen yet.
>> > 
>> > ->Thrown: void Assembly::reduceSkimHits4(int32 version, const string 
>> > prefix, const string postfix, const string logname)
>> > ->Caught: main
>> > 
>> > Aborting process, probably due to an internal error.
>> > 
>> > I have tried it with many different parameters but assembly is not 
>> > complete yet. Should I filter the data before running command?
>> > 
>> > 
>> 
>> 
>> --
>> You have received this mail because you are subscribed to the mira_talk 
>> mailing list. For information on how to subscribe or unsubscribe, please 
>> visit http://www.chevreux.org/mira_mailinglists.html
>> 
> 
> 

Other related posts: