[mira_talk] Re: BAC vector sequece masking for de novo assembly using PacBio C2

  • From: Juan Pascual Anaya <jpascualanaya@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 8 Dec 2013 21:16:25 +0900

Hello again, after a very loooong while. I haven't been working on this due
to other more urgent projects I had to do. Finally, I had sometime to write
a script to eliminate the sequence of the BAC vector (please, check below
the e-mail thread to remember), and was able to prepare some data and a
manifest file for MIRA4.

Just to summarize the pipeline of what I've done:

1) Self-correction of more than 100X coverage of PacBio reads (with C2 and
P4)

2) After self-correction, I get ~29.5X coverage with error-corrected-PacBio
reads

3) Align the reads to the sequence of the vectors with SSAHA2, output in
gff format
ssaha2 -best 1 -output gff vector.fasta pacbio.ALL_fastq >
pacbio.ALL_pCC1BAC.gff

4) Delete sequence of the vector, and in case it is in the middle of a
read, split the read into 2, 5' read and 3' read (>300 bp), without vector
sequence. For this, I use a perl script (find attached). I am new to
programming, so probably the code is very nasty... After this, I finally
get ~27.4X coverage, in principle enough to try an assembly..

And now my "problems" start with MIRA. I know that the genomic sequence of
this beast, the hagfish, contains nasty repetitive elements, and thus I got
high ratio of megahubs, ~1.5% in the first passes. When I got this
abortion, I reduced the -HS:nrr parameter (usually 100 for genome
assemblies, right?) to 50. Since this failed, I increased the -SK:mmhr to 2
(leaving -HS:nrr as default). I then got a 4% of megahubs, and then
increased the -SK:mmhr to 5 (find attached also the manifest file).

Now it's been running for a few hours without any complain, but I wonder
how good is what I'm doing... My goal is to get those repeats also
assembled, since they seem important for what I want to study (the Hox
clusters of the hagfish; Hox clusters are usually repetitive
sequence-free...).

Any advices?

Thanks in advance!
Champi


On Mon, May 27, 2013 at 9:23 AM, Juan Pascual Anaya <jpascualanaya@xxxxxxxxx
> wrote:

> thank you guys for all this information!
>
> I'll reply point by point.
>
> @bastien & peter: 1234xxxxxxxxxxx5678 is exactcly what I have, and taking
> into account that xxxxxxx is the sequence of the BAC vector (not the
> sequencing platform vector), 1234 and 5678 are kind of paired, yes... but
> "absolute" paired, since they are the extremes of the BAC insert (~120 Kb
> long). I could use that information somehow to see if the final assembly is
> correct (i.e., the sequence of 1234 and 5678 should be at the end, but in
> reverse-complement for one of them: e.g.,
> 5'end-5678yyyyyyyyyyyyyyyyyyyyy4321-3'end.
>
> For me, just treating those 1234 and 5678 as separate read would be
> enough, I think.
>
> @Juan: Sequencing every BAC independently would have required to bar-code
> each sample, and to make a library per each. I pooled the BACs to save
> money. I'll do exactly what you propose, align with SSAHA2 and use the
> coordinated to split the extrems into two reads.
>
> @Evan: I don't think you can get raw PacBio reads. Anyway, I have no
> problem with the vector/adaptors used for sequencing, but with my specific
> BAC vector (I have up to 15 clones from a BAC library that would like to
> sequence). There are several datasets to download in the PacBio webpage:
> http://pacbiodevnet.com/
>
> Having said this, my level of programming is (below) beginner, so it will
> take time until I get to write some decent script, but as soon as I do it,
> I'll post it here so everyone can use it.
>
> Thank you very much guys!!
> Champi (eveyone calls me like that, so can you :)
>
> PS: I have gotten a long reply in Seqanswers that seems interesting if you
> want to have a look:
> http://seqanswers.com/forums/showthread.php?t=30439&referrerid=23528
>



-- 

Juan Pascual-Anaya, PhD
Research Scientist
Laboratory for Evolutionary Morphology
Center for Developmental Biology (CDB) RIKEN
2-2-3 Minatojima-minamimachi
Chuo-ku, Kobe, Hyogo 650-0047
Japan
# Manifest file for assembly of pacbio corrected, trim and vector masked reads 
for MixBACs of Hox genes of E.burgeri

# First part: defining some basic things
# We just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode

project = MixBACs.ALLreads.miraasm
job = genome,denovo,accurate
parameters = COMMON_SETTINGS -GE:not=12 -SK:mmhr=5
parameters = PCBIOHQ_SETTINGS -CL:pec=no


# The second part defines the sequencing data MIRA should load and assemble

readgroup = MixBACs_corrected.trim.novector_reads
data = 
/home/champi/Documents/Hagfish/pacbio/MixBACs/assemblies/runCA_correct-mask-asm/pacbio_underscored.trim-split.fastq
technology = pcbiohq
#! /usr/bin/perl -w

################################################################################
#
#                       delete_mased_seq.pl
#
#   Program designed to eliminate masked sequence from reads, in gff format,
# masked using SSAHA2. If the alignment is in the middle of the read, the
# sequences at both sides of the masked one will be splitted into two reads.
#
#   Author: Juan Pascual-Anaya
#   jpascualanaya@xxxxxxxxx
#   Date: 2013.June.25th
#   Last modification: 2013.December.7th
#
################################################################################



use strict;
use warnings;

my $usage="USAGE: perl delete_masked_seq.pl reads_file.fq SSAHA2_alignment.gff";

my $reads_filename=$ARGV[0];
chomp $reads_filename;

my $alignment_filename=$ARGV[1];
chomp $alignment_filename;

# Do the files exist?
unless ( -e $reads_filename) {
    print "File \"$reads_filename\" doesn\'t seem to exist!!\n$usage\n\n";
    exit;
}

unless ( -e $alignment_filename) {
    print "File \"$alignment_filename\" doesn\'t seem to exist!!\n$usage\n\n";
    exit;
}

# Open alignment file in gff format from SSAHA2 output
unless (open (ALIGNMENT, $alignment_filename)) {
    print "Cannot open file \"$alignment_filename\"\n\n";
    exit;
}

# Define variables

my @alignm_match;
my $read_name;
my @coordinates;
my @sorted_coordinates;
my %start_coord;
my %end_coord;
my $begin_read='@';

<ALIGNMENT>;
while (<ALIGNMENT>){
    if (s/^gff\: //){
        #Take the line and split it using the tab separator
        @alignm_match=split(/\t/);
    
        #Take the first column as the name of the read
        $read_name=$begin_read.$alignm_match[0];
    
        # Take the start (3rd column) and end (4th column) coordinate of the 
matching read sequence and sort them,
        # since they can be in the other way around in complementary strand 
matchings
        @coordinates=($alignm_match[3],$alignm_match[4]);
        @sorted_coordinates = sort { $a <=> $b } @coordinates;
    
        #Introduce the start and end coordinates into hashes, associated to the 
read name as key
        $start_coord{$read_name}=$sorted_coordinates[0];
        $end_coord{$read_name}=$sorted_coordinates[1];
        
    } else {next;}
}

close ALIGNMENT;

# Can we open the file with the reads?
unless ( open (READSFILE, $reads_filename) ) {
    print "Cannot open file \"$reads_filename\"\n\n$usage\n\n";
    exit;
}

# Define input separator for reads file: first, read the first line of the file 
into the variable $h,
# then assign the @ symbol of the fastq header plus the three subsequent 
letters as the regex and
# assign it to the input separator. Then, close the file.

my $h=<READSFILE>;
($/)=$h=~/^(\@.{3})/;
close READSFILE;

# Open the file again to reinitialize the file parsing from the first line

unless ( open (READSFILE, $reads_filename) ) {
    print "Cannot open file \"$reads_filename\"\n\n$usage\n\n";
    exit;
}

# Define variables of fastq file: header of the read, the sequence, the quality 
header (+), and the quality scores
my $header='';
my $seq='';
my $qheader='';
my $qscore='';
my $total=0;
my $del=$/;
my $root;
my $seq_5prime;
my $qscore_5prime;
my $seq_3prime;
my $qscore_3prime;


# Create the output file, using the same name as the one of the input file
($root)=$ARGV[0]=~/(.+)\./;

unless (open (OUTPUT, ">$root-split.fastq") ){
    print "Cannot create an output file\n\n";
    exit;
}

# Parse the reads one by one, check if it exists in the gff file
<READSFILE>;
while (<READSFILE>){
    /\n(.+?)\n(.+?)\n(.+)/;
    $header="$del$`";
    $seq=$1;
    $qheader=$2;
    $qscore=$3;
    # $seq=~s/[^A-Za-z]//g; No se limpia la secuencia en un fastq para no 
variar el numero de caracteres del string, ya que tiene
    #que ser el mismo que en el string de las quality scores.
    
    # Increase the count of parsed reads
    $total++;
    
    if ($start_coord{$header}) {
        $seq_5prime=substr($seq,0,0+$start_coord{$header}-1);
        $qscore_5prime=substr($qscore,0,0+$start_coord{$header}-1);
        $seq_3prime=substr($seq,$end_coord{$header}, ((length $seq) - 
$end_coord{$header}));
        $qscore_3prime=substr($qscore,$end_coord{$header}, ((length $qscore) - 
$end_coord{$header}));
        
        if (length $seq_5prime > 300) {
            print OUTPUT "$header\/up\n$seq_5prime\n$qheader\n$qscore_5prime\n";
        }
        
        if (length $seq_3prime > 300) {
            print OUTPUT 
"$header\/down\n$seq_3prime\n$qheader\n$qscore_3prime\n";
        }
    }else{
        print OUTPUT "$header\n$seq\n$qheader\n$qscore\n";
    }
}

print "\nParsed $total reads\n\n";

exit;

Other related posts: