Hello again, after a very loooong while. I haven't been working on this due to other more urgent projects I had to do. Finally, I had sometime to write a script to eliminate the sequence of the BAC vector (please, check below the e-mail thread to remember), and was able to prepare some data and a manifest file for MIRA4. Just to summarize the pipeline of what I've done: 1) Self-correction of more than 100X coverage of PacBio reads (with C2 and P4) 2) After self-correction, I get ~29.5X coverage with error-corrected-PacBio reads 3) Align the reads to the sequence of the vectors with SSAHA2, output in gff format ssaha2 -best 1 -output gff vector.fasta pacbio.ALL_fastq > pacbio.ALL_pCC1BAC.gff 4) Delete sequence of the vector, and in case it is in the middle of a read, split the read into 2, 5' read and 3' read (>300 bp), without vector sequence. For this, I use a perl script (find attached). I am new to programming, so probably the code is very nasty... After this, I finally get ~27.4X coverage, in principle enough to try an assembly.. And now my "problems" start with MIRA. I know that the genomic sequence of this beast, the hagfish, contains nasty repetitive elements, and thus I got high ratio of megahubs, ~1.5% in the first passes. When I got this abortion, I reduced the -HS:nrr parameter (usually 100 for genome assemblies, right?) to 50. Since this failed, I increased the -SK:mmhr to 2 (leaving -HS:nrr as default). I then got a 4% of megahubs, and then increased the -SK:mmhr to 5 (find attached also the manifest file). Now it's been running for a few hours without any complain, but I wonder how good is what I'm doing... My goal is to get those repeats also assembled, since they seem important for what I want to study (the Hox clusters of the hagfish; Hox clusters are usually repetitive sequence-free...). Any advices? Thanks in advance! Champi On Mon, May 27, 2013 at 9:23 AM, Juan Pascual Anaya <jpascualanaya@xxxxxxxxx > wrote: > thank you guys for all this information! > > I'll reply point by point. > > @bastien & peter: 1234xxxxxxxxxxx5678 is exactcly what I have, and taking > into account that xxxxxxx is the sequence of the BAC vector (not the > sequencing platform vector), 1234 and 5678 are kind of paired, yes... but > "absolute" paired, since they are the extremes of the BAC insert (~120 Kb > long). I could use that information somehow to see if the final assembly is > correct (i.e., the sequence of 1234 and 5678 should be at the end, but in > reverse-complement for one of them: e.g., > 5'end-5678yyyyyyyyyyyyyyyyyyyyy4321-3'end. > > For me, just treating those 1234 and 5678 as separate read would be > enough, I think. > > @Juan: Sequencing every BAC independently would have required to bar-code > each sample, and to make a library per each. I pooled the BACs to save > money. I'll do exactly what you propose, align with SSAHA2 and use the > coordinated to split the extrems into two reads. > > @Evan: I don't think you can get raw PacBio reads. Anyway, I have no > problem with the vector/adaptors used for sequencing, but with my specific > BAC vector (I have up to 15 clones from a BAC library that would like to > sequence). There are several datasets to download in the PacBio webpage: > http://pacbiodevnet.com/ > > Having said this, my level of programming is (below) beginner, so it will > take time until I get to write some decent script, but as soon as I do it, > I'll post it here so everyone can use it. > > Thank you very much guys!! > Champi (eveyone calls me like that, so can you :) > > PS: I have gotten a long reply in Seqanswers that seems interesting if you > want to have a look: > http://seqanswers.com/forums/showthread.php?t=30439&referrerid=23528 > -- Juan Pascual-Anaya, PhD Research Scientist Laboratory for Evolutionary Morphology Center for Developmental Biology (CDB) RIKEN 2-2-3 Minatojima-minamimachi Chuo-ku, Kobe, Hyogo 650-0047 Japan
# Manifest file for assembly of pacbio corrected, trim and vector masked reads for MixBACs of Hox genes of E.burgeri # First part: defining some basic things # We just give a name to the assembly # and tell MIRA it should assemble a genome de-novo in accurate mode project = MixBACs.ALLreads.miraasm job = genome,denovo,accurate parameters = COMMON_SETTINGS -GE:not=12 -SK:mmhr=5 parameters = PCBIOHQ_SETTINGS -CL:pec=no # The second part defines the sequencing data MIRA should load and assemble readgroup = MixBACs_corrected.trim.novector_reads data = /home/champi/Documents/Hagfish/pacbio/MixBACs/assemblies/runCA_correct-mask-asm/pacbio_underscored.trim-split.fastq technology = pcbiohq
#! /usr/bin/perl -w ################################################################################ # # delete_mased_seq.pl # # Program designed to eliminate masked sequence from reads, in gff format, # masked using SSAHA2. If the alignment is in the middle of the read, the # sequences at both sides of the masked one will be splitted into two reads. # # Author: Juan Pascual-Anaya # jpascualanaya@xxxxxxxxx # Date: 2013.June.25th # Last modification: 2013.December.7th # ################################################################################ use strict; use warnings; my $usage="USAGE: perl delete_masked_seq.pl reads_file.fq SSAHA2_alignment.gff"; my $reads_filename=$ARGV[0]; chomp $reads_filename; my $alignment_filename=$ARGV[1]; chomp $alignment_filename; # Do the files exist? unless ( -e $reads_filename) { print "File \"$reads_filename\" doesn\'t seem to exist!!\n$usage\n\n"; exit; } unless ( -e $alignment_filename) { print "File \"$alignment_filename\" doesn\'t seem to exist!!\n$usage\n\n"; exit; } # Open alignment file in gff format from SSAHA2 output unless (open (ALIGNMENT, $alignment_filename)) { print "Cannot open file \"$alignment_filename\"\n\n"; exit; } # Define variables my @alignm_match; my $read_name; my @coordinates; my @sorted_coordinates; my %start_coord; my %end_coord; my $begin_read='@'; <ALIGNMENT>; while (<ALIGNMENT>){ if (s/^gff\: //){ #Take the line and split it using the tab separator @alignm_match=split(/\t/); #Take the first column as the name of the read $read_name=$begin_read.$alignm_match[0]; # Take the start (3rd column) and end (4th column) coordinate of the matching read sequence and sort them, # since they can be in the other way around in complementary strand matchings @coordinates=($alignm_match[3],$alignm_match[4]); @sorted_coordinates = sort { $a <=> $b } @coordinates; #Introduce the start and end coordinates into hashes, associated to the read name as key $start_coord{$read_name}=$sorted_coordinates[0]; $end_coord{$read_name}=$sorted_coordinates[1]; } else {next;} } close ALIGNMENT; # Can we open the file with the reads? unless ( open (READSFILE, $reads_filename) ) { print "Cannot open file \"$reads_filename\"\n\n$usage\n\n"; exit; } # Define input separator for reads file: first, read the first line of the file into the variable $h, # then assign the @ symbol of the fastq header plus the three subsequent letters as the regex and # assign it to the input separator. Then, close the file. my $h=<READSFILE>; ($/)=$h=~/^(\@.{3})/; close READSFILE; # Open the file again to reinitialize the file parsing from the first line unless ( open (READSFILE, $reads_filename) ) { print "Cannot open file \"$reads_filename\"\n\n$usage\n\n"; exit; } # Define variables of fastq file: header of the read, the sequence, the quality header (+), and the quality scores my $header=''; my $seq=''; my $qheader=''; my $qscore=''; my $total=0; my $del=$/; my $root; my $seq_5prime; my $qscore_5prime; my $seq_3prime; my $qscore_3prime; # Create the output file, using the same name as the one of the input file ($root)=$ARGV[0]=~/(.+)\./; unless (open (OUTPUT, ">$root-split.fastq") ){ print "Cannot create an output file\n\n"; exit; } # Parse the reads one by one, check if it exists in the gff file <READSFILE>; while (<READSFILE>){ /\n(.+?)\n(.+?)\n(.+)/; $header="$del$`"; $seq=$1; $qheader=$2; $qscore=$3; # $seq=~s/[^A-Za-z]//g; No se limpia la secuencia en un fastq para no variar el numero de caracteres del string, ya que tiene #que ser el mismo que en el string de las quality scores. # Increase the count of parsed reads $total++; if ($start_coord{$header}) { $seq_5prime=substr($seq,0,0+$start_coord{$header}-1); $qscore_5prime=substr($qscore,0,0+$start_coord{$header}-1); $seq_3prime=substr($seq,$end_coord{$header}, ((length $seq) - $end_coord{$header})); $qscore_3prime=substr($qscore,$end_coord{$header}, ((length $qscore) - $end_coord{$header})); if (length $seq_5prime > 300) { print OUTPUT "$header\/up\n$seq_5prime\n$qheader\n$qscore_5prime\n"; } if (length $seq_3prime > 300) { print OUTPUT "$header\/down\n$seq_3prime\n$qheader\n$qscore_3prime\n"; } }else{ print OUTPUT "$header\n$seq\n$qheader\n$qscore\n"; } } print "\nParsed $total reads\n\n"; exit;