I would expect ~75% of the reads to be in the correct RF orientation. That is a VERY long insert size that it is calculating.

---
Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org

From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Thursday, 10 October 2013 8:58 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

Perhaps this warrants a new thread but for now I'll risk a reprimand ;-)

FYI - I am still working with my cutadapt sequences, but after looking over the trimming results very few of the reads were actually trimmed (maybe 2% or so).

With the gel free nextera mate pair libraries, what percentage of reads are actually in the RF orientation? Based on the technical aspects of the procedure I would think it should be pretty high. However, when I'm running SSPACE it seems that a lot of the read pairs are unsatisfactory. Within contigs a lot are being flagged due to pairing logic, but between contig pairs a lot fail the calculated distance. So I guess my second question is: what distances do you use in SSPACE for the gel free libraries? (See the example below.)

I suspect I'm just not using the correct scaffolding parameters, but I've been trying a lot of variables and all of the results still seem a little weird. So I'd be curious to know what others are using/doing.

Shaun

READ PAIRS STATS:
Assembled pairs: 408655 (817310 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 10000 +/-9000): 1031
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 442
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 4031
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 151973
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 251178
---
Total satisfied: 153004 unsatisfied: 255651

Estimated insert size statistics (based on 1473 pairs):
Mean insert size = 14498
Median insert size = 8327

REPEATS:
Number of repeated edges = 64

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-09 04:55 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

Sorry, it's reliant on the following perl module:
http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

I'll send usage information when I get into the office.

From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Thursday, 10 October 2013 3:12 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

So I'm not off to a great start.
I get the following error:

Can't locate String/Approx.pm in @INC (@INC contains: /opt/smrtanalysis/analysis/lib /opt/cg-pipeline/lkatz/dependencies/tRNAscan/bin /opt/cg-pipeline/lkatz/dependencies/cpanlib/lib/perl5/x86_64-linux-thread-multi /opt/cg-pipeline/lkatz/dependencies/cpanlib/lib/perl5 /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at process_nextera_matepairs_variable_readlength.pl line 30.
BEGIN failed--compilation aborted at process_nextera_matepairs_variable_readlength.pl line 30.

Also, could you clarify the command usage?

Thanks.

Shaun

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-08 04:50 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

Well, my script is aware of both reads. It just gathers some statistics on the way through. It was written for my lab, from the information in the technote you posted. It is not an sff_extract variant: it looks at both reads of a pair, determines where the internal adaptor falls (using slightly fuzzy matching), and chooses the most likely mate pair fragment from that information. Well, at least that's what it's done the last dozen times I've run it.
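The adaptor-location step described here can be sketched roughly as below. This is not the Perl script from the thread (which relies on String::Approx); it is a minimal Python illustration that assumes "slightly fuzzy matching" means tolerating a few mismatches in an ungapped comparison. The adaptor string `ACGTTGCA` and the names `find_adapter` and `split_at_adapter` are hypothetical placeholders; the real junction adaptor sequence should be taken from the Illumina technote.

```python
def find_adapter(read, adapter, max_mismatches=2):
    """Return the 0-based start of the best ungapped adaptor match, or -1.

    Assumption: fuzzy matching is modelled as a Hamming-distance scan,
    which ignores insertions and deletions.
    """
    best_pos = -1
    best_mm = max_mismatches + 1
    for i in range(len(read) - len(adapter) + 1):
        window = read[i:i + len(adapter)]
        mm = sum(1 for a, b in zip(window, adapter) if a != b)
        if mm < best_mm:  # keep the first position with the fewest mismatches
            best_pos, best_mm = i, mm
    return best_pos


def split_at_adapter(read, adapter, max_mismatches=2):
    """Split a read into the parts before and after the adaptor.

    If no adaptor is found, the whole read is returned as the first part.
    """
    pos = find_adapter(read, adapter, max_mismatches)
    if pos < 0:
        return read, ""
    return read[:pos], read[pos + len(adapter):]


# Hypothetical toy adaptor, with one mismatch introduced in the read.
TOY_ADAPTER = "ACGTTGCA"
before, after = split_at_adapter("AAAACCCCACGTAGCAGGGGTTTT", TOY_ADAPTER)
print(before, after)
```

A real implementation would also need to handle adaptors truncated at the end of a read and to decide which fragment to keep when both reads of a pair contain the adaptor; neither is attempted here.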
Data processed in this way has been used to super-scaffold metagenome assemblies, and SSPACE output correlates well with binning information of contigs from other in-house tools (i.e., more contigs are joined within bins than between bins). We have used it on both gel extracted and gel free libraries.

Cheers

From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Wednesday, 9 October 2013 12:00 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

I'll give it a shot but .........

The way you are proposing to use the data is what is typically done for 454 mate pair libraries, where you only have the single read. sff_extract does essentially what your script apparently does. This may very well work, but it is not the way the data is intended to be used:

http://res.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf

I also suspect that very few mate pairs will be found. The final libraries range from about 400 bp - 2000 bp with a mean around 900 bp. So even with 2 x 250 bp reads the likelihood of sequencing across the adapter is pretty slim.
Shaun

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-07 09:35 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

You mention using cutadapt: "I used cutadapt to trim off the internal adapter and anything after".

This sounds bad, to my ears. The part after the internal adaptor (if it exists) is the second read.

Although I'm quite embarrassed by my perl skills, here is a perl script that I've written to process nextera matepair raw data. Please don't laugh too hard.

https://github.com/jasteen/nextera_matepairs

There are two scripts. The one ending in "_variable_lengths" simply takes an extra command line parameter in case your data isn't 250 bp data.

Can you put your raw data through processing and return the output? Maybe take the output processed reads and try SSPACE again too.

Cheers
Jason

---
Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org

From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Tuesday, 8 October 2013 8:30 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

I'm hoping I'm just doing something stupid because the scaffolding just doesn't make sense. Here's a bit of background.
The initial assemblies were hybrids with 454 titanium and MiSeq paired end (2 x 250 bp). We were interested in how the nextera mate pair performed, so we re-sequenced 10 isolates for scaffolding. These were done using the Gel Free protocol, so the insert distribution is rather large. I've included the post-tagmentation traces for the 2 strains I've been comparing to give you an idea of the insert span prior to circularization. After sequencing I used cutadapt to trim off the internal adapter and anything after from the reads.

(See attached file: sspace_maps.docx)

Note: These 2 isolates have identical PFGE patterns and differ from each other by just over 300 SNPs when compared to the reference sequence. They also appear identical using a number of other typing methods.

The only switches I've used in running SSPACE are the following:

-k 5 -a 0.7 -g 3 -T 4 -x 0

I didn't want to extend and merge contigs because I was unsure of what might happen with repeats. And the mate file looks like this:

lib1 2003-069_M1.fastq 2003-069_M2.fastq 10000 0.9 RF

This allows for inserts between 1-19 kb, which was the only way I could get much of a reduction in the number of contigs.

I then did some assessments using Mauve in order to judge synteny and consistency:

- first 2 panels show the original assemblies compared to the reference sequence.
- next is the SSPACE scaffold for one strain against the reference sequence (notice the decrease in synteny).
- the largest scaffold accounted for approx 50% of the genome, so I compared that to the reference to get a clearer picture.
- then I used that large scaffold contig as a basis for ordering the contigs of the second strain.
- and finally used the same scaffold contig to order the SSPACE contigs of the second strain.

I should also mention that these are neisseria spp, which are prone to a lot of recombination, but my understanding is that it is more homologous substitutions as opposed to intra-genomic rearrangements.
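The 1-19 kb window quoted for the library line can be reproduced with a few lines of Python. This is only a sketch: it assumes the fourth field is the expected insert size and the fifth is the allowed deviation as a fraction of it, which is what the "10000 +/-9000" in the SSPACE summaries suggests. The function name `insert_window` is made up for illustration.

```python
def insert_window(library_line):
    """Return (low, high) accepted insert sizes for one library-file line.

    Assumed field order, as in the line quoted in the message:
    name, reads file 1, reads file 2, insert size, error fraction, orientation.
    """
    fields = library_line.split()
    insert_size = int(fields[3])
    error = float(fields[4])
    delta = round(insert_size * error)  # allowed deviation in bp
    return insert_size - delta, insert_size + delta


low, high = insert_window("lib1 2003-069_M1.fastq 2003-069_M2.fastq 10000 0.9 RF")
print(low, high)  # 1000 19000
```

With a window this wide, almost any pair spanning two contigs is within bounds on distance alone, so it may be worth checking how the result changes with a smaller error fraction.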
So considering the high degree of similarity between these two strains by other methods, I'm having difficulty accepting the resulting scaffolds.

Also, the documentation for SSPACE is a little sparse and I'm having difficulty really understanding some of the outputs. In the summary I typically see this type of output:

LIBRARY lib1 STATS:
################################################################################

MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 898816
Number of pairs used for pairing contigs / total pairs = 197426 / 197426
------------------------------------------------------------

READ PAIRS STATS:
Assembled pairs: 197426 (394852 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 10000 +/-9000): 752
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 508
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 3643
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 46814
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 145709
---
Total satisfied: 47566 unsatisfied: 149860

Estimated insert size statistics (based on 1260 pairs):
Mean insert size = 27823
Median insert size = 10632

REPEATS:
Number of repeated edges = 15

I take it that a lot of the data has some sort of issue, hence the "unsatisfied", but does that mean it's not used? (I assume so.) And why would so much be unsatisfactory?? Could it be that the majority of the nextera mate pair reads are not really mate pairs (i.e. RF orientation) but normal paired end reads ????
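To put a number on "so much", the share of satisfied pairs can be pulled straight from the summary text. The snippet below is a small sketch that assumes the "Total satisfied: ... unsatisfied: ..." line keeps the wording shown above; `satisfied_fraction` is an illustrative name, not part of SSPACE.

```python
import re

# The two relevant lines, copied from the summary quoted in the message.
SUMMARY = """\
Assembled pairs: 197426 (394852 sequences)
Total satisfied: 47566 unsatisfied: 149860
"""


def satisfied_fraction(summary):
    """Return satisfied / (satisfied + unsatisfied) from a summary string."""
    match = re.search(r"Total satisfied:\s*(\d+)\s+unsatisfied:\s*(\d+)", summary)
    if match is None:
        raise ValueError("no 'Total satisfied' line found")
    satisfied = int(match.group(1))
    unsatisfied = int(match.group(2))
    return satisfied / (satisfied + unsatisfied)


print(f"{satisfied_fraction(SUMMARY):.1%} of assembled pairs satisfied")
```

For this library that is roughly a quarter of the assembled pairs, and nearly all of the unsatisfied ones are out-of-bounds distances between contig pairs rather than orientation failures within contigs.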
Also, in the evidence file I see things like this:

>scaffold3|size544014|tigs13
r_tig32|size48304|links14|gaps-3188
f_tig18|size33213|links16|gaps-4579
f_tig46|size11990|links14|gaps-3172
f_tig12|size52012|links19|gaps-5385
f_tig10|size85701|links14|gaps-3798
r_tig9|size66090|links18|gaps-2378
r_tig26|size50117|links12|gaps-2939
f_tig43|size11986|links9|gaps577
r_tig28|size18850|links15|gaps-4130
r_tig20|size27631|links18|gaps-4682
f_tig8|size63306|links17|gaps-4055
f_tig38|size36077|links8|gaps-3291
r_tig17|size38149

Based on the negative gaps I'm assuming it "merged" contigs, which I didn't want to do. After all, Mira kept them separate for a reason. Most of these scaffolds don't have any N's in them, which would also indicate a merger. I repeated this particular scaffolding adding -n 6000, which is supposed to indicate the minimum overlap needed to merge contigs, but the results were the same as without it.

So all in all I don't know what to think, other than I don't trust the scaffolds I'm getting. But is it the software, the data or the user that should be blamed?

Shaun

***********************************************
Shaun Tyler
National Microbiology Laboratory | Laboratoire national de microbiologie
Public Health Agency of Canada | Agence de la santé publique du Canada
Canadian Science Centre for Human and Animal Health | Centre scientifique canadien de la santé humaine et animale
Winnipeg, Canada R3E 3P6
shaun.tyler@xxxxxxxxxxxxxxx
Telephone | Téléphone 204-789-6030 / Facsimile | Télécopieur 204-789-2018
Government of Canada | Gouvernement du Canada
From: David Coil <coil.david@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2013-10-07 11:23 AM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

I second the motion for having this discussion here as long as Bastien doesn't mind. I think a lot of MIRA users are using SSPACE.

David

On Mon, Oct 7, 2013 at 8:57 AM, Adrian Pelin <apelin20@xxxxxxxxx> wrote:

It doesn't need to go here, but I think it is still very relevant, since we would be talking about scaffolding a mira assembly. I have not yet found how to get sspace to work nicely. I tried forcing it to scaffold multiple times using the same assembly, but I'm not sure how reliable those results are.

Sincerely,
Adrian

On Oct 7, 2013, at 11:39 AM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:

I have some questions about optimising parameter settings for SSPACE that probably don't need to go to this group. If you don't mind discussing, drop me a line at shaun.tyler@xxxxxxxxxxxxxxx.

Shaun

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-01 10:53 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

Sspace is good. The free version does a nice job using nextera matepair data that we have generated.
---
Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org

From: Walter, Mathias <mathias@xxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Wednesday, 2 October 2013 4:58 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

You can also try Opera and the scaffolder of SOAPdenovo.

--
Kind regards,
Mathias

2013/10/1 Bastien Chevreux <bach@xxxxxxxxxxxx>:

On Oct 1, 2013, at 19:30 , Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:

We started playing with the newish Nextera mate-pair libraries and I'm wondering what people are doing for scaffolding MIRA assemblies. Right now I'm starting to look at SSpace and Bambus2 (Amos). Any comments, suggestions, advice ??

Feedback from people here and off-list suggests SSpace to be a good first choice, as it seems to be a lot easier to handle and get data into it than into Bambus.

B.

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html