Thanks. Unfortunately I got busy with other stuff today, so I haven't had a chance to give this a try. I'm going to try shutting my door tomorrow to see if that helps. Maybe they'll think I'm not here ;-)

But for now, do you have any comment on the final library size distribution that we obtained? Is this also typical of the libraries you've been using? This is our first go at using Nextera mate pairs, so I have a lot of questions on both the wet end and the analysis.

I probably don't fully understand what your script does, so I'll just have to give it a shot and see what happens. I did take a quick look at it to try to understand the logic, but to be honest most of it was gobbledygook. I'm more of a lab rat than a computer geek. I'm pretty good at figuring out how to apply analysis tools, but up until recently when people talked about perl I thought they meant those things you found in clams and things ;-)

Shaun

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-08 04:50 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

Well, my script is aware of both reads. It just gathers some statistics on the way through. It was written for my lab, from the information in the technote you posted. It is not an sff_extract variant: it looks at both reads of the pair, determines where the internal adaptor falls (using slightly fuzzy matching), and chooses the most likely mate pair fragment from that information. Well, at least that's what it's done the last dozen times I've run it.

Data processed in this way has been used to super-scaffold metagenome assemblies, and SSPACE output correlates well with binning information of contigs from other in-house tools (i.e. more contigs are joined within bins than between bins). We have used it on both gel-extracted and gel-free libraries.
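To make the "slightly fuzzy matching" idea above more concrete, here is a minimal Python sketch. It is not the nextera_matepairs perl script and makes no claim about how that script works internally. The junction adapter sequence is an assumption that should be checked against the Illumina technote, and the names `find_adapter`, `split_at_adapter` and the mismatch limit are hypothetical.

```python
# Hypothetical illustration only, not the nextera_matepairs script.
# Assumed Nextera mate pair junction adapter (verify against the technote).
JUNCTION_ADAPTER = "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"


def find_adapter(read, adapter=JUNCTION_ADAPTER, max_mismatches=2):
    """Return the leftmost index where the adapter matches the read with at
    most max_mismatches substitutions, or -1 if no such position exists."""
    for start in range(len(read) - len(adapter) + 1):
        window = read[start:start + len(adapter)]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= max_mismatches:
            return start
    return -1


def split_at_adapter(read, adapter=JUNCTION_ADAPTER, max_mismatches=2):
    """Split a read into (sequence before adapter, sequence after adapter).
    If no adapter is found, the whole read is returned as the first part."""
    start = find_adapter(read, adapter, max_mismatches)
    if start < 0:
        return read, ""
    return read[:start], read[start + len(adapter):]
```

This sketch only handles substitutions and full-length adapter hits. A partial adapter at the very end of a read, or insertions and deletions, would need extra handling.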
Cheers

From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Wednesday, 9 October 2013 12:00 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

I'll give it a shot but ...

The way you are proposing to use the data is what is typically done for 454 mate pair libraries, where you only have a single read. sff_extract does essentially what your script apparently does. This may very well work, but it is not the way the data is intended to be used:

http://res.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf

I also suspect that very few mate pairs will be found. The final libraries range from about 400 bp to 2000 bp with a mean around 900 bp. So even with 2 x 250 bp reads, the likelihood of sequencing across the adapter is pretty slim.

Shaun

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-07 09:35 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

You mention using cutadapt: "I used cutadapt to trim off the internal adapter and anything after".

This sounds bad to my ears. The part after the internal adaptor (if it exists) is the second read.

Although I'm quite embarrassed by my perl skills, here is a perl script that I've written to process Nextera mate pair raw data. Please don't laugh too hard.

https://github.com/jasteen/nextera_matepairs

There are two scripts; the one ending in "_variable_lengths" simply takes an extra command line parameter in case your data isn't 250 bp data.

Can you put your raw data through processing and return the output?
Maybe take the output processed reads and try SSPACE again too.

Cheers
Jason

---
Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org

From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Tuesday, 8 October 2013 8:30 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

I'm hoping I'm just doing something stupid, because the scaffolding just doesn't make sense. Here's a bit of background. The initial assemblies were hybrids of 454 Titanium and MiSeq paired-end (2 x 250 bp) data. We were interested in how the Nextera mate pairs performed, so we re-sequenced 10 isolates for scaffolding. These were done using the Gel Free protocol, so the insert distribution is rather large. I've included the post-tagmentation traces for the 2 strains I've been comparing to give you an idea of the insert span prior to circularization. After sequencing I used cutadapt to trim off the internal adapter and anything after it from the reads.

(See attached file: sspace_maps.docx)

Note: These 2 isolates have identical PFGE patterns and differ from each other by just over 300 SNPs when compared to the reference sequence. They also appear identical using a number of other typing methods.

The only switches I've used in running SSPACE are the following:

-k 5 -a 0.7 -g 3 -T 4 -x 0

I didn't want to extend and merge contigs because I was unsure of what might happen with repeats. And the mate file looks like this:

lib1 2003-069_M1.fastq 2003-069_M2.fastq 10000 0.9 RF

This allows for inserts between 1 and 19 kb, which was the only way I could get much of a reduction in the number of contigs.

I then did some assessments using Mauve in order to judge synteny and consistency:

- first 2 panels show the original assemblies compared to the reference sequence.
- next is the SSPACE scaffold for one strain against the reference sequence (notice the decrease in synteny).
- the largest scaffold accounted for approx. 50% of the genome, so I compared that to the reference to get a clearer picture.
- then I used that large scaffold contig as a basis for ordering the contigs of the second strain.
- and finally used the same scaffold contig to order the SSPACE contigs of the second strain.

I should also mention that these are Neisseria spp., which are prone to a lot of recombination, but my understanding is that it is more homologous substitution than intragenomic rearrangement. So considering the high degree of similarity between these two strains by other methods, I'm having difficulty accepting the resulting scaffolds.

Also, the documentation for SSPACE is a little sparse and I'm having difficulty really understanding some of the outputs. In the summary I typically see this type of output:

LIBRARY lib1 STATS:
################################################################################

MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 898816
Number of pairs used for pairing contigs / total pairs = 197426 / 197426
------------------------------------------------------------

READ PAIRS STATS:
Assembled pairs: 197426 (394852 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 10000 +/-9000): 752
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 508
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 3643
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 46814
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 145709
---
Total satisfied: 47566 unsatisfied: 149860

Estimated insert size statistics (based on 1260 pairs):
Mean insert size = 27823
Median insert size = 10632

REPEATS:
Number of repeated edges = 15

I take it that a lot of the data has some sort of issue, hence the "unsatisfied", but does that mean it's not used (I assume so)? And why would so much be unsatisfied? Could it be that the majority of the Nextera mate pair reads are not really mate pairs (i.e. RF orientation) but normal paired-end reads?

Also, in the evidence file I see things like this:

>scaffold3|size544014|tigs13
r_tig32|size48304|links14|gaps-3188
f_tig18|size33213|links16|gaps-4579
f_tig46|size11990|links14|gaps-3172
f_tig12|size52012|links19|gaps-5385
f_tig10|size85701|links14|gaps-3798
r_tig9|size66090|links18|gaps-2378
r_tig26|size50117|links12|gaps-2939
f_tig43|size11986|links9|gaps577
r_tig28|size18850|links15|gaps-4130
r_tig20|size27631|links18|gaps-4682
f_tig8|size63306|links17|gaps-4055
f_tig38|size36077|links8|gaps-3291
r_tig17|size38149

Based on the negative gaps I'm assuming it "merged" contigs, which I didn't want to do. After all, MIRA kept them separate for a reason. Most of these scaffolds don't have any N's in them, which would also indicate merging. I repeated this particular scaffolding adding -n 6000, which is supposed to indicate the minimum overlap needed to merge contigs, but the results were the same as without it.

So all in all I don't know what to think, other than that I don't trust the scaffolds I'm getting. But is it the software, the data or the user that should be blamed?
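The arithmetic in this message (the 1 to 19 kb insert window and the count of negative gaps) can be checked with a short Python sketch. It assumes the library line has the field order shown in the message (name, two FASTQ files, insert size, allowed error fraction, orientation) and that evidence lines use the `|`-separated layout quoted above. The function names `insert_bounds` and `negative_gaps` are made up for illustration and are not part of SSPACE.

```python
def insert_bounds(library_line):
    """Return (min, max) accepted insert size for a library line such as
    'lib1 reads_1.fastq reads_2.fastq 10000 0.9 RF' (assumed field order)."""
    fields = library_line.split()
    insert_size = int(fields[3])
    error = float(fields[4])
    low = round(insert_size * (1 - error))
    high = round(insert_size * (1 + error))
    return low, high


def negative_gaps(evidence_lines):
    """Return (contig, gap) for every evidence line whose gap is negative.
    Lines without a 'gaps' field (scaffold headers, last contig) are skipped."""
    hits = []
    for line in evidence_lines:
        fields = line.strip().split("|")
        for field in fields:
            if field.startswith("gaps"):
                gap = int(field[len("gaps"):])
                if gap < 0:
                    hits.append((fields[0], gap))
    return hits
```

For the library line quoted in this thread, `insert_bounds` gives 1000 and 19000, which agrees with the "10000 +/-9000" shown in the summary output. Running `negative_gaps` over a whole evidence file gives a quick count of how many joins carry a negative gap estimate.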
Shaun

***********************************************
Shaun Tyler
National Microbiology Laboratory | Laboratoire national de microbiologie
Public Health Agency of Canada | Agence de la santé publique du Canada
Canadian Science Centre for Human and Animal Health | Centre scientifique canadien de la santé humaine et animale
Winnipeg, Canada R3E 3P6
shaun.tyler@xxxxxxxxxxxxxxx
Telephone | Téléphone 204-789-6030 / Facsimile | Télécopieur 204-789-2018
Government of Canada | Gouvernement du Canada

From: David Coil <coil.david@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2013-10-07 11:23 AM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

I second the motion for having this discussion here, as long as Bastien doesn't mind. I think a lot of MIRA users are using SSPACE.

David

On Mon, Oct 7, 2013 at 8:57 AM, Adrian Pelin <apelin20@xxxxxxxxx> wrote:

It doesn't need to go here, but I think it is still very relevant, since we would be talking about scaffolding a MIRA assembly. I have not yet found how to get SSPACE to work nicely. I tried forcing it to scaffold multiple times using the same assembly, but I'm not sure how reliable those results are.

Sincerely,
Adrian

On Oct 7, 2013, at 11:39 AM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:

I have some questions about optimising parameter settings for SSPACE that probably don't need to go to this group. If you don't mind discussing, drop me a line at shaun.tyler@xxxxxxxxxxxxxxx.

Shaun

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-01 10:53 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

SSPACE is good. The free version does a nice job using Nextera mate pair data that we have generated.

---
Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org

From: <Walter>, Mathias <mathias@xxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Wednesday, 2 October 2013 4:58 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

You can also try Opera and the scaffolder of SOAPdenovo.

--
Kind regards,
Mathias

2013/10/1 Bastien Chevreux <bach@xxxxxxxxxxxx>:

On Oct 1, 2013, at 19:30, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:

We started playing with the newish Nextera mate-pair libraries and I'm wondering what people are doing for scaffolding MIRA assemblies. Right now I'm starting to look at SSPACE and Bambus2 (AMOS). Any comments, suggestions, advice?

Feedback from people here and off-list suggests SSPACE to be a good first choice, as it seems to be a lot easier to handle and get data into than Bambus.

B.

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html