[mira_talk] Re: Scaffolding contigs

  • From: Jason Steen <j.steen2@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 10 Oct 2013 01:44:46 +0000

I would expect ~75% of the reads to be in the correct RF orientation.  That is 
a VERY long insert size that it is calculating.

---

Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org



From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Thursday, 10 October 2013 8:58 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs


Perhaps this warrants a new thread but for now I'll risk a reprimand ;-)

FYI - I am still working with my cutadapt sequences, but after looking over the 
trimming results, very few of the reads were actually trimmed (maybe 2% or so).

With the gel-free Nextera mate pair libraries, what percentage of reads are 
actually in the RF orientation?  Based on the technical aspects of the 
procedure I would think it should be pretty high.  However, when I'm running 
SSPACE it seems that a lot of the read pairs are unsatisfactory.  Within 
contigs a lot are being flagged due to pairing logic, but between contig pairs 
a lot fail the calculated distance.  So I guess my second question is: what 
distances do you use in SSPACE for the gel-free libraries?  (See the example 
below.)

I suspect I'm just not using the correct scaffolding parameters but I've been 
trying a lot of variables and all of the results still seem a little weird.  So 
I'd be curious to know what others are using/doing.

Shaun



READ PAIRS STATS:
        Assembled pairs: 408655 (817310 sequences)
                Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 10000 +/-9000): 1031
                Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 442
                Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 4031
                ---
                Satisfied in distance/logic within a given contig pair (pre-scaffold): 151973
                Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 251178
                ---
        Total satisfied: 153004 unsatisfied: 255651


        Estimated insert size statistics (based on 1473 pairs):
                Mean insert size = 14498
                Median insert size = 8327
REPEATS:
        Number of repeated edges = 64



From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-09 04:55 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx
________________________________



Sorry, it's reliant on the following Perl module:

http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

I'll send usage information when I get into the office.

From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Thursday, 10 October 2013 3:12 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

So I'm not off to a great start.  I get the following error:

Can't locate String/Approx.pm in @INC (@INC contains: 
/opt/smrtanalysis/analysis/lib /opt/cg-pipeline/lkatz/dependencies/tRNAscan/bin 
/opt/cg-pipeline/lkatz/dependencies/cpanlib/lib/perl5/x86_64-linux-thread-multi 
/opt/cg-pipeline/lkatz/dependencies/cpanlib/lib/perl5 
/usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi 
/usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl 
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi 
/usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl 
/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at 
process_nextera_matepairs_variable_readlength.pl line 30.
BEGIN failed--compilation aborted at 
process_nextera_matepairs_variable_readlength.pl line 30.

Also, could you clarify the command usage?  Thanks.

Shaun

From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-08 04:50 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx
________________________________



Well, my script is aware of both reads.  It just gathers some statistics on the 
way through.  It was written for my lab, from the information in the technote 
you posted.

It is not an sff_extract variant: it looks at both reads of a pair, determines 
where the internal adaptor falls (using slightly fuzzy matching), and chooses 
the most likely mate pair fragment from that information.  Well, at least 
that's what it's done the last dozen times I've run it.  Data processed in this 
way has been used to super-scaffold metagenome assemblies, and SSPACE output 
correlates well with binning information of contigs using other in-house tools 
(i.e., more contigs are joined within bins than between bins).  We have used it 
on both gel-extracted and gel-free libraries.
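
As a rough illustration of the adaptor-locating step described above, here is a 
minimal Python sketch.  It is not the Perl script from this thread: that script 
relies on String::Approx, whereas this sketch swaps in a plain sliding-window 
mismatch count (substitutions only, no indels).  The function names, the 
max_mismatches threshold, the toy read, and the adaptor string are all 
assumptions for illustration; check the adaptor sequence against the Illumina 
technote before relying on it.

```python
# Assumed adaptor sequence for illustration only; verify against the technote.
ADAPTER = "CTGTCTCTTATACACATCT"


def find_adapter(read, adapter=ADAPTER, max_mismatches=2):
    """Return the 0-based start of the best adaptor hit, or None if no hit.

    Only substitutions are tolerated (Hamming distance), not indels.
    """
    best_pos, best_mm = None, max_mismatches + 1
    for start in range(len(read) - len(adapter) + 1):
        window = read[start:start + len(adapter)]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches < best_mm:
            best_pos, best_mm = start, mismatches
    return best_pos


def split_at_adapter(read, adapter=ADAPTER, max_mismatches=2):
    """Split a read into (before_adapter, after_adapter).

    If no adaptor is found, the whole read is returned as the first part.
    """
    pos = find_adapter(read, adapter, max_mismatches)
    if pos is None:
        return read, ""
    return read[:pos], read[pos + len(adapter):]


# Toy read: 10 bases, the adaptor with one substitution, then 8 more bases.
toy_read = "ACGTACGTAC" + "CTGTCTCTTAAACACATCT" + "GGCCTTAA"
before, after = split_at_adapter(toy_read)
print(before, after)
```

Both parts are kept here rather than discarding everything after the adaptor, 
since the sequence on the far side of the junction may belong to the mate.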

Cheers



From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Wednesday, 9 October 2013 12:00 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

I'll give it a shot but .........

The way you are proposing to use the data is what is typically done for 454 
mate pair libraries where you only have the single read.  sff_extract does 
essentially what your script apparently does.  This may very well work but it 
is not the way the data is intended to be used:

http://res.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf

I also suspect that very few mate pairs will be found.  The final libraries 
range from about 400 bp - 2000 bp with a mean around 900 bp.  So even with 2 x 
250 bp reads the likelihood of sequencing across the adapter is pretty slim.

Shaun


From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-07 09:35 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx
________________________________



You mention using cutadapt as follows:

"I used cutadapt to trim off the internal adapter and anything after"

This sounds bad to my ears.  The part after the internal adaptor (if it 
exists) is the second read.

Although I'm quite embarrassed by my Perl skills, here is a Perl script that 
I've written to process Nextera mate pair raw data.  Please don't laugh too hard.

https://github.com/jasteen/nextera_matepairs

There are two scripts; the one ending in "_variable_readlength" simply takes an 
extra command line parameter in case your data isn't 250 bp data.

Can you put your raw data through processing and return the output?  Maybe take 
the output processed reads and try sspace again too.

Cheers

Jason

---

Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org



From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Tuesday, 8 October 2013 8:30 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

I'm hoping I'm just doing something stupid because the scaffolding just doesn't 
make sense.  Here's a bit of background.

The initial assemblies were hybrids with 454 titanium and MiSeq paired end (2 x 
250 bp).  We were interested in how the nextera mate pair performed so we 
re-sequenced 10 isolates for scaffolding.  These were done using the Gel Free 
protocol so the insert distribution is rather large.  I've included the 
post-tagmentation traces for the 2 strains I've been comparing to give you an 
idea of the insert span prior to circularization.  After sequencing I used 
cutadapt to trim off the internal adapter and anything after it from the reads.



(See attached file: sspace_maps.docx)

Note:  These 2 isolates have identical PFGE patterns and only differ from each 
other by just over 300 SNPs when compared to the reference sequence.  They also 
appear identical using a number of other typing methods.

The only switches I've used in running SSPACE are the following:   -k 5 -a 0.7 
-g 3 -T 4 -x 0    I didn't want to extend and merge contigs because I was 
unsure of what might happen with repeats.

And the mate file looks like this:  lib1 2003-069_M1.fastq 2003-069_M2.fastq 
10000 0.9 RF   This allows for inserts between 1-19 kb which was the only way I 
could get much of a reduction in the number of contigs.
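
For reference, the accepted insert range follows directly from the two numeric 
fields on the library line (expected insert size and allowed error fraction).  
A minimal Python sketch of that arithmetic, where the function name is made up 
for illustration and the symmetric mean +/- (mean x error) window is taken from 
the "10000 +/-9000" wording in the SSPACE summary quoted in this thread:

```python
def insert_window(mean_insert, error_fraction):
    """Return (min, max) insert size for a given mean and error fraction."""
    delta = round(mean_insert * error_fraction)
    return mean_insert - delta, mean_insert + delta


# Values from the library line above: 10000 bp insert, 0.9 error fraction.
low, high = insert_window(10000, 0.9)
print(f"accepted insert range: {low}-{high} bp")
```

With these values the window is 1000 to 19000 bp, i.e. the 1-19 kb range 
mentioned above.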

I then did some assessments using Mauve in order to judge synteny and 
consistency:
- first 2 panels show the original assemblies compared to the reference 
sequence.
- next is the SSPACE scaffold for one strain against the reference sequence 
(notice the decrease in synteny).
- the largest scaffold accounted for approx 50% of the genome so I compared 
that to the reference to get a clearer picture.
- then I used that large scaffold contig as a basis for ordering the contigs of 
the second strain.
- and finally used the same scaffold contig to order the SSPACE contigs of the 
second strain.

I should also mention that these are Neisseria spp., which are prone to a lot 
of recombination, but my understanding is it is more homologous substitutions as 
opposed to intra genomic rearrangements.  So considering the high degree of 
similarity between these two strains by other methods I'm having difficulty 
accepting the resulting scaffolds.

Also the documentation for SSPACE is a little sparse and I'm having difficulty 
really understanding some of the outputs.  In the summary I typically see this 
type of output

LIBRARY lib1 STATS:
################################################################################

MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 898816
Number of pairs used for pairing contigs / total pairs = 197426 / 197426
------------------------------------------------------------

READ PAIRS STATS:
Assembled pairs: 197426 (394852 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 10000 +/-9000): 752
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 508
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 3643
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 46814
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 145709
---
Total satisfied: 47566 unsatisfied: 149860


Estimated insert size statistics (based on 1260 pairs):
Mean insert size = 27823
Median insert size = 10632
REPEATS:
Number of repeated edges = 15

I take it that a lot of the data has some sort of issues hence the 
"unsatisfied"  but does that mean it's not used (I assume so).  And why would 
so much be unsatisfactory??  Could it be that the majority of the nextera mate 
pair reads are not really mate pairs (i.e. RF orientation) but normal paired 
end reads ????
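
To put the summary above into proportions, here is a minimal Python sketch 
using only the counts printed in that SSPACE output.  The variable names are 
made up for illustration, and the sketch says nothing about why pairs are 
unsatisfied; it only restates the counts as fractions.

```python
# Counts copied from the READ PAIRS STATS block above.
assembled_pairs = 197426
total_satisfied = 47566
total_unsatisfied = 149860

# Within-contig categories from the same block.
within_ok = 752
within_bad_distance = 508
within_bad_logic = 3643

# Sanity check: the two totals should account for every assembled pair.
assert total_satisfied + total_unsatisfied == assembled_pairs

satisfied_fraction = total_satisfied / assembled_pairs
within_total = within_ok + within_bad_distance + within_bad_logic
bad_logic_fraction = within_bad_logic / within_total

print(f"satisfied overall: {satisfied_fraction:.1%}")
print(f"illogical orientation among within-contig pairs: {bad_logic_fraction:.1%}")
```

Roughly a quarter of the assembled pairs are satisfied overall, and about 
three quarters of the pairs that fall within a single contig are flagged for 
pairing logic.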


Also in the evidence file I see things like this

>scaffold3|size544014|tigs13
r_tig32|size48304|links14|gaps-3188
f_tig18|size33213|links16|gaps-4579
f_tig46|size11990|links14|gaps-3172
f_tig12|size52012|links19|gaps-5385
f_tig10|size85701|links14|gaps-3798
r_tig9|size66090|links18|gaps-2378
r_tig26|size50117|links12|gaps-2939
f_tig43|size11986|links9|gaps577
r_tig28|size18850|links15|gaps-4130
r_tig20|size27631|links18|gaps-4682
f_tig8|size63306|links17|gaps-4055
f_tig38|size36077|links8|gaps-3291
r_tig17|size38149


Based on the negative gaps I'm assuming it "merged" contigs, which I didn't 
want it to do.  After all, MIRA kept them separate for a reason.  Most of these 
scaffolds don't have any N's in them, which would also indicate merger.  I 
repeated this particular scaffolding adding -n 6000, which is supposed to 
indicate the minimum overlap needed to merge contigs, but the results were the 
same as without it.
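
One way to list the joins with negative gaps for closer inspection is to parse 
the evidence lines.  The following is a minimal Python sketch; the line format 
is inferred only from the excerpt above (real evidence files may carry extra 
fields this pattern does not handle), and the names are made up for 
illustration.

```python
import re

# Pattern inferred from the excerpt, e.g. "r_tig32|size48304|links14|gaps-3188".
# The last contig of a scaffold has no links/gaps fields.
EVIDENCE_LINE = re.compile(
    r"^(?P<orient>[fr])_tig(?P<tig>\d+)\|size(?P<size>\d+)"
    r"(?:\|links(?P<links>\d+)\|gaps(?P<gap>-?\d+))?$"
)


def parse_evidence(lines):
    """Yield one dict per contig line; '>' scaffold headers are skipped."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        match = EVIDENCE_LINE.match(line)
        if match is None:
            continue  # unknown layout: skip rather than guess
        fields = match.groupdict()
        yield {
            "orient": fields["orient"],
            "tig": int(fields["tig"]),
            "size": int(fields["size"]),
            "links": int(fields["links"]) if fields["links"] is not None else None,
            "gap": int(fields["gap"]) if fields["gap"] is not None else None,
        }


# Three contig lines taken from the excerpt above.
example = """\
>scaffold3|size544014|tigs13
r_tig32|size48304|links14|gaps-3188
f_tig43|size11986|links9|gaps577
r_tig17|size38149
""".splitlines()

records = list(parse_evidence(example))
negative = [r for r in records if r["gap"] is not None and r["gap"] < 0]
print(f"{len(negative)} of {len(records)} contig lines report a negative gap")
```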

So all in all I don't know what to think, other than that I don't trust the 
scaffolds I'm getting.  But is it the software, the data or the user that 
should be blamed?

Shaun

***********************************************
Shaun Tyler
National Microbiology Laboratory | Laboratoire national de microbiologie
Public Health Agency of Canada | Agence de la santé publique du Canada
Canadian Science Centre for Human and Animal Health | Centre scientifique 
canadien de la santé humaine et animale
Winnipeg, Canada  R3E 3P6
shaun.tyler@xxxxxxxxxxxxxxx
Telephone | Téléphone 204-789-6030 / Facsimile | Télécopieur 204-789-2018
Government of Canada | Gouvernement du Canada

From: David Coil <coil.david@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2013-10-07 11:23 AM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx
________________________________



I second the motion for having this discussion here as long as Bastien doesn't 
mind.  I think a lot of MIRA users are using SSPACE.


David



On Mon, Oct 7, 2013 at 8:57 AM, Adrian Pelin <apelin20@xxxxxxxxx> wrote:

It doesn't need to go here, but I think it is still very relevant, since we 
would be talking about scaffolding a MIRA assembly.

I have not yet found how to get SSPACE to work nicely.  I tried forcing it to 
scaffold multiple times using the same assembly, but I'm not sure how reliable 
those results are.

Sincerely,
Adrian

On Oct 7, 2013, at 11:39 AM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:

I have some questions about optimising parameter settings for SSPACE that 
probably don't need to go to this group.  If you don't mind discussing, drop me 
a line at shaun.tyler@xxxxxxxxxxxxxxx.



Shaun



From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-01 10:53 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx
________________________________



SSPACE is good.  The free version does a nice job using the Nextera mate pair 
data that we have generated.


---

Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org



From: <Walter>, Mathias <mathias@xxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Wednesday, 2 October 2013 4:58 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs

You can also try Opera and the scaffolder of SOAPdenovo.

--
Kind regards,
Mathias

2013/10/1 Bastien Chevreux <bach@xxxxxxxxxxxx>:
On Oct 1, 2013, at 19:30, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:
We started playing with the newish Nextera mate-pair libraries and I'm 
wondering what people are doing for scaffolding MIRA assemblies.  Right now I'm 
starting to look at SSpace and Bambus2 (Amos).  Any comments, suggestions, 
advice ??

Feedback from people here and off-list suggests SSpace to be a good first 
choice as it seems to be a lot easier to handle and get data into it than into 
Bambus.

B.


--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
