[mira_talk] Re: Scaffolding contigs

  • From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 8 Oct 2013 19:00:24 -0500

Thanks.  Unfortunately I got busy with other stuff today so haven't had a
chance to give this a try.  I'm going to try shutting my door tomorrow to
see if that helps.  Maybe they'll think I'm not here ;-)

But for now, do you have any comment on the final library size distribution
that we obtained?  Is this also typical of the libraries you've been using?
This is our first go at using Nextera mate pairs so I have a lot of
questions on both the wet end and the analysis.

I probably don't fully understand what your script does so I'll just have
to give it a shot and see what happens.  I did take a quick look at it to
try and understand the logic but to be honest most of it was gobbledygook.
I'm more of a lab rat than a computer geek.  I'm pretty good at figuring out
how to apply analysis tools but up until recently when people talked about
perl I thought they meant those things you found in clams and things ;-)

Shaun



From:   Jason Steen <j.steen2@xxxxxxxxx>
To:     "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date:   2013-10-08 04:50 PM
Subject:        [mira_talk] Re: Scaffolding contigs
Sent by:        mira_talk-bounce@xxxxxxxxxxxxx



Well, my script is aware of both reads.  It just gathers some statistics on
the way through.  It was written for my lab, from the information in the
technote you posted.

It is not an SFF extract variant; it looks at both reads of the pair,
determines where the internal adaptor falls (using slightly fuzzy matching),
and chooses the most likely mate pair fragment from that information. Well,
at least that’s what it's done the last dozen times I've run it.  Data
processed in this way has been used to super-scaffold metagenome assemblies,
and SSPACE output correlates well with binning information of contigs using
other in-house tools (i.e., more contigs are joined within bins than between
bins).  We have used it on both gel-extracted and gel-free libraries.
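
The adaptor-locating step described above can be sketched roughly as follows. This is not the linked Perl script: it is a minimal illustration that scans for the full-length junction adaptor by mismatch count only (no indels, no partial adaptor at the read end). The adaptor sequence is assumed from Illumina's Nextera mate pair documentation and should be checked against your own kit; the function names and the mismatch threshold are hypothetical.

```python
# Junction adaptor: ASSUMED from Illumina's Nextera mate pair documentation.
# Verify against your own kit before relying on it.
JUNCTION = "CTGTCTCTTATACACATCT" + "AGATGTGTATAAGAGACAG"


def find_adapter(read, adapter=JUNCTION, max_mismatches=3):
    """Return the 0-based start of the best full-length adaptor hit, or None.

    A hit is any window of len(adapter) bases that differs from the adaptor
    at no more than max_mismatches positions (substitutions only).
    """
    best_pos, best_mm = None, max_mismatches + 1
    for start in range(len(read) - len(adapter) + 1):
        window = read[start:start + len(adapter)]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches < best_mm:
            best_pos, best_mm = start, mismatches
    return best_pos


def split_read(read, adapter=JUNCTION, max_mismatches=3):
    """Split a read into (sequence before adaptor, sequence after adaptor).

    If no adaptor is found, the whole read is returned as the first element
    and the second element is an empty string.
    """
    pos = find_adapter(read, adapter, max_mismatches)
    if pos is None:
        return read, ""
    return read[:pos], read[pos + len(adapter):]
```

The point of returning both halves is the one made earlier in the thread: whatever follows the internal adaptor is sequence from the other end of the fragment, so it is kept rather than trimmed away.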

Cheers



From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Wednesday, 9 October 2013 12:00 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs



I'll give it a shot but .........

The way you are proposing to use the data is what is typically done for 454
mate pair libraries where you only have the single read.  sff_extract does
essentially what your script apparently does.  This may very well work but
it is not the way the data is intended to be used.

http://res.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf


I also suspect that very few mate pairs will be found.  The final libraries
range from about 400 bp - 2000 bp with a mean around 900 bp.  So even with
2 x 250 bp reads the likelihood of sequencing across the adapter is pretty
slim.
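
A rough back-of-the-envelope check of that likelihood, under a deliberately simple model: the junction adaptor is assumed to sit at a uniformly random position along the fragment, the quoted library sizes are treated as fragment lengths (ignoring the sequencing adaptors they include), and a read "reaches" the junction if the junction lies within one read length of its end of the fragment. None of these assumptions come from the thread, and the function name is hypothetical.

```python
def fraction_reaching_junction(fragment_len, read_len=250):
    """Fraction of fragments where at least one read of the pair reaches the
    junction, ASSUMING the junction position is uniform along the fragment.

    Read 1 covers the first read_len bases and read 2 the last read_len
    bases, so together they cover 2 * read_len of the fragment (or all of it
    when the fragment is shorter than that).
    """
    return min(1.0, 2 * read_len / fragment_len)


# Library sizes quoted above: about 400 bp to 2000 bp, mean around 900 bp.
for size in (400, 900, 2000):
    print(size, round(fraction_reaching_junction(size), 2))
```

Under this model the fraction depends strongly on where the real junction tends to fall, so it is only an upper-level sanity check; whether a read that reaches the adaptor leaves enough usable sequence on either side is a separate question.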

Shaun





From: Jason Steen <j.steen2@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: 2013-10-07 09:35 PM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx



You mention using cutadapt to:

 I used cutadapt to trim off the internal adapter and anything after


This sounds bad to my ears.  The part after the internal adaptor (if it
exists) is the second read.

Although I'm quite embarrassed by my perl skills, here is a perl script that
I've written to process Nextera mate pair raw data.  Please don’t laugh too
hard.

https://github.com/jasteen/nextera_matepairs

There are two scripts; the one ending in "_variable_lengths" simply takes
an extra command-line parameter in case your data isn't 250 bp data.

Can you put your raw data through processing and return the output?  Maybe
take the output processed reads and try sspace again too.

Cheers

Jason

---

Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org



From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Tuesday, 8 October 2013 8:30 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Scaffolding contigs


I'm hoping I'm just doing something stupid because the scaffolding just
doesn't make sense.  Here's a bit of background.

The initial assemblies were hybrids with 454 titanium and MiSeq paired end
(2 x 250 bp).  We were interested in how the nextera mate pair performed so
we re-sequenced 10 isolates for scaffolding.  These were done using the Gel
Free protocol so the insert distribution is rather large.   I've included
the post-tagmentation traces for the 2 strains I've been comparing to give
you an idea of the insert span prior to circularization.  After sequencing
I used cutadapt to trim off the internal adapter and anything after from
the reads.



      (See attached file: sspace_maps.docx)

Note:  These 2 isolates have identical PFGE patterns and only differ from
each other by just over 300 SNPs when compared to the reference sequence.
They also appear identical using a number of other typing methods.

The only switches I've used in running SSPACE are the following:   -k 5 -a
0.7 -g 3 -T 4 -x 0    I didn't want to extend and merge contigs because I
was unsure of what might happen with repeats.

      And the mate file looks like this:  lib1 2003-069_M1.fastq
      2003-069_M2.fastq 10000 0.9 RF   This allows for inserts between 1-19
      kb which was the only way I could get much of a reduction in the
      number of contigs.
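
The 1-19 kb range follows from the insert size and error fraction on that library line (the SSPACE summary further down reports it as "10000 +/-9000"). A small sketch of the arithmetic, with a hypothetical helper name and the field order taken from the line quoted above:

```python
def insert_bounds(library_line):
    """Return (min_insert, max_insert) for one library line.

    Field order is taken from the line quoted in this message:
    name, first read file, second read file, insert size, error, orientation.
    """
    name, reads1, reads2, insert, error, orientation = library_line.split()
    insert, error = int(insert), float(error)
    # round() guards against floating-point noise such as 999.9999999999998
    return round(insert * (1 - error)), round(insert * (1 + error))


print(insert_bounds("lib1 2003-069_M1.fastq 2003-069_M2.fastq 10000 0.9 RF"))
```

With an error fraction of 0.9 almost any distance is accepted, which is worth keeping in mind when judging how much the resulting links constrain the scaffold.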

      I then did some assessments using Mauve in order to judge synteny and
      consistency:
      - first 2 panels show the original assemblies compared to the
      reference sequence.
      - next is the SSPACE scaffold for one strain against the reference
      sequence (notice the decrease in synteny).
      - the largest scaffold accounted for approx 50% of the genome so I
      compared that to the reference to get a clearer picture.
      - then I used that large scaffold contig as a basis for ordering the
      contigs of the second strain.
      - and finally used the same scaffold contig to order the SSPACE
      contigs of the second strain.
I should also mention that these are Neisseria spp., which are prone to a lot
of recombination, but my understanding is that it is more homologous
substitutions as opposed to intragenomic rearrangements.  So considering
the high degree of similarity between these two strains by other methods
I'm having difficulty accepting the resulting scaffolds.

Also the documentation for SSPACE is a little sparse and I'm having
difficulty really understanding some of the outputs.  In the summary I
typically see this type of output:

LIBRARY lib1 STATS:
################################################################################


MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 898816
Number of pairs used for pairing contigs / total pairs = 197426 / 197426
------------------------------------------------------------

READ PAIRS STATS:
Assembled pairs: 197426 (394852 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target:
10000 +/-9000): 752
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 508
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<-
or <-->): 3643
---
Satisfied in distance/logic within a given contig pair (pre-scaffold):
46814
Unsatisfied in distance within a given contig pair (i.e. calculated
distances out-of-bounds): 145709
---
Total satisfied: 47566 unsatisfied: 149860


Estimated insert size statistics (based on 1260 pairs):
Mean insert size = 27823
Median insert size = 10632
REPEATS:
Number of repeated edges = 15
      I take it that a lot of the data has some sort of issue, hence the
      "unsatisfied", but does that mean it's not used?  (I assume so.)  And
      why would so much be unsatisfied?  Could it be that the majority
      of the Nextera mate pair reads are not really mate pairs (i.e. RF
      orientation) but normal paired-end reads?
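
The counts in the summary above can be tallied directly to see how the pairs break down. The numbers below are copied from that output; the variable names are hypothetical. Two things fall out of the arithmetic: the category counts add up to the total of 197426 pairs, and 752 + 508 = 1260 matches the "based on 1260 pairs" in the insert size estimate, which suggests (an inference, not something the output states) that the estimate uses only correctly oriented pairs landing within a single contig.

```python
total_pairs = 197426

# Pairs where both reads map to the same contig
within_contig = {
    "satisfied": 752,
    "distance_out_of_bounds": 508,
    "illogical_orientation": 3643,
}
# Pairs where the two reads map to different contigs
between_contigs = {
    "satisfied": 46814,
    "distance_out_of_bounds": 145709,
}

satisfied = within_contig["satisfied"] + between_contigs["satisfied"]
unsatisfied = total_pairs - satisfied
satisfied_fraction = satisfied / total_pairs

# Of the pairs that fall within one contig, how many have an orientation
# that does not fit the declared RF library?
within_total = sum(within_contig.values())
illogical_fraction = within_contig["illogical_orientation"] / within_total

print(f"satisfied: {satisfied} ({satisfied_fraction:.1%})")
print(f"unsatisfied: {unsatisfied}")
print(f"within-contig pairs with illogical orientation: {illogical_fraction:.1%}")
```

Roughly a quarter of all pairs are satisfied, and about three quarters of the within-contig pairs have an orientation SSPACE calls illogical for the declared library. That is at least compatible with a sizeable share of the pairs not behaving like RF mate pairs, though the summary alone cannot say why.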


      Also in the evidence file I see things like this:
>scaffold3|size544014|tigs13
r_tig32|size48304|links14|gaps-3188
f_tig18|size33213|links16|gaps-4579
f_tig46|size11990|links14|gaps-3172
f_tig12|size52012|links19|gaps-5385
f_tig10|size85701|links14|gaps-3798
r_tig9|size66090|links18|gaps-2378
r_tig26|size50117|links12|gaps-2939
f_tig43|size11986|links9|gaps577
r_tig28|size18850|links15|gaps-4130
r_tig20|size27631|links18|gaps-4682
f_tig8|size63306|links17|gaps-4055
f_tig38|size36077|links8|gaps-3291
r_tig17|size38149


      Based on the negative gaps I'm assuming it "merged" contigs, which I
      didn't want to do.  After all, MIRA kept them separate for a reason.
      Most of these scaffolds don't have any N's in them, which would also
      indicate merger.  I repeated this particular scaffolding adding
      -n 6000, which is supposed to indicate the minimum overlap needed to
      merge contigs, but the results were the same as without it.
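
One way to test the merging assumption is to compare the scaffold size in the evidence header with the sum of its contig sizes. The sketch below parses the entry quoted above (the field layout is read off that entry; the function and variable names are hypothetical). If overlapping sequence had been removed at each negative gap, the scaffold should be shorter than the sum of its contigs.

```python
EVIDENCE = """\
>scaffold3|size544014|tigs13
r_tig32|size48304|links14|gaps-3188
f_tig18|size33213|links16|gaps-4579
f_tig46|size11990|links14|gaps-3172
f_tig12|size52012|links19|gaps-5385
f_tig10|size85701|links14|gaps-3798
r_tig9|size66090|links18|gaps-2378
r_tig26|size50117|links12|gaps-2939
f_tig43|size11986|links9|gaps577
r_tig28|size18850|links15|gaps-4130
r_tig20|size27631|links18|gaps-4682
f_tig8|size63306|links17|gaps-4055
f_tig38|size36077|links8|gaps-3291
r_tig17|size38149
"""


def parse_evidence(text):
    """Return (scaffold_size, contigs) for a single evidence entry.

    Each contig is a dict with its name, size, and gap to the next contig
    (None for the last contig, which has no gap field).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0].lstrip(">").split("|")
    scaffold_size = int(header[1][len("size"):])
    contigs = []
    for ln in lines[1:]:
        fields = ln.split("|")
        entry = {"name": fields[0], "size": int(fields[1][len("size"):]), "gap": None}
        for field in fields[2:]:
            if field.startswith("gaps"):
                entry["gap"] = int(field[len("gaps"):])
        contigs.append(entry)
    return scaffold_size, contigs


scaffold_size, contigs = parse_evidence(EVIDENCE)
contig_total = sum(c["size"] for c in contigs)
gaps = [c["gap"] for c in contigs if c["gap"] is not None]
negative_gaps = [g for g in gaps if g < 0]
positive_gap_total = sum(g for g in gaps if g > 0)

print("scaffold size minus contig total:", scaffold_size - contig_total)
print("positive gaps plus one base per negative gap:",
      positive_gap_total + len(negative_gaps))
```

For this entry the scaffold is longer than the sum of its contigs, and the difference equals the one positive gap plus one base for each of the eleven negative gaps. That is consistent with no contig sequence having been removed here, with each negative-gap join marked by a single placeholder base. This is an inference from the arithmetic only, not a statement of how SSPACE is documented to behave.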

      So all in all I don't know what to think other than I don't trust the
      scaffolds I'm getting.  But is it the software, the data or the user
      that should be blamed?

      Shaun

      ***********************************************
      Shaun Tyler
      National Microbiology Laboratory | Laboratoire national de
      microbiologie
      Public Health Agency of Canada | Agence de la santé publique du
      Canada
      Canadian Science Centre for Human and Animal Health | Centre
      scientifique canadien de la santé humaine et animale
      Winnipeg, Canada  R3E 3P6
      shaun.tyler@xxxxxxxxxxxxxxx
      Telephone | Téléphone 204-789-6030 / Facsimile | Télécopieur
      204-789-2018
      Government of Canada | Gouvernement du Canada


From: David Coil <coil.david@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2013-10-07 11:23 AM
Subject: [mira_talk] Re: Scaffolding contigs
Sent by: mira_talk-bounce@xxxxxxxxxxxxx



I second the motion for having this discussion here as long as Bastien
doesn't mind.  I think a lot of MIRA users are using SSPACE.


David



On Mon, Oct 7, 2013 at 8:57 AM, Adrian Pelin <apelin20@xxxxxxxxx> wrote:
      It doesn't need to go here, but I think it is still very relevant,
      since we would be talking about scaffolding a MIRA assembly.

      I have not yet found how to get sspace to work nicely. I tried
      forcing it to scaffold multiple times using the same assembly, but
      not sure how reliable those results are.

      Sincerely,
      Adrian

      On Oct 7, 2013, at 11:39 AM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx
      > wrote:

            I have some questions about optimising parameter settings for
            SSPACE that probably don't need to go to this group.  If you
            don't mind discussing drop me a line at
            shaun.tyler@xxxxxxxxxxxxxxx.



                  Shaun



            From: Jason Steen <j.steen2@xxxxxxxxx>
            To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
            Date: 2013-10-01 10:53 PM
            Subject: [mira_talk] Re: Scaffolding contigs
            Sent by: mira_talk-bounce@xxxxxxxxxxxxx



            Sspace is good.  The free version does a nice job using nextera
            matepair data that we have generated.


            ---

            Dr Jason Steen
            Research Officer
            Australian Centre for Ecogenomics
            Ph : +61 7 3365 4957
            www.ecogenomics.org



            From: <Walter>, Mathias <mathias@xxxxxxxxx>
            Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
            Date: Wednesday, 2 October 2013 4:58 AM
            To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
            Subject: [mira_talk] Re: Scaffolding contigs

            You can also try Opera and the scaffolder of SOAPdenovo.

            --
            Kind regards,
            Mathias

            2013/10/1 Bastien Chevreux <bach@xxxxxxxxxxxx>:
                  On Oct 1, 2013, at 19:30 , Shaun Tyler <
                  Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:
                        We started playing with the newish Nextera
                        mate-pair libraries and I'm wondering what people
                        are doing for scaffolding MIRA assemblies.  Right
                        now I'm starting to look at SSpace and Bambus2
                        (Amos).  Any comments, suggestions, advice ??

                  Feedback from people here and off-list suggests SSpace to
                  be a good first choice as it seems to be a lot easier to
                  handle and get data into it than into Bambus.

                  B.


                  --
                  You have received this mail because you are subscribed to
                  the mira_talk mailing list. For information on how to
                  subscribe or unsubscribe, please visit
                  http://www.chevreux.org/mira_mailinglists.html

