[mira_talk] Re: Gap closure

  • From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 8 May 2012 11:53:30 -0500

I was referring to things like the files in the info folder.  The
contigstats file will give you some idea of the coverage for the individual
contigs, etc.  Contigs with excessive coverage are generally pile-ups of
repeats.  This is not so much about closing gaps as it is about
understanding the data you are working with.  The other files can be useful
too, but not if you are working through them by hand.  They really need to
be parsed with Perl scripts or something similar.
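
As a rough illustration of that kind of parsing (in Python rather than Perl),
the sketch below flags contigs whose average coverage is well above the
median.  It assumes the contigstats file is tab-separated, with the contig
name in the first column, the average coverage in the sixth, and header
lines starting with "#"; check this against your own file.  The function
name, the 2x cut-off and the example rows are made up for illustration.

```python
import csv
import io
import statistics


def high_coverage_contigs(handle, name_col=0, cov_col=5, factor=2.0):
    """Return (name, coverage) for contigs above factor x the median coverage.

    Column positions are assumptions about the file layout, not a guarantee.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and header/comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        rows.append((fields[name_col], float(fields[cov_col])))
    if not rows:
        return []
    median = statistics.median(cov for _, cov in rows)
    return [(name, cov) for name, cov in rows if cov > factor * median]


# Made-up example rows, only to show the expected shape of the input.
EXAMPLE = (
    "# name\tlength\tav.qual\t#-reads\tmx.cov.\tav.cov\n"
    "c1\t120000\t60\t9000\t55\t30.0\n"
    "c2\t98000\t61\t7800\t60\t32.0\n"
    "rep_c3\t5200\t58\t2100\t210\t150.0\n"
    "c4\t45000\t60\t3100\t50\t28.0\n"
)

if __name__ == "__main__":
    for name, cov in high_coverage_contigs(io.StringIO(EXAMPLE)):
        print(name, cov)
```

On a real assembly you would pass an open file handle for the contigstats
file in the info folder instead of the made-up example.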

Shaun






From:   Shankar Manoharan <shankarmanostar@xxxxxxxxx>
To:     mira_talk@xxxxxxxxxxxxx
Date:   2012-05-07 07:22 AM
Subject:        [mira_talk] Re: Gap closure
Sent by:        mira_talk-bounce@xxxxxxxxxxxxx



Dear Shaun...

Thanks a LOT for that detailed and very helpful walk-through. I work on
whole genome sequencing of bacterial isolates. We have roughly 3-4
references for my species. The objective of my work is to establish a
gene-function relationship for certain characteristics of my isolate. Of
the 14 repeats, two code for proteins; the remaining 12 are 23S, 16S and 5S
rRNA-encoding sequences (yes, multiple copies of the same thing). It is
essential for me to close the genomic gaps by whatever means may be
available. I'll first try to order the contigs before proceeding further.
Thanks for suggesting the tools. You mentioned something about ancillary
data generated by MIRA. Would you mind explaining what it is and how it
will help in gap closure?

Best regards

Shankar Manoharan
Graduate Student
Department of Genetics
Madurai Kamaraj University
Ph. +919790167534

I strongly believe in doing my best and leaving the rest to God




On Mon, May 7, 2012 at 4:46 AM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
wrote:
  There is no “this is how you do it” when it comes to genome closure.
  Every project is different and will have different challenges which
  require different approaches.  However, I can offer a few tips.


  First off, and most importantly, know what it is you want to
  accomplish.  This is a question that should have been addressed before
  you even started with the sequencing, as it ultimately affects the
  route you need to take.  Characterizing a novel organism is different
  from doing a comparative project looking at differences in gene content,
  SNPs, phylogenetic relatedness, etc.  Too often people embark on these
  projects without having a clear idea of the questions they want answered,
  only to find out that the track they’ve taken is not really appropriate.
  But your question was on closing gaps.


  Basically you’ll do this by PCR and conventional sequencing.  Typically
  the genome coverage will be close to 100%, so the gaps you are left with
  are generally fairly short or are due to repeats like rRNA sequences, IS
  elements and things like that.  The first thing to do is to get to know
  your data.  You mentioned 14 repeats.  What are they?  Are they different
  copies of the same thing?   How big are they and what coverage do they
  have compared to the rest of the genome?  They may be separate contigs
  but could still represent multiple copies of the same thing.  Do you care
  about these sequences and closing the gaps they create?


  Start by doing some simple BLAST searches and looking over the
  ancillary data files created by MIRA.  The better you understand the data,
  the better off you will be in planning a line of attack.  You should
  also use something like GAP5, Tablet or other pileup viewers to assess
  the contigs created.  Inconsistencies in depth of coverage or paired-end
  distribution could indicate misassemblies.  MIRA is good but nothing is
  perfect.  Contigs flanking repeats will typically have a higher depth of
  coverage at the terminal ends.  Interrogate these regions to find out
  what repeat you are dealing with.  It will also aid you in primer design.
  No point putting a primer at the end of a contig if it is likely to prime
  in multiple locations in the genome and give you multiple products.
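
  One quick way to check a candidate primer against that problem is to
  count how often it occurs, on either strand, across all of your contigs.
  The sketch below does this with exact string matching only, so it will
  miss near-matches that could still prime; a BLAST search is the more
  thorough check.  The function names and sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def count_sites(primer, contigs):
    """Count exact primer matches on both strands across a dict of contigs."""
    primer = primer.upper()
    rc = reverse_complement(primer)
    total = 0
    for seq in contigs.values():
        seq = seq.upper()
        total += count_overlapping(seq, primer)
        # A palindromic primer equals its reverse complement; avoid
        # counting the same site twice.
        if rc != primer:
            total += count_overlapping(seq, rc)
    return total


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    contigs = {"c1": "TTTTGATTACAGGGG", "c2": "ACACTGTAATCGTGT"}
    for primer in ("GATTACA", "CAGGGG"):
        print(primer, count_sites(primer, contigs))
```

  A primer with a count above 1 is a candidate for giving multiple
  products and is probably worth redesigning.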


  Scaffolding is also going to be required in order to orient your contigs
  and plan your PCR experiments.  If you don’t have paired-end data to
  facilitate this, you can use programs like Mauve, MUMmer/NUCmer, ABACAS,
  Projector2, etc. to order your contigs based on a reference sequence.
  Just keep in mind that the resulting contig order is only as good as the
  reference sequence (and program) you use, so trying different reference
  sequences and looking for consistency is usually recommended.  In the end
  it probably won’t be perfect, so expect some predictions to be wrong.
  ABACAS and Projector2 are nice because they will provide you with a
  primer list for closing gaps, but I sometimes question the ordering that
  they come up with, and the primers designed are not always the best (e.g.
  they can target obvious repeat regions).
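
  To make "looking for consistency" concrete: once you have a contig order
  from each reference, you can list the contig joins that every ordering
  agrees on and treat the rest with more suspicion.  The sketch below
  assumes each ordering is simply a list of contig names, and it ignores
  orientation and the circularity of the chromosome.  The names are
  hypothetical and not tied to any particular tool's output format.

```python
def adjacent_pairs(order):
    """Set of unordered neighbouring contig pairs in one ordering."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def consistent_joins(orderings):
    """Joins (neighbouring contig pairs) present in every ordering."""
    if not orderings:
        return []
    common = set.intersection(*(adjacent_pairs(o) for o in orderings))
    return sorted(tuple(sorted(pair)) for pair in common)


if __name__ == "__main__":
    # Toy orderings as they might come from two different references.
    order_ref_a = ["c1", "c2", "c3", "c4"]
    order_ref_b = ["c4", "c3", "c1", "c2"]
    print(consistent_joins([order_ref_a, order_ref_b]))
```

  Joins supported by all references are reasonable first targets for PCR;
  the others are the predictions most likely to be wrong.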


  Another tip has to do with the repeats.  In all likelihood you will have
  multiple copies of rRNA sequences or other large regions that will
  require primer walking.  If you are looking to close these gaps, make use
  of the data you have to make things easier.  The rRNA regions, for example,
  will typically be 5 kb or so and would require multiple rounds of
  sequence, design primer, sequence, design primer, etc.  But you can
  predesign the primers to cover these repetitive regions based on the data
  you have, so that you can use the same sequencing primers for all of the
  different copies.
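
  As a rough sketch of the predesign idea, you can work out in advance
  where sequencing primers need to sit along the repeat so that the reads
  tile across it with some overlap.  The usable read length (600 bases)
  and overlap (100 bases) below are illustrative assumptions, not figures
  from this thread, and the function name is hypothetical.  Actual primer
  sequences would still need to be picked near each position.

```python
def primer_start_positions(region_len, read_len=600, overlap=100):
    """Start positions for reads that tile a region with a given overlap."""
    if read_len <= 0 or not 0 <= overlap < read_len:
        raise ValueError("need read_len > 0 and 0 <= overlap < read_len")
    step = read_len - overlap
    positions = []
    pos = 0
    while True:
        positions.append(pos)
        # Stop once the current read reaches the end of the region.
        if pos + read_len >= region_len:
            break
        pos += step
    return positions


if __name__ == "__main__":
    # A 5 kb repeat, as in the rRNA example.
    print(primer_start_positions(5000))
```

  The same set of positions then applies to every copy of the repeat, so
  one batch of sequencing primers can be ordered up front.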


  I could go on and on about what to do in different situations but I think
  this covers the main areas.


  Good Luck and Have Fun. Oh and be prepared for frustration and
  disappointment ;-)


  Shaun





  From: Shankar Manoharan <shankarmanostar@xxxxxxxxx>
  To: mira_talk@xxxxxxxxxxxxx
  Date: 2012-05-06 01:05 PM
  Subject: [mira_talk] Gap closure
  Sent by: mira_talk-bounce@xxxxxxxxxxxxx




  Dear all...
       Firstly, thanks a LOT for your earlier support. I have now managed to
  assemble a ~4.5 Mb genome and obtain 48 contigs, of which 14 are repeats.
  If I have to close gaps, what is the best way to do it? Any
  input would be greatly appreciated. Many thanks in advance.

  Best regards,

  Shankar Manoharan
  Graduate Student
  Department of Genetics
  Madurai Kamaraj University
  Ph. +919790167534

  I strongly believe in doing my best and leaving the rest to God





