[mira_talk] Re: High Coverage Mapping Assembly

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Fri, 26 Sep 2014 09:09:42 -0500

This sounds like a pretty standard workflow. The only thing I would
recommend is that you map each mutant to the reference separately, rather
than in one big job. Mira is designed to handle multiple strains, so it is
more my preference than a Mira limitation, but I think the data analysis
will be easier, especially if you are intending to look at things like copy
number variation.


I will not pay attention to the MIRA snp´s file


Mira's SNP tagging is in fact one of its strongest features, so I wouldn't
discount it. The tags provide a convenient way to look through the genome
for features. Sometimes sequencing error can confound the analysis, but
Mira is pretty good at letting you know what it thinks is going on and
letting you see for yourself. If you give Mira an annotated gbk or gff3 to
use as a reference, the annotations will be copied to the final assembly
file, which can be really handy for the downstream analysis.


On Fri, Sep 26, 2014 at 12:53 AM, Ivan Imaz Rosshandler <
iimaz@xxxxxxxxxxxxxx> wrote:

> Hi,
>
> I am working with Said in this project, I could not wait to give you a
> reply. Chris, thanks a lot for your detailed answer, it is very helpful.
> The size of the genome is 34Mb but memory problems are already solved.
> Adrian is right, the purpose of this strategy could help to narrow your
> explanation. We basicaly want to build a reference genome that allows us to
> create a single annotation and then use it separately for downstream
> analysis of mutations, copy number alterations and even use it for RNAseq
> analysis. We are expecting that this reference includes the best supported
> evidence in terms of reads mapping against the backbone sequence but not
> everything. Then by mapping each strain against the new reference and
> having a single annotation, I could perform snp or other analysis. I will
> not pay attention to the MIRA snp´s file for the reasons you have given to
> me.
>
> Am I doing something naive?
>
> Thanks!!
>
> ----- Mensaje original -----
> De: Chris Hoefler <hoeflerb@xxxxxxxxx>
> Para: mira talk <mira_talk@xxxxxxxxxxxxx>
> Enviado: Thu, 25 Sep 2014 21:32:05 -0500 (CDT)
> Asunto: [mira_talk] Re: High Coverage Mapping Assembly
>
> You didn't say how large the genome is, so I'll guess high at ~18Mb? I've
> played with mapping multiple strains simultaneously, but I have concluded
> that it usually isn't very helpful for a few reasons.
>
> 1) While getting a memory usage estimate is fairly straightforward, CPU
> time is harder to predict. In my experience, mapping two strains separately
> in different processes is faster than mapping two strains simultaneously.
> While that may be counter-intuitive, it actually makes sense when you
> consider that Mira does some things multi-threaded and other things
> single-threaded. So with twice the work in a single thread, a batch job
> will take longer than single jobs each in their own process. Even so, you
> might be ok with the longer execution time for the convenience of one
> manifest file and one output assembly that you can then analyze (more on
> this below). For two strains that might be fine, but for nine strains at
> 18Mb per strain, this will take a very long time and is probably not worth
> it.
>
> 2) Consider Bastien's first point from the manual. Only here you are not
> dealing with cell-to-cell variability, you are dealing with
> strain-to-strain variability. In other words, the problem is amplified. You
> won't get contig breaking in the same way as you would with a denovo
> assembly, but Mira uses sequence context from neighboring reads to better
> deal with sequencing errors, so when you have subsets of reads with SNPs in
> different places, this may be problematic for the read mapping and/or
> feature tagging. The more strains the more problematic this can be. If you
> have larger differences--like deletions or rearrangements--you may miss
> those entirely.
>
> 3) Best case scenario, all the reads map to the right places, and you can
> conceivably detect SNPs. How will you analyze this? The SNP feature tags
> that Mira uses will apply to the entire assembly, not to individual
> strains. So while you can look at a single strain's consensus, you will
> have to find the SNPs manually. In some cases you might find special tags,
> such as IUPc or STMU, when two strains disagree about a SNP. You can
> quickly find these areas and then figure out what is going on by manual
> inspection. But this will be a much more complicated problem with nine
> strains. And again, if you have larger deletions or rearrangements in some
> of these strains, you may not see them without going through the debris
> file and trying to piece together what happened.
>
> Bottom line: I would just do each mapping separately. If you have a
> cluster that you can submit jobs on, just write a quick bash script to
> submit them one at a time in succession. You can analyze the jobs that
> finish first while the later jobs are still running.
>
>
> > On Sep 25, 2014, at 7:39 PM, Said Muñoz Montero <said3427@xxxxxxxxx>
> wrote:
> >
> > Hello Mira experts,
> >
> > I am doing a mapping assembly with 9 different parasite isolates
> simultaneously using a reference genome from the same specie. The genome
> variability between samples is low, except for copy number variation.  The
> total coverage after combining all samples is 360X so I changed the
> -NW:cac=stop parameter.
> >
> > I have read the warnings about similar tasks in MIRA mailing list but
> these are referred to a denovo assembly. Despite the computational
> resources needed. What do you think about these strategy?
> >
> > I would really appreciate any advice!
> >
> > Here are the warnings given by Bastien in the Mira Guide:
> >
> > "With todays' sequencing technologies (especially Illumina, but also Ion
> > Torrent and 454), many people simply take everything they get and throw
> it
> > into an assembly. Which, in the case of Illumina and Ion, can mean they
> try
> > to assemble their organism with a coverage of 100x, 200x and more (I've
> > seen trials with more than 1000x).
> >
> > This is not good. Not. At. All! For two reasons (well, three to be
> precise).
> > The first reason is that, usually, one does not sequence a single cell
> but
> > a population of cells. If this population is not clonal (i.e., it
> contains
> > subpopulations with genomic differences with each other), assemblers will
> > be able to pick up these differences in the DNA once a certain sequence
> > count is reached and they will try reconstruct a genome containing all
> > clonal variations, treating these variations as potential repeats with
> > slightly different sequences. Which, of course, will be wrong and I am
> > pretty sure you do not want that.
> >
> > The second and way more important reason is that none of the current
> > sequencing technologies is completely error free. Even more problematic,
> > they contain both random and non-random sequencing errors. Especially the
> > latter can become a big hurdle if these non-random errors are so
> prevalent
> > that they suddenly appear to be valid sequence to an assembler. This in
> > turn leads to false repeat detection, hence possibly contig breaks or
> even
> > wrong consensus sequence. You don't want that, do you?
> >
> > The last reason is that overlap based assemblers (like MIRA is) need
> > *exponentially* more time and memory when the coverage increases. So
> > keeping the coverage comparatively low helps you there."
> >
> > THANKS!!!
>
>
> --
> You have received this mail because you are subscribed to the mira_talk
> mailing list. For information on how to subscribe or unsubscribe, please
> visit http://www.chevreux.org/mira_mailinglists.html
>



-- 
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128

Other related posts: