[mira_talk] Re: multiple bacteria strains in my sequencing run

From: Chris Hoefler <hoeflerb@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Mon, 29 Jun 2015 12:35:00 -0500

<quote>If I had a sample which I knew had (say) 5 strains within, where
each strain had a different sequence for a gene, will Mira provide me with
5 separate assemblies (presuming each gene was distinct enough)?</quote>

Short answer: yes

Longer answer:
Mira is designed to distinguish between repeats that differ by
single-nucleotide variations. In the context of a single organism, Mira
will assemble repetitive regions into separate contigs if it detects
differences between those repeats. In the context of multiple organisms (or
a single organism with multiple chromosome copies), nearly identical
sequences that differ by single nucleotides and originate from different
organisms (or chromosomal copies) will be assembled into separate contigs.
The caveat is a combination of coverage depth, coverage consistency, and
sequencing errors. To distinguish sequencing errors from true variations,
Mira relies on k-mer frequencies, which are heavily influenced by coverage
variations. So if you have too much coverage, too little coverage, or large
differences in coverage, repeats/variations can be missed or sequencing
errors can be called as repeats. So while in an ideal scenario you would
get 5 assembled genes for 5 organisms in a pool, in reality you will likely
get more or fewer than that.
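
To make the coverage caveat concrete, here is a toy Python sketch. It is
not Mira's actual algorithm, only an illustration of the general idea of
separating frequent ("solid") k-mers from rare ("weak") ones. The
sequences, the value of k, and the min_count threshold are all made up for
the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def classify_kmers(counts, min_count):
    """Split k-mers into 'solid' (seen >= min_count times) and 'weak'."""
    solid = {kmer for kmer, n in counts.items() if n >= min_count}
    weak = set(counts) - solid
    return solid, weak


# Invented toy sequences: two "strains" differing at one base, plus one
# read carrying a single sequencing error.
strain_a = "ACGTTGCAAGGCTTAC"
strain_b = "ACGTTGCATGGCTTAC"    # A->T at index 8: a true variant
error_read = "ACGTTGCAAGGCGTAC"  # T->G at index 12: a one-off error

# Adequate coverage of both strains: the variant k-mers are solid.
reads = [strain_a] * 10 + [strain_b] * 4 + [error_read]
counts = kmer_counts(reads, k=5)
solid, weak = classify_kmers(counts, min_count=3)

# Too little coverage of strain B: the same variant now looks like an error.
low_reads = [strain_a] * 10 + [strain_b] * 1 + [error_read]
low_counts = kmer_counts(low_reads, k=5)
low_solid, low_weak = classify_kmers(low_counts, min_count=3)

print("TGCAT solid at good coverage:", "TGCAT" in solid)
print("TGCAT solid at low coverage: ", "TGCAT" in low_solid)
```

With four reads from the second strain, the k-mers spanning the variant
clear the threshold and the error k-mers do not. With only one read from
that strain, the variant and the error have the same count and cannot be
told apart by frequency alone, which is the situation described above.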

That said, Mira will do everything it can to avoid misassemblies. So if
there is sufficient evidence of two non-identical gene copies, it won't
assemble them together. Mira also makes heavy use of tags to let you know
how it makes decisions regarding contig building and breaking. So
definitely look at the tags when you do your analysis (SRMc and SROr are
probably the ones to focus on the most).
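
As a rough way to get an overview of those tags, the following Python
sketch counts how many lines of a whitespace-delimited text export mention
each tag. The file layout is an assumption on my part (check the manual for
the real format of your tag lists), and the demo lines below are invented
purely to show the function working.

```python
from collections import Counter

# Tag names taken from the discussion above.
TAGS_OF_INTEREST = ("SRMc", "SROr")


def count_tags(lines, tags=TAGS_OF_INTEREST):
    """Count lines that contain each tag as a whole whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        for tag in tags:
            if tag in fields:
                counts[tag] += 1
    return counts


# Invented example lines; a real tag list will have its own columns.
demo_lines = [
    "contig_1\t1042\t1042\tSRMc",
    "contig_1\t2310\t2310\tSRMc",
    "read_77\t88\t88\tSROr",
    "contig_2\t15\t15\tOTHER",
]
result = count_tags(demo_lines)
print(dict(result))
```

To run it on a real file, pass an open file handle instead of the demo
list, e.g. `count_tags(open(path))`, where `path` is whatever tag list you
exported from your assembly.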

There is a lot of good information in the manual about the tags and how
Mira makes decisions regarding potential repeats. Sections of particular
interest:
3.7 Tags used in the assembly by MIRA and EdIt.
3.8 Where reads end up: contigs, singlets, debris
3.9 Detection of bases distinguishing non-perfect repeats and SNP discovery
3.11.2 Ploidy and repeats
3.11.3 Handling of repeats
9.2 First look: the assembly info
9.5 Places of importance in a de-novo assembly

On Wed, Jun 24, 2015 at 3:35 PM, Scott Christley <schristley@xxxxxxx> wrote:

Hello,

I have an Illumina paired-end 2x150 sequencing run of about 30 million
reads for a wild-type bacterial sample. The sample came from a gut
microbiome and Enterococcus faecalis was extracted using a selection
culture plate. It is my belief that this sample actually contains a
mixture of multiple strains of E. faecalis. This is okay though, in fact
this is very much what I’m interested in. I want to be able to study this
natural mixture of strains and analyze the genomic variation. I have a
question about Mira’s output and whether my interpretation of the assembly
is correct. Also I’m curious if anybody has comments on my process.

I first aligned (bowtie2) all my reads to a reference genome; about 70% of
the reads aligned. Then I took the unaligned reads and aligned them to a
set of plasmids, etc., to remove that stuff. I then gave the remaining
unaligned reads to Mira to assemble. The result is about 20k+ contigs; the
default long contig filter gives a few hundred contigs. I’ve gone and
aligned many of these contigs to the reference genome, and quite a few
mapped to genes.

My question is, am I correct in assuming that these assemblies are valid
alternative sequences for genes? That is, they could be sequences for
other strains in my sample?

If I had a sample which I knew had (say) 5 strains within, where each
strain had a different sequence for a gene, will Mira provide me with 5
separate assemblies (presuming each gene was distinct enough)?

thanks!
Scott

--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128
