[mira_talk] Re: one strain assembles better than a second similar strain?

  • From: "Bastien Chevreux" <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 20 Oct 2011 18:33:01 +0200 (MEST)

From: Adam Witney <awitney@xxxxxxxxxx>
> I have Ion Torrent data from two strains of Pseudomonas (~6Mb genome). The
> datasets are similar in terms of numbers of reads, but the assembly seems to
> work better for the first strain than the second. Here are some of the stats
> (I have removed some lines for brevity):
> [...]
> My definition of 'better' by the way is higher N50, size of contigs, number
> of contigs etc. Other things I noticed were that the first strain assembled
> more quickly and used less temp disk space than the second.
> 
> My question is, is this just down to natural variation when sequencing two
> strains (although these two strains do look quite similar by BLAST etc), or
> is there something else in the data I should filter out before assembling?
> Note there may also be a plasmid, although I have not yet found any.

Ion Torrent is still pretty new to MIRA and the technology is evolving /
changing rapidly at the moment, so MIRA sometimes fails to deliver good
assemblies for reasons unknown to me. E.g., the new, high-coverage public data
for E. coli long reads at the Ion site does not really get assembled well.

However, very similar strains should give quite similar results. If they do
not, then one of the following is likely:

- the strains differ quite drastically in genome organisation
- the sequencing data of one strain contains a sequencing artefact the other
does not have
- one strain was harvested under different conditions than the other.
Growth comes to mind: for de-novo assemblies, the organism DNA really should be
prepared while the strain / organism is not growing. Sometimes very closely
related strains have totally different growth curves, and when this is not
checked, fun is almost guaranteed.
- MIRA has a bug

Anyway, there are a couple of clues in the information you posted: the coverage 
information. "Large contigs" in both strains have a coverage of ~20x, yet the 
first strain has a contig with a max coverage of 680x, while the other strain 
(the one with the longer assembly time) has a contig with max coverage of 790x.

In both cases the fold difference of 34 to 39 (the ratio between 20x and
680x/790x) is a lot higher than I am used to from "normal" bacteria. That would
be my first angle of attack: what are these high-coverage contigs, and why does
one strain seem to have a couple more than the other?
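
For reference, the fold difference above is just max coverage divided by the
average coverage of the large contigs. A minimal sketch in Python, using only
the numbers quoted in this thread (the function name is made up for
illustration and is not anything MIRA provides):

```python
def fold_over_average(max_coverage, avg_coverage):
    """Return how many times higher the peak coverage is than the average."""
    if avg_coverage <= 0:
        raise ValueError("average coverage must be positive")
    return max_coverage / avg_coverage


# Values taken from the posted stats: ~20x average, 680x and 790x maxima.
for strain, max_cov in (("strain 1", 680), ("strain 2", 790)):
    fold = fold_over_average(max_cov, 20)
    print(f"{strain}: {fold:.1f}-fold over average coverage")
```

Anything far above the low single digits one would expect from ordinary
repeats (rRNA operons, IS elements) deserves a closer look.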

Second thing to look at: the kmer repeat histogram (hash statistics), which you
did not post but which can tell you quite a bit.
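
To illustrate the idea behind such a histogram (this is not MIRA's own code or
output format, only a rough sketch that assumes the reads are available as
plain strings and picks k arbitrarily): count how often each kmer occurs, then
count how many distinct kmers share each occurrence count.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer in the reads (forward strand only, a simplification)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def frequency_histogram(counts):
    """Map an occurrence count n to the number of distinct k-mers seen n times."""
    return Counter(counts.values())


# Toy input; real data would be the reads of one strain and a larger k.
toy_reads = ["ACGTACGT", "ACGT"]
histogram = frequency_histogram(kmer_counts(toy_reads, k=4))
for occurrences in sorted(histogram):
    print(occurrences, histogram[occurrences])
```

In real data, the bulk of the kmers should sit around the average coverage; a
long tail at many times that value points to repeats or to artefacts.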

Third thing: after the hash statistics, have a look at the read repeat info
file, and there especially at the stretches tagged MNRr. They can be quite
informative regarding either sequencing artefacts (some kind of adaptor not
clipped) or really high copy number stretches.

Best,
  Bastien

