From: Adam Witney <awitney@xxxxxxxxxx>
> I have Ion Torrent data from two strains of Pseudomonas (~6Mb genome). The
> datasets are similar in terms of numbers of reads, but the assembly seems to
> work better for the first strain than the second. Here are some of the stats
> (I have removed some lines for brevity):
> [...]
> My definition of 'better' by the way is higher N50, size of contigs, number
> of contigs etc. Other things I noticed were that the first strain assembled
> more quickly and used less temp disk space than the second.
>
> My question is, is this just down to natural variation when sequencing two
> strains (although these two strains do look quite similar by BLAST etc), or
> is there something else in the data i should filter out before assembling?
> Note there may also be a plasmid, although I have not yet found any.

IonTorrent is still pretty new to MIRA and the technology is evolving / changing rapidly at the moment, so MIRA sometimes fails to deliver good assemblies for reasons unknown to me. E.g., the new, high-coverage public data for E. coli long reads at the Ion site does not really get assembled well.

However, very similar strains should give quite similar results. If they do not, then either
- they differ quite drastically in genome organisation
- the sequencing data of one strain contains a sequencing artefact the other does not have
- one strain has been harvested under different conditions than the other. Growth comes to mind: for de-novo assemblies, the organism DNA really should be prepared while the strain / organism is not growing. Sometimes very closely related strains have totally different growth curves, and when this is not checked, fun is almost guaranteed.
- MIRA has a bug

Anyway, there are a couple of clues in the information you posted: the coverage information.
"Large contigs" in both strains have a coverage of ~20x, yet the first strain has a contig with a max coverage of 680x, while the other strain (the one with the longer assembly time) has a contig with a max coverage of 790x. In both cases the fold difference of 34 to 39 (the ratio between 680x/790x and 20x) is a lot higher than I am used to from "normal" bacteria. That would be my first angle of attack: what are these high-coverage contigs, and why does one strain seem to have a couple more than the other?

Second thing to look at: the kmer repeat histogram (hash statistics), which you did not post but which can tell you quite a bit.

Third thing: after the hash statistics, have a look at the read repeat info file, and there especially at the stretches tagged MNRr. They can be quite informative regarding either sequencing artefacts (some kind of adaptor not clipped) or really high copy number stretches.

Best,
  Bastien

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html
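
For illustration, the fold difference discussed in the mail above is simply the ratio of a contig's maximum coverage to the average coverage of the large contigs. Below is a minimal Python sketch, assuming the coverage values have been copied by hand from the assembly statistics. The function names, the contig names and the 10x cutoff are hypothetical and are not something MIRA provides.

```python
def fold_difference(max_coverage, avg_coverage):
    """Ratio of a contig's maximum coverage to the average coverage."""
    if avg_coverage <= 0:
        raise ValueError("average coverage must be positive")
    return max_coverage / avg_coverage


def high_coverage_contigs(contig_coverage, avg_coverage, min_fold=10.0):
    """Return (name, fold) pairs at or above min_fold, highest fold first.

    contig_coverage maps a contig name to its maximum coverage.
    min_fold is an arbitrary, hypothetical cutoff for this sketch.
    """
    folds = [
        (name, fold_difference(cov, avg_coverage))
        for name, cov in contig_coverage.items()
    ]
    flagged = [(name, fold) for name, fold in folds if fold >= min_fold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Values from the mail: ~20x average, max coverage of 680x and 790x.
    print(fold_difference(680, 20))  # 34.0
    print(fold_difference(790, 20))  # 39.5

    # Hypothetical contig names and coverages, for illustration only.
    example = {"contig_a": 22.0, "contig_b": 790.0, "contig_c": 310.0}
    for name, fold in high_coverage_contigs(example, 20.0):
        print(name, fold)
```

Listing the contigs that exceed the cutoff for each strain side by side would be one way to see which high-coverage contigs the second strain has that the first does not.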