[mira_talk] RfC: Getting mean genomes assembled

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Mon, 15 Jun 2009 23:17:05 +0200
Dear all,

the following is a guide I didn't have time to write up yet, but questions 
concerning this pop-up quite often now and I've quickly made a first draft. 
Please feel free to comment if anything is unclear.

Furthermore, for those of you who have eukaryotic or plant genomes and 
assembled them with version 2.9.45 or later: may I kindly ask you to send me 
the part of the log file which contains hash statistics? I'd add these as 
examples so that, with time, a small collection comes together to help people 
out in spotting problems.

Bastien

------------------------------------------------------------------------------

  Table of Contents

     * [1]Getting 'mean' genomes assembled
          + [2]For the impatient
          + [3]Introduction to 'masking'
          + [4]How does "nasty repeat" masking work?
          + [5]Selecting a "nasty repeat ratio"
     * [6]Finding out repetitive parts in reads
     * [7]Seeing masked stretches in assembly editors
     * [8]Examples for hash statistics
          + [9]Caveat: -SK:bph
          + [10]Sanger sequencing, a simple bacterium
          + [11]454 Sequencing, a somewhat more complex bacterium
          + [12]Solexa sequencing, E.coli MG1655
          + [13](NEED EXAMPLES FOR EUKARYOTES)
          + [14](NEED EXAMPLES FOR PATHOLOGICAL CASES)

Getting 'mean' genomes assembled

   For some genomes you might want to assemble, MIRA will take too long or
   the available memory will not be sufficient. This can be the case for
   eukaryotes, plants, but also for some bacteria which contain high
   number of (pro-)phages, plasmids or engineered operons.

   This guide is intended to get you through these problematic genomes. It
   is (cannot be) exhaustive, but it should get you going.

    For the impatient

   Use -SK:mnr=yes:nrr=10 and give it a try. If that does not work,
   decrease -SK:nrr to anywhere between 5 and 9. If it worked well enough
   increase the -SK:nrr parameter up to 15 or 20. But please also read on
   to see how to choose the "nrr" threshold.

    Introduction to 'masking'

   The SKIM phase (all-against-all comparison) will report almost every
   potential hit to be checked with Smith-Waterman further downstream in
   the MIRA assembly process. While this is absolutely no problem for most
   bacteria, some genomes (eukaryotes, plants, some bacteria) have so many
   closely related sequences (repeats) that the data structures needed to
   take up all information might get much larger than your available
   memory. In those cases, your only chance to still get an assembly is to
   tell the assembler it should disregard extremely repetitive features of
   your genome.

   There is, in most cases, one problem: one doesn't know beforehand which
   parts of the genome are extremely repetitive. But MIRA can help you
   here as it produces most of the needed information during assembly and
   you just need to choose a threshold from where on MIRA won't care about
   repetitive matches.

   The key to this are the two fail-safe command line parameters which
   will mask "nasty" repeats from the quick overlap finder (SKIM): -SK:mnr
   and -SK:nrr=10. -SK:bph also plays a role in this, but I'll come back
   to this later).

    How does "nasty repeat" masking work?

   If switched on -SK:mnr=yes, MIRA will use SKIM3 k-mer statistics to
   find repetitive stretches. K-mers are nucleotide stretches of length k.
   In a perfectly sequenced genome without any sequencing error and
   without sequencing bias, the k-mer frequency can be used to assess how
   many times a given nucleotide stretch is present in the genome: if a
   specific k-mer is present as many times as the average frequency of all
   k-mers, it is a reasonable assumption to estimate that the specific
   k-mer is not part of a repeat (at least not in this genome).

   Following the same path of thinking, if a specific k-mer frequency is
   now two times higher than the average of all k-mers, one would assume
   that this specific k-mer is part of a repeat which occurs exactly two
   times in the genome. For 3x k-mer frequency, a repeat is present three
   times. Etc.pp. MIRA will merge information on single k-mers frequency
   into larger 'repeat' stretches and tag these stretches accordingly.

   Of course, low-complexity nucleotide stretches (like poly-A in
   eukaryotes), sequencing errors in reads and non-uniform distribution of
   reads in a sequencing project will weaken the initial assumption that a
   k-mer frequency is representative for repeat status. But even then the
   k-mer frequency model works quite well and will give a pretty good
   overall picture: most repeats will be tagged as such.

   Note that the parts of reads tagged as "nasty repeat" will not get
   masked per se, the sequence will still be present. The stretches dubbed
   repetitive will get the "MNRr" tag. They will still be used in
   Smith-Waterman overlaps and will generate a correct consensus if
   included in an alignment, but they will not be used as seed.

   Some reads will invariably end up being completely repetitive. These
   will not be assembled into contigs as MIRA will not see overlaps as
   they'll be completely masked away. These reads will end up as debris.
   However, note that MIRA is pretty good at discerning 100% matching
   repeats from repeats which are not 100% matching: if there's a single
   base with which repeats can be discerned from each other, MIRA will
   find this base and use the k-mers covering that base to find overlaps.

    Selecting a "nasty repeat ratio"

   The ratio from which on the MIRA SKIM algorithm won't report matches is
   set via -SK:nrr. E.g., using -SK:nrr=10 will hide all k-mers which
   occur at a frequency 10 times (or more) higher than the median of all
   k-mers.

   The nastiness of a repeat is difficult to judge, but starting with 10
   copies in a genome, things can get complicated. At 20 copies, you'll
   have some troubles for sure.

   The standard values of 10 for the -SK:nrr parameter is a pretty good
   'standard' value which can be tried for an assembly before trying to
   optimize it via studying the hash statistics calculated by MIRA. For
   the later, please read the section 'Examples for hash statistics'
   further down in this guide.

Finding out repetitive parts in reads

   If -SK:mnr=yes is used, MIRA will write an additional file into the log
   directory: _int_skimmarknastyrepeats_nastyseq_preassembly.0.lst

   The "nastyseq" file makes it possible to try and find out what makes
   sequencing data nasty. It's a key-value file with the name of the
   sequence as "key" and the nasty sequence as "value". "Nasty" in this
   case means "everything which was masked via -SKmnr=yes.

   The file looks like this:

sequence1     GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
sequence2     CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
sequence3     GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
etc.

   One will need to search some databases with the "nasty" sequences and
   find vector sequences, adaptor sequences or even human sequences in
   bacterial or plant genomes ... or vice versa as this type of
   contamination happens quite easily with data from new sequencing
   technologies. After a while one gets a feeling what constitutes the
   largest part of the problem and one can start to think of taking
   countermeasures like filtering, clipping, masking etc.

Seeing masked stretches in assembly editors

   During SKIM phase, MIRA will assign frequency information to each and
   every k-mer of all reads in a sequencing project, giving them different
   status. Additionally, tags are set in the reads so that one can assess
   reads in assembly editors that understand tags (like gap4, gap5, consed
   etc.). The following tags are used:
   HAF2
       below average (<0.5 times average)
   HAF3
       average (>=0.5 times average and <= 1.5 times average)
   HAF4
       above average (>1.5 times average and < 2 times average)
   HAF5
       repeat (>=2 times average and < 5 times average)
   HAF6
       'crazy' repeat (> 5 times average)
   MNRr
       stretches which were masked away by -SK:mnr=yes

Examples for hash statistics

   Selecting the right ratio so that an assembly fits into your memory is
   not straight forward. But MIRA can help you a bit: during assembly,
   some frequency statistics are printed out (they'll probably end up in
   some info file in later releases). Search for the term "Hash
   statistics" in the information printed out by MIRA (this happens quite
   early in the process)

    Caveat: -SK:bph

   Some explanation how bph affects the statistics and why it should be
   chosen >=16 for -SK:mnr

    Sanger sequencing, a simple bacterium

   This example is taken from a pretty standard bacterium where Sanger
   sequencing was used:

Hash statistics:
=========================================================
Measured avg. coverage: 15

Deduced thresholds:
-------------------
Min normal cov: 7
Max normal cov: 23
Repeat cov: 29
Crazy cov: 120
Mask cov: 150

Repeat ratio histogram:
-----------------------
0       475191
1       5832419
2       181994
3       6052
4       4454
5       972
6       4
7       8
14      2
16      10
=========================================================

   The above can be interpreted like this: the expected coverage of the
   genome is 15x. Starting with an estimated hash frequency of 29, MIRA
   will treat a k-mer as 'repetitive'. As shown in the histogram, the
   overall picture of this project is pretty healthy:
     * only a small fraction of k-mers have a repeat level of '0' (these
       would be k-mers in regions with quite low coverage or k-mers
       containing sequencing errors)
     * the vast majority of k-mers have a repeat level of 1 (so that's
       non- repetitive coverage)
     * there is a small fraction of k-mers with repeat level of 2-10
     * there are almost no k-mers with a repeat level >10

    454 Sequencing, a somewhat more complex bacterium

   Here's in comparison a profile for a more complicated bacterium (454
   sequencing):

Hash statistics:
=========================================================
Measured avg. coverage: 20

Deduced thresholds:
-------------------
Min normal cov: 10
Max normal cov: 30
Repeat cov: 38
Crazy cov: 160
Mask cov: 0

Repeat ratio histogram:
-----------------------
0       8292273
1       6178063
2       692642
3       55390
4       10471
5       6326
6       5568
7       3850
8       2472
9       708
10      464
11      270
12      140
13      136
14      116
15      64
16      54
17      54
18      52
19      50
20      58
21      36
22      40
23      26
24      46
25      42
26      44
27      32
28      38
29      44
30      42
31      62
32      116
33      76
34      80
35      82
36      142
37      100
38      120
39      94
40      196
41      172
42      228
43      226
44      214
45      164
46      168
47      122
48      116
49      98
50      38
51      56
52      22
53      14
54      8
55      2
56      2
57      4
87      2
89      6
90      2
92      2
93      2
1177    2
1181    2
=========================================================

   The difference to the first bacterium shown is pretty striking:
     * first, the k-mers in repeat level 0 (below average) is higher than
       the k-mers of level 1! This points to a higher number of sequencing
       errors in the 454 reads than in the Sanger project shown
       previously. Or at a more uneven distribution of reads (but not in
       this special case).
     * second, the repeat level histogram does not trail of at a repeat
       frequency of 10 or 15, but it has a long tail up to the fifties,
       even having a local maximum at 42. This points to a small part of
       the genome being heavily repetitive ... or to (a) plasmid(s) in
       high copy numbers.

   Should MIRA ever have problems with this genome, switch on the nasty
   repeat masking and use a level of 15 as cutoff. In this case, 15 is OK
   to start with as a) it's a bacterium, it can't be that hard and b) the
   frequencies above level 5 are in the low thousands and not in the tens
   of thousands.

    Solexa sequencing, E.coli MG1655

Hash statistics:
=========================================================
Measured avg. coverage: 23

Deduced thresholds:
-------------------
Min normal cov: 11
Max normal cov: 35
Repeat cov: 44
Crazy cov: 184
Mask cov: 0

Repeat ratio histogram:
-----------------------
0       1365693
1       8627974
2       157220
3       11086
4       4990
5       3512
6       3922
7       4904
8       3100
9       1106
10      868
11      788
12      400
13      186
14      28
15      10
16      12
17      4
18      4
19      2
20      14
21      8
25      2
26      8
27      2
28      4
30      2
31      2
36      4
37      6
39      4
40      2
45      2
46      8
47      14
48      8
49      4
50      2
53      2
56      6
59      4
62      2
63      2
67      2
68      2
70      2
73      4
75      2
77      4
=========================================================

   This hash statistics shows that MG1655 is pretty boring (from a
   repetitive point of view). One might expect a few repeats but nothing
   fancy: The repeats are actually the rRNA and sRNA stretches in the
   genome plus some intergenic regions.
     * the k-mers number in repeat level 0 (below average) is considerably
       lower than the level 1, so the Solexa sequencing quality is pretty
       good respectively there shouldn't be too many low coverage areas.
     * the histogram tail shows some faint traces of possibly highly
       repetitive k-mers, but these are false positive matches due to some
       standard Solexa base-calling weaknesses of earlier pipelines like,
       e.g., adding poly-A, poly-T or sometimes poly-C and poly-G tails to
       reads when spots in the images were faint and the base calls of bad
       quality

    (NEED EXAMPLES FOR EUKARYOTES)


    (NEED EXAMPLES FOR PATHOLOGICAL CASES)

   Vector contamination etc.



-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
References:
- [mira_talk] denovo assembly runs out of memory
  - From: Andreas Petzold
[mira_talk] RfC: Getting mean genomes assembled

Other related posts: