[mira_talk] RfC: Getting mean genomes assembled
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Mon, 15 Jun 2009 23:17:05 +0200
Dear all,
the following is a guide I didn't have time to write up yet, but questions
concerning this pop-up quite often now and I've quickly made a first draft.
Please feel free to comment if anything is unclear.
Furthermore, for those of you who have eukaryotic or plant genomes and
assembled them with version 2.9.45 or later: may I kindly ask you to send me
the part of the log file which contains hash statistics? I'd add these as
examples so that, with time, a small collection comes together to help people
out in spotting problems.
Bastien
------------------------------------------------------------------------------
Table of Contents
* [1]Getting 'mean' genomes assembled
+ [2]For the impatient
+ [3]Introduction to 'masking'
+ [4]How does "nasty repeat" masking work?
+ [5]Selecting a "nasty repeat ratio"
* [6]Finding out repetitive parts in reads
* [7]Seeing masked stretches in assembly editors
* [8]Examples for hash statistics
+ [9]Caveat: -SK:bph
+ [10]Sanger sequencing, a simple bacterium
+ [11]454 Sequencing, a somewhat more complex bacterium
+ [12]Solexa sequencing, E.coli MG1655
+ [13](NEED EXAMPLES FOR EUKARYOTES)
+ [14](NEED EXAMPLES FOR PATHOLOGICAL CASES)
Getting 'mean' genomes assembled
For some genomes you might want to assemble, MIRA will take too long or
the available memory will not be sufficient. This can be the case for
eukaryotes, plants, but also for some bacteria which contain high
number of (pro-)phages, plasmids or engineered operons.
This guide is intended to get you through these problematic genomes. It
is (cannot be) exhaustive, but it should get you going.
For the impatient
Use -SK:mnr=yes:nrr=10 and give it a try. If that does not work,
decrease -SK:nrr to anywhere between 5 and 9. If it worked well enough
increase the -SK:nrr parameter up to 15 or 20. But please also read on
to see how to choose the "nrr" threshold.
Introduction to 'masking'
The SKIM phase (all-against-all comparison) will report almost every
potential hit to be checked with Smith-Waterman further downstream in
the MIRA assembly process. While this is absolutely no problem for most
bacteria, some genomes (eukaryotes, plants, some bacteria) have so many
closely related sequences (repeats) that the data structures needed to
take up all information might get much larger than your available
memory. In those cases, your only chance to still get an assembly is to
tell the assembler it should disregard extremely repetitive features of
your genome.
There is, in most cases, one problem: one doesn't know beforehand which
parts of the genome are extremely repetitive. But MIRA can help you
here as it produces most of the needed information during assembly and
you just need to choose a threshold from where on MIRA won't care about
repetitive matches.
The key to this are the two fail-safe command line parameters which
will mask "nasty" repeats from the quick overlap finder (SKIM): -SK:mnr
and -SK:nrr=10. -SK:bph also plays a role in this, but I'll come back
to this later).
How does "nasty repeat" masking work?
If switched on -SK:mnr=yes, MIRA will use SKIM3 k-mer statistics to
find repetitive stretches. K-mers are nucleotide stretches of length k.
In a perfectly sequenced genome without any sequencing error and
without sequencing bias, the k-mer frequency can be used to assess how
many times a given nucleotide stretch is present in the genome: if a
specific k-mer is present as many times as the average frequency of all
k-mers, it is a reasonable assumption to estimate that the specific
k-mer is not part of a repeat (at least not in this genome).
Following the same path of thinking, if a specific k-mer frequency is
now two times higher than the average of all k-mers, one would assume
that this specific k-mer is part of a repeat which occurs exactly two
times in the genome. For 3x k-mer frequency, a repeat is present three
times. Etc.pp. MIRA will merge information on single k-mers frequency
into larger 'repeat' stretches and tag these stretches accordingly.
Of course, low-complexity nucleotide stretches (like poly-A in
eukaryotes), sequencing errors in reads and non-uniform distribution of
reads in a sequencing project will weaken the initial assumption that a
k-mer frequency is representative for repeat status. But even then the
k-mer frequency model works quite well and will give a pretty good
overall picture: most repeats will be tagged as such.
Note that the parts of reads tagged as "nasty repeat" will not get
masked per se, the sequence will still be present. The stretches dubbed
repetitive will get the "MNRr" tag. They will still be used in
Smith-Waterman overlaps and will generate a correct consensus if
included in an alignment, but they will not be used as seed.
Some reads will invariably end up being completely repetitive. These
will not be assembled into contigs as MIRA will not see overlaps as
they'll be completely masked away. These reads will end up as debris.
However, note that MIRA is pretty good at discerning 100% matching
repeats from repeats which are not 100% matching: if there's a single
base with which repeats can be discerned from each other, MIRA will
find this base and use the k-mers covering that base to find overlaps.
Selecting a "nasty repeat ratio"
The ratio from which on the MIRA SKIM algorithm won't report matches is
set via -SK:nrr. E.g., using -SK:nrr=10 will hide all k-mers which
occur at a frequency 10 times (or more) higher than the median of all
k-mers.
The nastiness of a repeat is difficult to judge, but starting with 10
copies in a genome, things can get complicated. At 20 copies, you'll
have some troubles for sure.
The standard values of 10 for the -SK:nrr parameter is a pretty good
'standard' value which can be tried for an assembly before trying to
optimize it via studying the hash statistics calculated by MIRA. For
the later, please read the section 'Examples for hash statistics'
further down in this guide.
Finding out repetitive parts in reads
If -SK:mnr=yes is used, MIRA will write an additional file into the log
directory: _int_skimmarknastyrepeats_nastyseq_preassembly.0.lst
The "nastyseq" file makes it possible to try and find out what makes
sequencing data nasty. It's a key-value file with the name of the
sequence as "key" and the nasty sequence as "value". "Nasty" in this
case means "everything which was masked via -SKmnr=yes.
The file looks like this:
sequence1 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
sequence2 CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
sequence3 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
etc.
One will need to search some databases with the "nasty" sequences and
find vector sequences, adaptor sequences or even human sequences in
bacterial or plant genomes ... or vice versa as this type of
contamination happens quite easily with data from new sequencing
technologies. After a while one gets a feeling what constitutes the
largest part of the problem and one can start to think of taking
countermeasures like filtering, clipping, masking etc.
Seeing masked stretches in assembly editors
During SKIM phase, MIRA will assign frequency information to each and
every k-mer of all reads in a sequencing project, giving them different
status. Additionally, tags are set in the reads so that one can assess
reads in assembly editors that understand tags (like gap4, gap5, consed
etc.). The following tags are used:
HAF2
below average (<0.5 times average)
HAF3
average (>=0.5 times average and <= 1.5 times average)
HAF4
above average (>1.5 times average and < 2 times average)
HAF5
repeat (>=2 times average and < 5 times average)
HAF6
'crazy' repeat (> 5 times average)
MNRr
stretches which were masked away by -SK:mnr=yes
Examples for hash statistics
Selecting the right ratio so that an assembly fits into your memory is
not straight forward. But MIRA can help you a bit: during assembly,
some frequency statistics are printed out (they'll probably end up in
some info file in later releases). Search for the term "Hash
statistics" in the information printed out by MIRA (this happens quite
early in the process)
Caveat: -SK:bph
Some explanation how bph affects the statistics and why it should be
chosen >=16 for -SK:mnr
Sanger sequencing, a simple bacterium
This example is taken from a pretty standard bacterium where Sanger
sequencing was used:
Hash statistics:
=========================================================
Measured avg. coverage: 15
Deduced thresholds:
-------------------
Min normal cov: 7
Max normal cov: 23
Repeat cov: 29
Crazy cov: 120
Mask cov: 150
Repeat ratio histogram:
-----------------------
0 475191
1 5832419
2 181994
3 6052
4 4454
5 972
6 4
7 8
14 2
16 10
=========================================================
The above can be interpreted like this: the expected coverage of the
genome is 15x. Starting with an estimated hash frequency of 29, MIRA
will treat a k-mer as 'repetitive'. As shown in the histogram, the
overall picture of this project is pretty healthy:
* only a small fraction of k-mers have a repeat level of '0' (these
would be k-mers in regions with quite low coverage or k-mers
containing sequencing errors)
* the vast majority of k-mers have a repeat level of 1 (so that's
non- repetitive coverage)
* there is a small fraction of k-mers with repeat level of 2-10
* there are almost no k-mers with a repeat level >10
454 Sequencing, a somewhat more complex bacterium
Here's in comparison a profile for a more complicated bacterium (454
sequencing):
Hash statistics:
=========================================================
Measured avg. coverage: 20
Deduced thresholds:
-------------------
Min normal cov: 10
Max normal cov: 30
Repeat cov: 38
Crazy cov: 160
Mask cov: 0
Repeat ratio histogram:
-----------------------
0 8292273
1 6178063
2 692642
3 55390
4 10471
5 6326
6 5568
7 3850
8 2472
9 708
10 464
11 270
12 140
13 136
14 116
15 64
16 54
17 54
18 52
19 50
20 58
21 36
22 40
23 26
24 46
25 42
26 44
27 32
28 38
29 44
30 42
31 62
32 116
33 76
34 80
35 82
36 142
37 100
38 120
39 94
40 196
41 172
42 228
43 226
44 214
45 164
46 168
47 122
48 116
49 98
50 38
51 56
52 22
53 14
54 8
55 2
56 2
57 4
87 2
89 6
90 2
92 2
93 2
1177 2
1181 2
=========================================================
The difference to the first bacterium shown is pretty striking:
* first, the k-mers in repeat level 0 (below average) is higher than
the k-mers of level 1! This points to a higher number of sequencing
errors in the 454 reads than in the Sanger project shown
previously. Or at a more uneven distribution of reads (but not in
this special case).
* second, the repeat level histogram does not trail of at a repeat
frequency of 10 or 15, but it has a long tail up to the fifties,
even having a local maximum at 42. This points to a small part of
the genome being heavily repetitive ... or to (a) plasmid(s) in
high copy numbers.
Should MIRA ever have problems with this genome, switch on the nasty
repeat masking and use a level of 15 as cutoff. In this case, 15 is OK
to start with as a) it's a bacterium, it can't be that hard and b) the
frequencies above level 5 are in the low thousands and not in the tens
of thousands.
Solexa sequencing, E.coli MG1655
Hash statistics:
=========================================================
Measured avg. coverage: 23
Deduced thresholds:
-------------------
Min normal cov: 11
Max normal cov: 35
Repeat cov: 44
Crazy cov: 184
Mask cov: 0
Repeat ratio histogram:
-----------------------
0 1365693
1 8627974
2 157220
3 11086
4 4990
5 3512
6 3922
7 4904
8 3100
9 1106
10 868
11 788
12 400
13 186
14 28
15 10
16 12
17 4
18 4
19 2
20 14
21 8
25 2
26 8
27 2
28 4
30 2
31 2
36 4
37 6
39 4
40 2
45 2
46 8
47 14
48 8
49 4
50 2
53 2
56 6
59 4
62 2
63 2
67 2
68 2
70 2
73 4
75 2
77 4
=========================================================
This hash statistics shows that MG1655 is pretty boring (from a
repetitive point of view). One might expect a few repeats but nothing
fancy: The repeats are actually the rRNA and sRNA stretches in the
genome plus some intergenic regions.
* the k-mers number in repeat level 0 (below average) is considerably
lower than the level 1, so the Solexa sequencing quality is pretty
good respectively there shouldn't be too many low coverage areas.
* the histogram tail shows some faint traces of possibly highly
repetitive k-mers, but these are false positive matches due to some
standard Solexa base-calling weaknesses of earlier pipelines like,
e.g., adding poly-A, poly-T or sometimes poly-C and poly-G tails to
reads when spots in the images were faint and the base calls of bad
quality
(NEED EXAMPLES FOR EUKARYOTES)
(NEED EXAMPLES FOR PATHOLOGICAL CASES)
Vector contamination etc.
--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html
Other related posts:
- » [mira_talk] RfC: Getting mean genomes assembled - Bastien Chevreux