Dear all, the following is a guide I didn't have time to write up yet, but questions concerning this pop-up quite often now and I've quickly made a first draft. Please feel free to comment if anything is unclear. Furthermore, for those of you who have eukaryotic or plant genomes and assembled them with version 2.9.45 or later: may I kindly ask you to send me the part of the log file which contains hash statistics? I'd add these as examples so that, with time, a small collection comes together to help people out in spotting problems. Bastien ------------------------------------------------------------------------------ Table of Contents * [1]Getting 'mean' genomes assembled + [2]For the impatient + [3]Introduction to 'masking' + [4]How does "nasty repeat" masking work? + [5]Selecting a "nasty repeat ratio" * [6]Finding out repetitive parts in reads * [7]Seeing masked stretches in assembly editors * [8]Examples for hash statistics + [9]Caveat: -SK:bph + [10]Sanger sequencing, a simple bacterium + [11]454 Sequencing, a somewhat more complex bacterium + [12]Solexa sequencing, E.coli MG1655 + [13](NEED EXAMPLES FOR EUKARYOTES) + [14](NEED EXAMPLES FOR PATHOLOGICAL CASES) Getting 'mean' genomes assembled For some genomes you might want to assemble, MIRA will take too long or the available memory will not be sufficient. This can be the case for eukaryotes, plants, but also for some bacteria which contain high number of (pro-)phages, plasmids or engineered operons. This guide is intended to get you through these problematic genomes. It is (cannot be) exhaustive, but it should get you going. For the impatient Use -SK:mnr=yes:nrr=10 and give it a try. If that does not work, decrease -SK:nrr to anywhere between 5 and 9. If it worked well enough increase the -SK:nrr parameter up to 15 or 20. But please also read on to see how to choose the "nrr" threshold. Introduction to 'masking' The SKIM phase (all-against-all comparison) will report almost every potential hit to be checked with Smith-Waterman further downstream in the MIRA assembly process. While this is absolutely no problem for most bacteria, some genomes (eukaryotes, plants, some bacteria) have so many closely related sequences (repeats) that the data structures needed to take up all information might get much larger than your available memory. In those cases, your only chance to still get an assembly is to tell the assembler it should disregard extremely repetitive features of your genome. There is, in most cases, one problem: one doesn't know beforehand which parts of the genome are extremely repetitive. But MIRA can help you here as it produces most of the needed information during assembly and you just need to choose a threshold from where on MIRA won't care about repetitive matches. The key to this are the two fail-safe command line parameters which will mask "nasty" repeats from the quick overlap finder (SKIM): -SK:mnr and -SK:nrr=10. -SK:bph also plays a role in this, but I'll come back to this later). How does "nasty repeat" masking work? If switched on -SK:mnr=yes, MIRA will use SKIM3 k-mer statistics to find repetitive stretches. K-mers are nucleotide stretches of length k. In a perfectly sequenced genome without any sequencing error and without sequencing bias, the k-mer frequency can be used to assess how many times a given nucleotide stretch is present in the genome: if a specific k-mer is present as many times as the average frequency of all k-mers, it is a reasonable assumption to estimate that the specific k-mer is not part of a repeat (at least not in this genome). Following the same path of thinking, if a specific k-mer frequency is now two times higher than the average of all k-mers, one would assume that this specific k-mer is part of a repeat which occurs exactly two times in the genome. For 3x k-mer frequency, a repeat is present three times. Etc.pp. MIRA will merge information on single k-mers frequency into larger 'repeat' stretches and tag these stretches accordingly. Of course, low-complexity nucleotide stretches (like poly-A in eukaryotes), sequencing errors in reads and non-uniform distribution of reads in a sequencing project will weaken the initial assumption that a k-mer frequency is representative for repeat status. But even then the k-mer frequency model works quite well and will give a pretty good overall picture: most repeats will be tagged as such. Note that the parts of reads tagged as "nasty repeat" will not get masked per se, the sequence will still be present. The stretches dubbed repetitive will get the "MNRr" tag. They will still be used in Smith-Waterman overlaps and will generate a correct consensus if included in an alignment, but they will not be used as seed. Some reads will invariably end up being completely repetitive. These will not be assembled into contigs as MIRA will not see overlaps as they'll be completely masked away. These reads will end up as debris. However, note that MIRA is pretty good at discerning 100% matching repeats from repeats which are not 100% matching: if there's a single base with which repeats can be discerned from each other, MIRA will find this base and use the k-mers covering that base to find overlaps. Selecting a "nasty repeat ratio" The ratio from which on the MIRA SKIM algorithm won't report matches is set via -SK:nrr. E.g., using -SK:nrr=10 will hide all k-mers which occur at a frequency 10 times (or more) higher than the median of all k-mers. The nastiness of a repeat is difficult to judge, but starting with 10 copies in a genome, things can get complicated. At 20 copies, you'll have some troubles for sure. The standard values of 10 for the -SK:nrr parameter is a pretty good 'standard' value which can be tried for an assembly before trying to optimize it via studying the hash statistics calculated by MIRA. For the later, please read the section 'Examples for hash statistics' further down in this guide. Finding out repetitive parts in reads If -SK:mnr=yes is used, MIRA will write an additional file into the log directory: _int_skimmarknastyrepeats_nastyseq_preassembly.0.lst The "nastyseq" file makes it possible to try and find out what makes sequencing data nasty. It's a key-value file with the name of the sequence as "key" and the nasty sequence as "value". "Nasty" in this case means "everything which was masked via -SKmnr=yes. The file looks like this: sequence1 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ... sequence2 CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ... sequence3 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ... etc. One will need to search some databases with the "nasty" sequences and find vector sequences, adaptor sequences or even human sequences in bacterial or plant genomes ... or vice versa as this type of contamination happens quite easily with data from new sequencing technologies. After a while one gets a feeling what constitutes the largest part of the problem and one can start to think of taking countermeasures like filtering, clipping, masking etc. Seeing masked stretches in assembly editors During SKIM phase, MIRA will assign frequency information to each and every k-mer of all reads in a sequencing project, giving them different status. Additionally, tags are set in the reads so that one can assess reads in assembly editors that understand tags (like gap4, gap5, consed etc.). The following tags are used: HAF2 below average (<0.5 times average) HAF3 average (>=0.5 times average and <= 1.5 times average) HAF4 above average (>1.5 times average and < 2 times average) HAF5 repeat (>=2 times average and < 5 times average) HAF6 'crazy' repeat (> 5 times average) MNRr stretches which were masked away by -SK:mnr=yes Examples for hash statistics Selecting the right ratio so that an assembly fits into your memory is not straight forward. But MIRA can help you a bit: during assembly, some frequency statistics are printed out (they'll probably end up in some info file in later releases). Search for the term "Hash statistics" in the information printed out by MIRA (this happens quite early in the process) Caveat: -SK:bph Some explanation how bph affects the statistics and why it should be chosen >=16 for -SK:mnr Sanger sequencing, a simple bacterium This example is taken from a pretty standard bacterium where Sanger sequencing was used: Hash statistics: ========================================================= Measured avg. coverage: 15 Deduced thresholds: ------------------- Min normal cov: 7 Max normal cov: 23 Repeat cov: 29 Crazy cov: 120 Mask cov: 150 Repeat ratio histogram: ----------------------- 0 475191 1 5832419 2 181994 3 6052 4 4454 5 972 6 4 7 8 14 2 16 10 ========================================================= The above can be interpreted like this: the expected coverage of the genome is 15x. Starting with an estimated hash frequency of 29, MIRA will treat a k-mer as 'repetitive'. As shown in the histogram, the overall picture of this project is pretty healthy: * only a small fraction of k-mers have a repeat level of '0' (these would be k-mers in regions with quite low coverage or k-mers containing sequencing errors) * the vast majority of k-mers have a repeat level of 1 (so that's non- repetitive coverage) * there is a small fraction of k-mers with repeat level of 2-10 * there are almost no k-mers with a repeat level >10 454 Sequencing, a somewhat more complex bacterium Here's in comparison a profile for a more complicated bacterium (454 sequencing): Hash statistics: ========================================================= Measured avg. coverage: 20 Deduced thresholds: ------------------- Min normal cov: 10 Max normal cov: 30 Repeat cov: 38 Crazy cov: 160 Mask cov: 0 Repeat ratio histogram: ----------------------- 0 8292273 1 6178063 2 692642 3 55390 4 10471 5 6326 6 5568 7 3850 8 2472 9 708 10 464 11 270 12 140 13 136 14 116 15 64 16 54 17 54 18 52 19 50 20 58 21 36 22 40 23 26 24 46 25 42 26 44 27 32 28 38 29 44 30 42 31 62 32 116 33 76 34 80 35 82 36 142 37 100 38 120 39 94 40 196 41 172 42 228 43 226 44 214 45 164 46 168 47 122 48 116 49 98 50 38 51 56 52 22 53 14 54 8 55 2 56 2 57 4 87 2 89 6 90 2 92 2 93 2 1177 2 1181 2 ========================================================= The difference to the first bacterium shown is pretty striking: * first, the k-mers in repeat level 0 (below average) is higher than the k-mers of level 1! This points to a higher number of sequencing errors in the 454 reads than in the Sanger project shown previously. Or at a more uneven distribution of reads (but not in this special case). * second, the repeat level histogram does not trail of at a repeat frequency of 10 or 15, but it has a long tail up to the fifties, even having a local maximum at 42. This points to a small part of the genome being heavily repetitive ... or to (a) plasmid(s) in high copy numbers. Should MIRA ever have problems with this genome, switch on the nasty repeat masking and use a level of 15 as cutoff. In this case, 15 is OK to start with as a) it's a bacterium, it can't be that hard and b) the frequencies above level 5 are in the low thousands and not in the tens of thousands. Solexa sequencing, E.coli MG1655 Hash statistics: ========================================================= Measured avg. coverage: 23 Deduced thresholds: ------------------- Min normal cov: 11 Max normal cov: 35 Repeat cov: 44 Crazy cov: 184 Mask cov: 0 Repeat ratio histogram: ----------------------- 0 1365693 1 8627974 2 157220 3 11086 4 4990 5 3512 6 3922 7 4904 8 3100 9 1106 10 868 11 788 12 400 13 186 14 28 15 10 16 12 17 4 18 4 19 2 20 14 21 8 25 2 26 8 27 2 28 4 30 2 31 2 36 4 37 6 39 4 40 2 45 2 46 8 47 14 48 8 49 4 50 2 53 2 56 6 59 4 62 2 63 2 67 2 68 2 70 2 73 4 75 2 77 4 ========================================================= This hash statistics shows that MG1655 is pretty boring (from a repetitive point of view). One might expect a few repeats but nothing fancy: The repeats are actually the rRNA and sRNA stretches in the genome plus some intergenic regions. * the k-mers number in repeat level 0 (below average) is considerably lower than the level 1, so the Solexa sequencing quality is pretty good respectively there shouldn't be too many low coverage areas. * the histogram tail shows some faint traces of possibly highly repetitive k-mers, but these are false positive matches due to some standard Solexa base-calling weaknesses of earlier pipelines like, e.g., adding poly-A, poly-T or sometimes poly-C and poly-G tails to reads when spots in the images were faint and the base calls of bad quality (NEED EXAMPLES FOR EUKARYOTES) (NEED EXAMPLES FOR PATHOLOGICAL CASES) Vector contamination etc. -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html