On Tuesday 04 August 2009 johann JOETS wrote:
> [...]
> Many thanks for your help. I use the mira_2.9.46_devlinux-gnu_x86_64
> version

Hi Johann,

I did have a look at those data sets ... and get totally different results (which might be because of changes in the V3rc1 line ... although I don't really believe that). But I do have quite a number of comments regarding the data and the way you calculated N50. I'll make a few opinionated statements, please don't take offense :-)

1) the simulated data is ... in a word ... worthless
----------------------------------------------------

The data "metasim" generated feels like it simulates the old GS20 reads ... and even the very first ones that 454 deposited in 2005: lots of errors all over the place, sometimes dozens per read. However, the technology has evolved. If a sequencing provider came to me nowadays with this quality, I would laugh at them, halt all payments and search for another provider.

I've attached two images to this mail:

a) The first image (100x.png) shows the gap4 alignment of the simulated 100x data set. It is literally riddled with errors and indels. Even though the location shown has only ~60x coverage, almost every third consensus base is an indel. At this point the remark Joseph made just a moment ago comes into play: that's too much for an assembler, as too many errors accumulate. Consequently, a lot of reads cannot align anymore and get shuffled into other contigs (small contigs without much coverage, what I call contig debris). The trimming routines of MIRA also have a fun time: more than half of the reads get an additional trim (usually that figure is in the 5 to 10% range at most).

b) The second image (bceti.png) is from actual, live FLX data from 2008. It's B. ceti data downloaded from the NCBI (SRR005481). Believe me, I had to search for almost a minute to find a place with more than one or two errors; a lot of stretches are simply perfect.
In summary, you should check whether metasim has a preset to simulate FLX or even Titanium data. If not, you should mail the authors and ask them to implement one ASAP, or to retract the software, as it simply has not represented the state of the art since late 2007 (I have FLX reads from back then which look exactly like the bceti reads from the NCBI 2008).

2) quality data is lacking
--------------------------

Sorry, I have to be adamant on this: nowadays, any genome assembly without quality values is a big no-no. Consequently, simulations should also provide quality values. MIRA, at the very least, relies on those values to tear apart wrongly assembled repeats, calculate a better consensus, etc.

Remember: every time someone assembles DNA sequences without using quality data, God kills a cute little kitten. You don't want that, do you?

3) Too simplistic N50 numbers
-----------------------------

Due to the overly high error rate of the simulated data, quite a few reads could not be assembled and ended up as contig debris. Technically speaking these are contigs ... practically they're junk and should be discarded. Here's why. Have a look at the following table, which was generated by sorting the MIRA info file "*_info_contigstats.txt" like this:

  sort -g -r -k 2 100X_sim454_d_info/100X_sim454_info_contigstats.txt

(This sorts the contigs by size in descending order. I've shortened the contig names and cut away the right part of the table to fit this into 80 columns for mail.)

# name   length  av.qual  #-reads  mx.cov.  av.cov  GC%    CnIUPAC  CnFunny
---------------------------------------------------------------------------
c1       150398  20       46088    117      66.96   45.98  1        0
c17        1340  19         103     35      18.23   46.90  1        0
c25        1077  18          62     37      13.99   46.43  0        0
c157       1004  20          36     17       8.56   43.37  1        0
c53         969  20          62     31      15.50   45.30  0        0
c14         939  19          60     41      15.73   46.22  0        0
c30         936  19          49     32      12.38   46.63  1        0
...

Observe that there is one large contig of 150 kb which MIRA built, with an average coverage of 66.96.
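As an aside, the separation of contig debris from real contigs can be sketched programmatically. The snippet below is only an illustration, not MIRA's actual rule: the column order is taken from the contigstats excerpt above, and the "coverage below 1/3 of the length-weighted project coverage" cutoff is an arbitrary choice of mine. The function names (read_contigstats, split_debris, n50) are made up for this sketch.

```python
def read_contigstats(path):
    """Parse a MIRA-style contigstats table into (name, length, avg_cov).

    Assumes the column order shown above:
    name, length, av.qual, #-reads, mx.cov., av.cov, ...
    """
    contigs = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(("#", "-")) or not line.strip():
                continue
            f = line.split()
            contigs.append((f[0], int(f[1]), float(f[5])))
    return contigs


def split_debris(contigs, frac=1 / 3):
    """Call a contig 'debris' if its average coverage falls below
    `frac` of the length-weighted project coverage (arbitrary cutoff)."""
    total_len = sum(length for _, length, _ in contigs)
    project_cov = sum(length * cov for _, length, cov in contigs) / total_len
    keep = [c for c in contigs if c[2] >= frac * project_cov]
    junk = [c for c in contigs if c[2] < frac * project_cov]
    return keep, junk


def n50(lengths):
    """Smallest contig length such that contigs at least that long
    together cover >= 50% of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Fed with the numbers from the table above, the 150 kb contig at ~67x dominates the weighted project coverage, so all the ~1 kb low-coverage contigs fall below the cutoff and the N50 of the kept set is 150398, in line with the info_assembly.txt figures further down.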
Now look at all the remaining "contigs": although some of them even reach 1000 bases, they all have a miserable average coverage, less than 1/3 or even 1/2 of the average coverage of the project. MIRA actually knows that this is junk and calculates its statistics accordingly: have a look at the file 100X_sim454_d_info/100X_sim454_info_assembly.txt

------------------------------------------------------------------------
Assembly information:
=====================

Num. reads assembled: 56115
Num. singlets: 198

Large contigs:
--------------
With    Contig size >= 500
        AND (Total avg. Cov >= 22
             OR Cov(san) >= 0
             OR Cov(454) >= 22
             OR Cov(sxa) >= 0
             OR Cov(sid) >= 0)

  Length assessment:
  ------------------
  Number of contigs:  2
  Total consensus:    151027
  Largest contig:     150398
  N50 contig size:    150398
  N90 contig size:    150398
  N95 contig size:    150398

  Coverage assessment:
  --------------------
  Max coverage (total): 117
  Max coverage
        Sanger:  0
        454:     121
        Solexa:  0
        Solid:   0
  Avg. total coverage (size >= 5000): 66.96
  Avg. coverage (contig size >= 5000)
        Sanger:  0.00
        454:     66.96
        Solexa:  0.00
        Solid:   0.00
...
------------------------------------------------------------------------

That is, MIRA takes only two contigs (the large one and one of the debris ones) into consideration as "real" data. The rest can safely be regarded as junk. This is also the reason why I pointed you to http://chevreux.org/uploads/media/mira3_usage.html#section_14 ("What to do with MIRA result files").

Looking back, you're not the first one to run into all these traps of assembly. I think I should change a few things in MIRA:

I) have MIRA stop immediately if no quality data is present ... except if the user forces MIRA to disregard this ("I know what I do"-style)

II) have result files split (yet once again) into different subfiles. Like Newbler: into "normal" results and "all results". Or perhaps into "normal results" and "junk results" to make it clear what the latter is.
III) add a documentation page on how to benchmark assemblers (which data to use, which tests, etc.) as well as what not to do.

Any thoughts?

Regards,
  Bastien
Attachment: 100x.png (PNG image)
Attachment: bceti.png (PNG image)