On Tuesday 04 August 2009 johann JOETS wrote:
> [...]
> Many thanks for your help. I use the mira_2.9.46_devlinux-gnu_x86_64
> version

Hi Johann,

I did have a look at those data sets ... and get totally different results (which might be because of changes in the V3rc1 line ... although I don't really believe that). But I do have quite a number of comments regarding the data and the way you calculated N50. I'll make a few opinionated statements, please don't take offense :-)

1) the simulated data is ... in a word ... worthless
----------------------------------------------------

The data "metasim" generated feels like it simulates the old GS20 reads ... and even the very first ones that 454 deposited in 2005: lots of errors all over the place, sometimes dozens per read. However, the technology has evolved. If a sequencing provider came to me nowadays with this quality, I would laugh at them, halt all payments and search for another provider.

I've attached two images to this mail:

a) The first image (100x.png) shows the gap4 alignment of the simulated 100x data set. It is literally riddled with errors and indels. Even though the location shown has only ~60x coverage, almost every third consensus base is an indel. At this point the remark Joseph made just a moment ago comes into play: that's too much for an assembler, as too many errors accumulate. Consequently, a lot of reads cannot align anymore and get shuffled into other contigs (small contigs without much coverage, what I call contig debris). The trimming routines of MIRA also have a fun time: more than half of the reads get an additional trim (usually that figure is in the 5 to 10% range at most).

b) The second image (bceti.png) is from actual, live FLX data from 2008. It's B. ceti data downloaded from the NCBI (SRR005481). Believe me, I had to search for almost a minute to find a place with more than one or two errors; a lot of stretches are simply perfect.
In summary, you should check whether metasim has a preset to simulate FLX or even Titanium data. If not, you should mail the authors and ask them to implement one ASAP, or to retract the software, as it simply has not represented the state of the art since late 2007 (I have FLX reads from back then which look exactly like the bceti reads from the NCBI 2008).

2) quality data is lacking
--------------------------

Sorry, I have to be adamant on this: nowadays, any genome assembly without quality values is a big no-no. Consequently, simulations should also provide quality values. MIRA, at the very least, relies on those values to tear apart wrongly assembled repeats, calculate a better consensus, etc.

Remember: every time someone assembles DNA sequences without using quality data, God kills a cute little kitten. You don't want that, do you?

3) Too simplistic N50 numbers
-----------------------------

Due to the overly high error rate of the simulated data, quite a few reads could not be assembled and ended up as contig debris. Technically speaking these are contigs ... practically they're junk and should be discarded. Here's why. Have a look at the following table, which was generated by sorting the MIRA info file "*_info_contigstats.txt" like this:

  sort -g -r -k 2 100X_sim454_d_info/100X_sim454_info_contigstats.txt

(This sorts the contigs by size in descending order. I've shortened the contig names and cut away the right part of the table to fit this into 80 columns for mail.)

# name   length  av.qual  #-reads  mx.cov.  av.cov  GC%    CnIUPAC  CnFunny
---------------------------------------------------------------------------
c1       150398  20       46088    117      66.96   45.98  1        0
c17        1340  19         103     35      18.23   46.90  1        0
c25        1077  18          62     37      13.99   46.43  0        0
c157       1004  20          36     17       8.56   43.37  1        0
c53         969  20          62     31      15.50   45.30  0        0
c14         939  19          60     41      15.73   46.22  0        0
c30         936  19          49     32      12.38   46.63  1        0
...

Observe that there is one large contig of 150 kb which MIRA built, with an average coverage of 66.96.
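As an aside, the separation of contig debris from real contigs can be sketched programmatically. The snippet below is only an illustration, not MIRA's actual rule: the column order is taken from the contigstats excerpt above, and the "coverage below 1/3 of the length-weighted project coverage" cutoff is an arbitrary choice of mine. The function names (read_contigstats, split_debris, n50) are made up for this sketch.

```python
def read_contigstats(path):
    """Parse a MIRA-style contigstats table into (name, length, avg_cov).

    Assumes the column order shown above:
    name, length, av.qual, #-reads, mx.cov., av.cov, ...
    """
    contigs = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(("#", "-")) or not line.strip():
                continue
            f = line.split()
            contigs.append((f[0], int(f[1]), float(f[5])))
    return contigs


def split_debris(contigs, frac=1 / 3):
    """Call a contig 'debris' if its average coverage falls below
    `frac` of the length-weighted project coverage (arbitrary cutoff)."""
    total_len = sum(length for _, length, _ in contigs)
    project_cov = sum(length * cov for _, length, cov in contigs) / total_len
    keep = [c for c in contigs if c[2] >= frac * project_cov]
    junk = [c for c in contigs if c[2] < frac * project_cov]
    return keep, junk


def n50(lengths):
    """Smallest contig length such that contigs at least that long
    together cover >= 50% of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Fed with the numbers from the table above, the 150 kb contig at ~67x dominates the weighted project coverage, so all the ~1 kb low-coverage contigs fall below the cutoff and the N50 of the kept set is 150398, in line with the info_assembly.txt figures further down.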
Now look at all the remaining "contigs": although some of them even reach 1000 bases, they all have a miserable average coverage, less than 1/3 or even 1/2 of the average coverage of the project. MIRA actually knows that this is junk and calculates its statistics accordingly: have a look at the file 100X_sim454_d_info/100X_sim454_info_assembly.txt

------------------------------------------------------------------------
Assembly information:
=====================

Num. reads assembled: 56115
Num. singlets: 198

Large contigs:
--------------
With    Contig size >= 500
        AND (Total avg. Cov >= 22
             OR Cov(san) >= 0
             OR Cov(454) >= 22
             OR Cov(sxa) >= 0
             OR Cov(sid) >= 0)

  Length assessment:
  ------------------
  Number of contigs:  2
  Total consensus:    151027
  Largest contig:     150398
  N50 contig size:    150398
  N90 contig size:    150398
  N95 contig size:    150398

  Coverage assessment:
  --------------------
  Max coverage (total): 117
  Max coverage
        Sanger:  0
        454:     121
        Solexa:  0
        Solid:   0
  Avg. total coverage (size >= 5000): 66.96
  Avg. coverage (contig size >= 5000)
        Sanger:  0.00
        454:     66.96
        Solexa:  0.00
        Solid:   0.00
...
------------------------------------------------------------------------

That is, MIRA takes only two contigs (the large one and one of the debris ones) into consideration as "real" data. The rest can safely be regarded as junk. This is also the reason why I pointed you to http://chevreux.org/uploads/media/mira3_usage.html#section_14 ("What to do with MIRA result files").

Looking back, you're not the first one to run into all these traps of assembly. I think I should change a few things in MIRA:

I) have MIRA stop immediately if no quality data is present ... except if the user forces MIRA to disregard this ("I know what I do"-style)

II) have result files split (yet once again) into different subfiles. Like Newbler: into "normal" results and "all results". Or perhaps into "normal results" and "junk results" to make it clear what the latter is.
III) add a documentation page on how to benchmark assemblers (which data to use, which tests, etc.) as well as what not to do.

Any thoughts?

Regards,
  Bastien
Attachment: 100x.png (PNG image)
Attachment: bceti.png (PNG image)