Re: Density statistic calculation in case of histogram

  • From: Yong Huang <yong321@xxxxxxxxx>
  • To: neeraj.dba@xxxxxxxxx
  • Date: Tue, 7 Apr 2009 12:45:21 -0700 (PDT)

Neeraj,

The calculation of density in case of a histogram is documented in the US 
patent 6732085:
http://www.freepatentsonline.com/6732085.html
Jonathan Lewis's CBO book p.172 says the same: "sum of the square of the 
frequency of the nonpopular values / (number of nonnull rows * number of 
nonpopular nonull rows)", where "frequency" is called repetition count in the 
patent.

Intuitive understanding is beyond me. But it's not hard to apply the proposed 
formula. In your case of 5 bucket height-balanced histogram, it is:

bucket# 1    2    3    4    5
col_skew 1-10    10    10    10    10

10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2
-------------------------------------------- = 0.001
10000 * 90

So the correct answer is 0.001. If you run your test in 10.2.0.4, you'll get 
this number. In 10.2.0.1, you get 0.98209 for some reason. You would 
"accidentally" get that number if you applied the formula to your 10-bucket 
frequency (you call width-based) histogram:

bucket# 1    2    3    4    5    6    7    8    9    10
col_skew 1    2    3    4    5    6    7    8    9    10

10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+9910^2
--------------------------------------------------- = .98209
10000 * 10000

Yong Huang


Neeraj Bhatia wrote:

Version: 10.2.0.1.0

After reading an excellent paper "Histograms - Myths and Facts" by Wolfgang
Breitling, some points get cleared.

1)  In case of width-based histograms, density= 1/ (2*number of distinct
values)

Thus come the figure 0.00005

2) Without histograms, density should be equal to 1/number of distinct
values. But in my case, it is null. I am not using method_opt while
collecting statistics. The global default setting is FOR ALL INDEXED COLUMNS
SIZE 1.

3) In case of height-based histograms, I didn't get these calculations.

density = Σ cnt2 / ( num_rows˜ * Σ cnt )

“the sum of the squared frequencies of all non-popular values divided by the
sum of the frequencies of all non-popular values times the count of rows
with not null values of the histogram column�.



      
--
//www.freelists.org/webpage/oracle-l


Other related posts: