Re: Which one is good in terms of performance

From: Tim Gorman <tim@xxxxxxxxx>
To: "'Oracle-L (E-mail)" <oracle-l@xxxxxxxxxxxxx>
Date: Fri, 24 Dec 2004 08:23:17 -0700
Seema and Wolfgang,
Here is some info I generated recently, to add to the discussion...

My apologies for the length of the post, but feel free to delete if not
interested...


Accuracy when calculating table statistics with different sample sizes:
> 
These are the cumulated averaged results from comparing statistics gathered
using both DBMS_STATS and ANALYZE TABLE against the baseline of performing a
COMPUTE using DBMS_STATS.  For the first round, I concentrated on table
statistics, and ignored column statistics (a.k.a. histograms) by using ³FOR
ALL COLUMNS SIZE 1².
 
I tested against four major tables in the Oracle E-Business Suite ERP
system, of varying sizes (from small to humongous).

For each of the table statistics gathered for the cost-based optimizer (i.e.
the columns NUM_ROWS, BLOCKS, and AVG_ROW_LEN in DBA_TABLES), displayed is
the percentage deviation from the baseline of a COMPUTE using DBMS_STATS:
 
                                                   Nbr of
                                                   tables % Deviation %
Deviation % Deviation
LABEL                                             sampled Nbr of Blks Nbr of
Rows Avg Row Len
-------------------------------------------------- ------- -----------
----------- -----------
DBMS_STATS: COMPUTE, FOR ALL COLUMNS SIZE 1              4       0.00
0.00        0.00
DBMS_STATS: 10.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.02        0.00
DBMS_STATS: 05.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.03        0.00
DBMS_STATS: 01.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.10        0.00
DBMS_STATS: 50.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.22        0.00
DBMS_STATS: 10.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
0.15        0.11
DBMS_STATS: 00.010000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.46        0.00
DBMS_STATS: 20.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.58        0.00
DBMS_STATS: 05.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
0.91        0.11
DBMS_STATS: 00.000001%, FOR ALL COLUMNS SIZE 1           4       0.00
1.12        0.00
DBMS_STATS: 20.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
1.35        0.00
DBMS_STATS: 00.010000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
1.47        0.00
DBMS_STATS: 00.000001% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
1.58        0.00
DBMS_STATS: 50.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
2.03        0.00
DBMS_STATS: 01.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
4.61        0.00
ANALYZE: 05.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
0.42        7.56
ANALYZE: COMPUTE, FOR ALL COLUMNS SIZE 1                 4       0.00
0.00        8.34
ANALYZE: 50.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
0.00        8.34
ANALYZE: 20.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
0.62        7.73
ANALYZE: 10.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
1.44        7.15
ANALYZE: 01.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
7.24        8.62
ANALYZE: 00.010000%, FOR ALL COLUMNS SIZE 1              4       0.00
7.24        8.62
ANALYZE: 00.000001%, FOR ALL COLUMNS SIZE 1              4       0.00
7.24        8.62
 
The sorting in this report is by the summarization of the three percentage
columns, from "most accurate" results to "least accurate" results.
 
A couple things to note about the results shown here:

1. The value in BLOCKS is always completely accurate, regardless of sample
size, because the value is not calculated at all, but is obtained directly
from the segment header
2. The very worst calculations from the DBMS_STATS package are (apparently)
more accurate than the very best calculations from ANALYZE command
> * The biggest cause seems to be vastly different results on AVG_ROW_LEN
>> * However, the method of calculating AVG_ROW_LEN by DBMS_STATS is easily
>> exposed by performing SQL trace
>>> * Method is:  select avg(vsize(col-1)) + avg(vsize(col-2)) + ... +
>>> avg(vsize(col-N)) from table-name ...
> * The method of calculating AVG_ROW_LEN by the ANALYZE command cannot be
> determined using SQL trace
3. The worst deviation with the results from DBMS_STATS when calculating
NUM_ROWS is 4.61%, while the worst deviation with the results from ANALYZE
is almost twice that, at 7.24%
4. The results from ANALYZE TABLE ... SAMPLE 1 PERCENT, SAMPLE 0.01 PERCENT,
and SAMPLE 0.000001 PERCENT are exactly the same, indicating the lowest
valid sampling value for the ANALYZE command is 1 PERCENT
> * DBMS_STATS allows 0.000001% sampling
> * DBMS_STATS also permits the use of BLOCK SAMPLING, which ANALYZE does not
 
Accuracy when calculating index statistics with different sample sizes:

In contrast, where even the very worst results obtained by the smallest
sample of table statistics is just 7.24% off from performing a COMPUTE
operation, the results from using smaller samples when calculating index
statistics get much worse very quickly as the sampling size decreases.

Here is the results of testing against the nine indexes of the
FND_CONCURRENT_REQUESTS table, which vary widely in terms of density, data
characteristics, etc.  The results are being sorted by the summation of the
percentage columns, so that the "most accurate" shows first and the "least
accurate" shows last...

                               Nbr of                             % Dev'n  %
Dev'n
                              indexes % Dev'n   % Dev'n   % Dev'n Avg Leaf
Avg Data % Dev'n       % Dev'n
LABEL                         sampled  BLevel Leaf Blks Dist Keys Blks/Key
Blks/Key Clu Fac # Idx Entries
------------------------------ ------- ------- --------- --------- --------
-------- ------- -------------
DBMS_STATS: INDEX COMPUTE           9    0.00      0.00      0.00     0.00
0.00   0.00          0.00
DBMS_STATS: INDEX 05.000000%        9    0.00      0.00     22.13    14.85
16.40   0.44          0.00
DBMS_STATS: INDEX 20.000000%        9    0.00      0.00     25.39    24.29
24.22   0.44          0.00
DBMS_STATS: INDEX 10.000000%        9    0.00      0.00     25.56    16.61
33.89   0.44          0.00
DBMS_STATS: INDEX 00.010000%        9    0.00      0.00     27.45    20.13
37.91   0.44          0.00
DBMS_STATS: INDEX 50.000000%        9    0.00      0.00     26.42    21.53
38.65   0.44          0.00
DBMS_STATS: INDEX 01.000000%        9    0.00      0.00     31.17    27.90
28.18   0.44          0.00
DBMS_STATS: INDEX 00.000001%        9    0.00      0.00     27.81    22.46
40.37   0.44          0.00
ANALYZE: INDEX COMPUTE              9    0.00    595.30     0.00    239.88
0.00   0.00          0.00
ANALYZE: INDEX 50.000000%           9    0.00    596.27     23.55   329.31
13.85  17.29         17.60
ANALYZE: INDEX 01.000000%           9    8.33    530.13     32.28   353.86
23.04  32.87         34.08
ANALYZE: INDEX 00.010000%           9    8.33    530.13     32.28   353.86
23.04  32.87         34.08
ANALYZE: INDEX 00.000001%           9    8.33    530.13     32.28   353.86
23.04  32.87         34.08
ANALYZE: INDEX 20.000000%           9    0.00    596.27     32.97   551.43
58.40  19.22         19.51
ANALYZE: INDEX 05.000000%           9    0.00    596.27     40.77   854.46
59.40  24.67         23.23
ANALYZE: INDEX 10.000000%           9    0.00    596.27     40.86   792.72
141.73  29.25         29.07

First of all, there is the huge deviation the calculations of LEAF_BLOCKS by
DBMS_STATS and by ANALYZE. Each command (DBMS_STATS or ANALYZE) is
consistent, so one must be wrong and the other right.  Unfortunately, in
this case, it is the DBMS_STATS package that is uniformly wrong on this,
because the results produced by the ANALYZE INDEX commands are corroborated
by similar results from the ANALYZE INDEX ... VALIDATE STRUCTURE command,
the DBMS_SPACE package, and the values stored in the DBA_SEGMENTS view.  I
searched MetaLink for a bug report related to DBMS_STATS and LEAF_BLOCKS,
but could not find one.  I intend to retest this against 9.2.0.5 and
10.1.0.3, when I get a chance.  I¹ll also test it against 9.2.0.1...

A few more things to note about the results shown here:

1. DBMS_STATS.GATHER_INDEX_STATS and ANALYZE INDEX produce the exact same
results when COMPUTEing
> * except the LEAF_BLOCKS anomaly
>> * which also apparently causes an anomaly with AVG_LEAF_BLOCKS_PER_KEY
2. the ANALYZE INDEX ... COMPUTE results are corroborated by other
utilities, so the ANALYZE INDEX command is more accurate in v9.2.0.4 than
the DBMS_STATS.GATHER_INDEX_STATS command
3. The results from both DBMS_STATS and ANALYZE are wildly worse when
performing an estimated sample, instead of compute

That last point is the most important.  Nonetheless, since the ANALYZE INDEX
... COMPUTE command is clearly the most accurate, here is the same result
data, this time using the results from ANALYZE INDEX .. COMPUTE STATISTICS
as the baseline instead of DBMS_STATS:
> 
                               Nbr of                             % Dev'n  %
Dev'n
                              indexes % Dev'n   % Dev'n   % Dev'n Avg Leaf
Avg Data % Dev'n       % Dev'n
LABEL                         sampled  BLevel Leaf Blks Dist Keys Blks/Key
Blks/Key Clu Fac # Idx Entries
------------------------------ ------- ------- --------- --------- --------
-------- ------- -------------
ANALYZE: INDEX COMPUTE              9    0.00     0.00      0.00    0.00
0.00    0.00          0.00
ANALYZE: INDEX 50.000000%           9    0.00     0.13     23.55    15.18
13.85   17.29         17.60
DBMS_STATS: INDEX COMPUTE           9    0.00    85.04      0.00    41.30
0.00    0.00          0.00
DBMS_STATS: INDEX 05.000000%        9    0.00    85.04     22.13    38.50
16.40    0.44          0.00
DBMS_STATS: INDEX 20.000000%        9    0.00    85.04     25.39    36.70
24.22    0.44          0.00
ANALYZE: INDEX 01.000000%           9    8.33     8.42     32.28    36.11
23.04   32.87         34.08
ANALYZE: INDEX 00.010000%           9    8.33     8.42     32.28    36.11
23.04   32.87         34.08
ANALYZE: INDEX 00.000001%           9    8.33     8.42     32.28    36.11
23.04   32.87         34.08
DBMS_STATS: INDEX 01.000000%        9    0.00    85.04     31.17    36.24
28.18    0.44          0.00
DBMS_STATS: INDEX 10.000000%        9    0.00    85.04     25.56    38.07
33.89    0.44          0.00
ANALYZE: INDEX 20.000000%           9    0.00     0.13     32.97    53.44
58.40   19.22         19.51
DBMS_STATS: INDEX 50.000000%        9    0.00    85.04     26.42    37.22
38.65    0.44          0.00
DBMS_STATS: INDEX 00.010000%        9    0.00    85.04     27.45    37.53
37.91    0.44          0.00
DBMS_STATS: INDEX 00.000001%        9    0.00    85.04     27.81    37.05
40.37    0.44          0.00
ANALYZE: INDEX 05.000000%           9    0.00     0.13     40.77   107.16
59.40   24.67         23.23
ANALYZE: INDEX 10.000000%           9    0.00     0.13     40.86    95.99
141.73   29.25         29.07

Again, this perspective shows the two types of COMPUTE largely agreeing with
each other (except for the LEAF_BLOCKS and the AVG_LEAF_BLOCKS_PER_KEY value
derived from it), with all other results showing dramatically less accurate
due to decreases in the sample size.

For indexes, clearly computing yields the best results, while a bug in
DBMS_STATS appears to be skewing the LEAF_BLOCK results in v9.2.0.4. making
the ANALYZE INDEX command the best alternative for that version.

So, from this data, which is admittedly sparse and limited (needs more
testing on a wider range of versions and applications), I think that one can
conclude:

* Perform the gathering of table/column and index statistics separately from
one another
> * Gather statistics on tables and columns in one operation using
> DBMS_STATS.GATHER_TABLE_STATS
>> * leave the CASCADE parameter at the default of FALSE
>> * use a low ESTIMATE_PERCENT and even use BLOCK_SAMPLE=>TRUE, if necessary
> * Gather statistics on indexes in another operation to either ANALYZE INDEX or
> DBMS_STATS.GATHER_INDEX_STATS
>> * only do COMPUTE
* There is a bug in DBMS_STATS.GATHER_INDEX_STATS in 9.2.0.4 regarding the
calculation of LEAF_BLOCKS that needs to be explored further...

From this, I don¹t think that one can conclude that it is better to ANALYZE
INDEX ... COMPUTE instead of
DBMS_STATS.GATHER_INDEX_STATS(ESTIMATE_PERCENT=>NULL), because I don¹t know
if the anomaly found exists outside of 9.2.0.4...

Hope this helps...

-Tim


on 12/23/04 1:26 PM, Wolfgang Breitling at breitliw@xxxxxxxxxxxxx wrote:

> first off, whatever you do, DO NOT use 2) dbms_utility
> 
> What I do is:
> a) DO NOT rely on Oracle's "staleness" algorithm but decide myself which
> tables need to be analyzed and at what frequency
> b) DO NOT use "for all indexed columns" but decide myself which columns
> require a histogram and with how many buckets. Those columns might very
> well include non-indexed ones and will certainly not include all indexed
> columns. I have an example where the creation of histograms on indexed
> columns led to a batch job taking an estimated 18+ hours (if we had had
> the patience to let it finish instead of killing it after 6 hours)
> instead of the ~90 seconds without the histograms
> c) use the dbms_stats procedures
> d) use estimate_percent=>dbms_stats.auto_sample-size, cascade=>false,
> method_opt=>'for all columns size 1' (i.e. the defaults for the latter
> two parameters) for tables
> e) follow with gather_index_stats with estimate_percent=>null (i.e. full
> compute) for indexes
> f) follow with gather_table_stats with estimate_percent=> (i.e. full
> compute) for individual column histograms.


--
//www.freelists.org/webpage/oracle-l
Follow-Ups:
- Re: Which one is good in terms of performance
  - From: Wolfgang Breitling
- Re: Which one is good in terms of performance
  - From: Jared Still
Re: Which one is good in terms of performance

Other related posts: