I have found this whole discussion very interesting, and I really
appreciate the technical side of the discussion. I have to admit that the
internals of stats is one of those items I really need to learn more about.
So, I was looking at this from the viewpoint of a user (or at least trying,
as we all know every DBA's live would be perfect without users), In my
experience, the leading column should be the most selective piece of data
that the user or program is likely to have available. Without knowing the
design of the application itself, it seems unlikely the user will know the
account id. On the other hand, if this is the second query, and the first
query to the database is some sort of reference table, it is highly likely
that a column such as account id is what is returned from that first query,
which would make it the most appropriate leading column.
On Sun, Apr 19, 2015 at 11:08 AM, Mohamed Houri <mohamed.houri@xxxxxxxxx>
wrote:
Thanks Jonathan ,
It's always a pleasure reading you.
In my article I continued doing this:
begin
dbms_stats.gather_table_stats
(user
,'t_ext_stat'
,method_opt => 'for columns vpk_id size skewonly'
,cascade => true
,no_invalidate => false
);
end;
/
SQL> SELECT column_name, num_distinct, density, histogram
FROM user_tab_col_statistics
WHERE table_name = 'T_EXT_STAT'
AND column_name in
('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');
COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM
------------------------------ ------------ ---------- ---------------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 4967 .000201329 NONE
LAYER_CODE 4 6.2471E-07 FREQUENCY
VPK_ID 246 .000375 HYBRID
As you have explained above, vpk_id got a HYBRID histogram and the
extension got no histogram again.
SQL> begin
dbms_stats.gather_table_stats
(user
,'t_ext_stat'
,method_opt => 'for all columns size skewonly'
,cascade => true
,no_invalidate => false
);
end;
/
SQL> SELECT column_name, num_distinct, density, histogram
2 FROM user_tab_col_statistics
3 WHERE table_name = 'T_EXT_STAT'
4 AND column_name in
('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');
COLUMN_NAME NUM_DISTINCT DENSITY
HISTOGRAM
------------------------------------------------------------------
----------------------- --------------- ---------------------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 5028 .000198886 NONE
LAYER_CODE 4
6.2846E-07 FREQUENCY
VPK_ID 246
.000351 HYBRID
Even when forcing a histogram on all skewed columns the extension has
not been found by Oracle to be skewed even when all the columns it is based
on have histogram (which confirm what you have said above that a skew in
the two columns forming the extension doesn’t necessarily mean that the
extension will be identified as skewed)
Best regards
Mohamed Houri
2015-04-18 20:34 GMT+02:00 Jonathan Lewis <jonathan@xxxxxxxxxxxxxxxxxx>:
Mohamed,
Your question is exactly the point I was making about Tim's example of
yes_or_no with a 99/1 split and a nearly unique account_id. Just because
there is a massive skew in one column that doesn't mean you can see ANY
skew in the combination when the other column has a large number of
distinct values.
In particular - Oracle 11g has a limit of 255 columns in a frequency
histogram, 12c has a limit of 2,000 (or maybe 2,047), so your example with
one column having 2,712 distinct values CLEARLY can't get a frequency
histogram when you combine it with another column.
If BOTH columns had individual histograms the histogram on your vpk_id
column would have to be height-balanced (or top-N, or hybrid) and might be
identifying one or more extremely popular columns - in which case a popular
vpk_id value might frequently coincide with a popular layer_code, so that a
height-balanced/top-N/hybrid histogram on the column group could identify
the special combination. But since vpk_id doesn't have (need) a histogram
then the combination can't need a histogram.
Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
@jloracle
------------------------------
*From:* Mohamed Houri [mohamed.houri@xxxxxxxxx]
*Sent:* 18 April 2015 15:50
*To:* Jonathan Lewis
*Cc:* oracle-l@xxxxxxxxxxxxx
*Subject:* Re: index columns
When all other things being equal, pushing the less selective columns
toward the leading edge of the index in order to have a better index
compressibility might be an option.
Below is the conclusion of an article I wrote for all things Oracle about
index design
*http://allthingsoracle.com/index-design-discard-and-sort/
<http://allthingsoracle.com/index-design-discard-and-sort/>*
Conclusion
*Engineering an index should be dictated first by the query predicate
part (where clause, group by and order by). Look carefully to the column
you will use as the leading edge of the index. They should be the ones on
which an equality predicate is applied. You should also have a **“kill
two birds with one stone”** design strategy as far as with one index you
can cover multiple queries, **avoid redundant indexes*
<http://hourim.wordpress.com/2014/03/24/redundant-indexes/>* and cover
the foreign key lock threat. Do not forget the benefit an indexed virtual
column could have on helping the CBO make good guesses (estimations) and
producing attractive small indexes.*
*If a switch in the column order is still able to guaranty the precision
and the use of the index then start your index with the column having the
lowest number of distinct values**. As such you can efficiently compress
your index and give a CBO an extra possible path represented by the **index
skip scan**.*
As per regards to the column group extension created on columns having or
not histogram collected on them, I have an article to be published on that
topic which I have kept for several months waiting to validate one CBO
estimation I am still unable to figure out.
Christian article shows this
“*In other words, it seems that the query optimizer bypasses if the
extension has no histogram AND **histograms exist on the columns** on
which the extension is based.”*
My article shows that if only *one* column of the group of columns
forming the extension has histogram the CBO will by pass it
*1) no histogram at all*
SQL> SELECT column_name, num_distinct, density, histogram
FROM user_tab_col_statistics
WHERE table_name = 'T_EXT_STAT'
AND column_name in
('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');
COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM
------------------------------ ------------ ---------- ---------------
VPK_ID 2712 .000368732 NONE
LAYER_CODE 4 .25 NONE
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 4731 .000211372 NONE
SQL> select
count(1)
from
t_ext_stat
where vpk_id = 63148
and layer_code = 'R';
--------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
A-Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1
|00:00:00.01 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1
|00:00:00.01 |
|* 2 | INDEX RANGE SCAN| T_EXT_UK_I | 1 | 172 | 338
|00:00:00.01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')
Access path analysis for T_EXT_STAT
***************************************
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for T_EXT_STAT[VPK]
SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1,
loc = 1 {EC(98564)[2, 3]}
SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE
Column (#2): VPK_ID(NUMBER)
AvgLen: 5 NDV: 2712 Nulls: 0 Density: 0.000000 Min: 0.000000 Max:
62849.000000
Column (#3): LAYER_CODE(VARCHAR2)
AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000
Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)
AvgLen: 12 NDV: 4731 Nulls: 0 Density: 0.000000
ColGroup (#2, Index) T_EXT_UK_I
Col#: 2 3 4 CorStregth: -1.00
ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
Col#: 2 3 CorStregth: 2.29
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002
Table: T_EXT_STAT Alias: VPK
Card: Original: 813541.000000 Rounded: 172 Computed: 171.96 Non
Adjusted: 171.96
***** Logdef predicate Adjustment ******
The corresponding 10053 trace file clearly shows in this case that the
extension has been used:
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002
E-Rows = num_rows(t_ext_stat) * selectivity (
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1);
E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1));
E-rows = 813541.000000 * 1/*4731* = 171.959628 rounded to 172
*2) with histogram*
However, collecting histogram changed the CBO estimation
begin
dbms_stats.gather_table_stats
(user
,'t_ext_stat'
,method_opt => 'for all columns size auto'
,cascade => true
,no_invalidate => false
);
end;
/
SQL> SELECT column_name, num_distinct, density, histogram
FROM user_tab_col_statistics
WHERE table_name = 'T_EXT_STAT'
AND column_name in
('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');
COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM
------------------------------ ------------ ---------- ---------------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 4967 .000201329 NONE
LAYER_CODE 4 6.2471E-07 FREQUENCY
VPK_ID 2862 .000349406 NONE
The *layer_code* column, as expected, has been identified as a skewed
column and henceforth a frequency histogram has been gathered on it to
indicate this skewness. There is two remarks which seems to be worth
pointing them out
-
Since one of the column group extension has a histogram why the
extension itself has not been identified as a skewed column as well
-
What happens in this particular case where there is no histogram on
the extension and a histogram on one of the column forming the extension
select
count(1)
from
t_ext_stat
where vpk_id = 63148
and layer_code = 'R';
COUNT(1)
----------
338
SQL_ID d26ra17afbfyh, child number 0
-------------------------------------
---------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
A-Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1
|00:00:00.01 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1
|00:00:00.01 |
|* 2 | INDEX RANGE SCAN| T_EXT_UK_I | 1 | 142 | 338
|00:00:00.01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')
A slight degradation on the estimation occurred. Has the CBO used the
extension to compute the 142 estimated cardinality? I don't think so
otherwise it would have given the following estimation:
E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))
E-rows = 803809 * 1/(4967) = 161.829877
In addition, the following line in the 10053 trace file
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
confirmsthat the extended column statistics has not been used. Otherwise
we would have observed, as shown in the first 10053 trace file above, the
following line
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002
This in fact confirms that when *one* of the columns on which an
extension has been created has a histogram while the virtual column
representing the extension has no histogram then the extension will not be
used by the CBO. But why Oracle has not collected a frequency histogram on
the extension when it knows that one of the columns participating in the
extension posses a histogram? Does this means that a data skewness is not
guaranteed for a combination of two columns where one column is skewed and
the other isn't?
I hope I will publish this article asap
Best regards
Mohamed Houri
--
Houri Mohamed
Oracle DBA-Developer-Performance & Tuning
Member of Oraworld-team <http://www.oraworld-team.com/>
Visit My - Blog <http://www.hourim.wordpress.com/>
Let's Connect -
<http://fr.linkedin.com/pub/mohamed-houri/11/329/857/>*Linkedin
Profile <http://fr.linkedin.com/pub/mohamed-houri/11/329/857/>*
My Twitter <https://twitter.com/MohamedHouri> - MohamedHouri
<https://twitter.com/MohamedHouri>