RE: index columns

From: "Mark W. Farnham" <mwf@xxxxxxxx>
To: <jonathan@xxxxxxxxxxxxxxxxxx>, <oracle-l@xxxxxxxxxxxxx>
Date: Sun, 19 Apr 2015 11:41:42 -0400

Nice thread. A side note regarding when a highly dominant value is present,
be it 99/1 or 80/2/2/2/2/2/2/2/2/2/2 or <you get the idea>,

then it is often useful to define the physical storage of the dominant value
as NULL. (You can interpret a mapping of NULL to a value notwithstanding
that the RDBMS must not without an NVL call).

When this can usefully be done a few things happen:

1) If it is a single column index, the index becomes relatively tiny AND
the CBO automatically has to avoid the index when searching for NULL,
because NULLs are not even in the index. Thus you either get a more
appropriate index selection or an appropriate full table or partition scan
set scan.

2) For multi-column indexes the compression of this highly repeated value
is finessed by it being zero in length and so the highly selective column
can be leftward promoted without harm (subject to otherwise being useful
regarding cluster factor, which might influence your choice regarding an
index of three or more columns).

Anyone who interprets this as an "always" rule skipped over the caveat "when
this can usefully be done" which is inherently dependent on the texture of
your data and the predicates actually used in your queries. When it IS a
good thing, it tends to be a VERY good thing, so it is worth CONSIDERING
when designing both data models and when the details of index choices are
decided.

mwf

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
On Behalf Of Jonathan Lewis
Sent: Saturday, April 18, 2015 2:35 PM
To: oracle-l@xxxxxxxxxxxxx
Subject: RE: index columns

Mohamed,

Your question is exactly the point I was making about Tim's example of
yes_or_no with a 99/1 split and a nearly unique account_id. Just because
there is a massive skew in one column that doesn't mean you can see ANY skew
in the combination when the other column has a large number of distinct
values.

In particular - Oracle 11g has a limit of 255 columns in a frequency
histogram, 12c has a limit of 2,000 (or maybe 2,047), so your example with
one column having 2,712 distinct values CLEARLY can't get a frequency
histogram when you combine it with another column.

If BOTH columns had individual histograms the histogram on your vpk_id
column would have to be height-balanced (or top-N, or hybrid) and might be
identifying one or more extremely popular columns - in which case a popular
vpk_id value might frequently coincide with a popular layer_code, so that a
height-balanced/top-N/hybrid histogram on the column group could identify
the special combination. But since vpk_id doesn't have (need) a histogram
then the combination can't need a histogram.

Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
@jloracle

_____

From: Mohamed Houri [mohamed.houri@xxxxxxxxx]
Sent: 18 April 2015 15:50
To: Jonathan Lewis
Cc: oracle-l@xxxxxxxxxxxxx
Subject: Re: index columns

When all other things being equal, pushing the less selective columns toward
the leading edge of the index in order to have a better index
compressibility might be an option.

Below is the conclusion of an article I wrote for all things Oracle about
index design

<http://allthingsoracle.com/index-design-discard-and-sort/>
http://allthingsoracle.com/index-design-discard-and-sort/

Conclusion

Engineering an index should be dictated first by the query predicate part
(where clause, group by and order by). Look carefully to the column you will
use as the leading edge of the index. They should be the ones on which an
equality predicate is applied. You should also have a "kill two birds with
one stone" design strategy as far as with one index you can cover multiple
queries, <http://hourim.wordpress.com/2014/03/24/redundant-indexes/> avoid
redundant indexes and cover the foreign key lock threat. Do not forget the
benefit an indexed virtual column could have on helping the CBO make good
guesses (estimations) and producing attractive small indexes.

If a switch in the column order is still able to guaranty the precision and
the use of the index then start your index with the column having the lowest
number of distinct values. As such you can efficiently compress your index
and give a CBO an extra possible path represented by the index skip scan.

As per regards to the column group extension created on columns having or
not histogram collected on them, I have an article to be published on that
topic which I have kept for several months waiting to validate one CBO
estimation I am still unable to figure out.

Christian article shows this

"In other words, it seems that the query optimizer bypasses if the extension
has no histogram AND histograms exist on the columns on which the extension
is based."

My article shows that if only one column of the group of columns forming the
extension has histogram the CBO will by pass it

1) no histogram at all

SQL> SELECT column_name, num_distinct, density, histogram

FROM user_tab_col_statistics

WHERE table_name = 'T_EXT_STAT'

AND column_name in
('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');

COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM

------------------------------ ------------ ---------- ---------------

VPK_ID 2712 .000368732 NONE

LAYER_CODE 4 .25 NONE

SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 4731 .000211372 NONE

SQL> select

count(1)

from

t_ext_stat

where vpk_id = 63148

and layer_code = 'R';

----------------------------------------------------------------------------
----

| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time
|

----------------------------------------------------------------------------
----

| 0 | SELECT STATEMENT | | 1 | | 1
|00:00:00.01 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1
|00:00:00.01 |

|* 2 | INDEX RANGE SCAN| T_EXT_UK_I | 1 | 172 | 338
|00:00:00.01 |

----------------------------------------------------------------------------
----

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

Access path analysis for T_EXT_STAT

***************************************

SINGLE TABLE ACCESS PATH

Single Table Cardinality Estimation for T_EXT_STAT[VPK]

SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1,
loc = 1 {EC(98564)[2, 3]}

SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE

Column (#2): VPK_ID(NUMBER)

AvgLen: 5 NDV: 2712 Nulls: 0 Density: 0.000000 Min: 0.000000 Max:
62849.000000

Column (#3): LAYER_CODE(VARCHAR2)

AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000

Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)

AvgLen: 12 NDV: 4731 Nulls: 0 Density: 0.000000

ColGroup (#2, Index) T_EXT_UK_I

Col#: 2 3 4 CorStregth: -1.00

ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1

Col#: 2 3 CorStregth: 2.29

ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002

Table: T_EXT_STAT Alias: VPK

Card: Original: 813541.000000 Rounded: 172 Computed: 171.96 Non
Adjusted: 171.96

***** Logdef predicate Adjustment ******

The corresponding 10053 trace file clearly shows in this case that the
extension has been used:

ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002

E-Rows = num_rows(t_ext_stat) * selectivity
(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1);

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1));

E-rows = 813541.000000 * 1/4731 = 171.959628 rounded to 172

2) with histogram

However, collecting histogram changed the CBO estimation

begin

dbms_stats.gather_table_stats

(user

,'t_ext_stat'

,method_opt => 'for all columns size auto'

,cascade => true

,no_invalidate => false

);

end;

/

SQL> SELECT column_name, num_distinct, density, histogram

FROM user_tab_col_statistics

WHERE table_name = 'T_EXT_STAT'

AND column_name in
('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');

COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM

------------------------------ ------------ ---------- ---------------

SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 4967 .000201329 NONE

LAYER_CODE 4 6.2471E-07 FREQUENCY

VPK_ID 2862 .000349406 NONE

The layer_code column, as expected, has been identified as a skewed column
and henceforth a frequency histogram has been gathered on it to indicate
this skewness. There is two remarks which seems to be worth pointing them
out

* Since one of the column group extension has a histogram why the
extension itself has not been identified as a skewed column as well
* What happens in this particular case where there is no histogram on
the extension and a histogram on one of the column forming the extension

select

count(1)

from

t_ext_stat

where vpk_id = 63148

and layer_code = 'R';

COUNT(1)

----------

338

SQL_ID d26ra17afbfyh, child number 0

-------------------------------------

----------------------------------------------------------------------------
-----

| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time
|

----------------------------------------------------------------------------
-----

| 0 | SELECT STATEMENT | | 1 | | 1
|00:00:00.01 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1
|00:00:00.01 |

|* 2 | INDEX RANGE SCAN| T_EXT_UK_I | 1 | 142 | 338
|00:00:00.01 |

----------------------------------------------------------------------------
-----

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

A slight degradation on the estimation occurred. Has the CBO used the
extension to compute the 142 estimated cardinality? I don't think so
otherwise it would have given the following estimation:

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))

E-rows = 803809 * 1/(4967) = 161.829877

In addition, the following line in the 10053 trace file

ColGroup Usage:: PredCnt: 2 Matches Full: Partial:

confirmsthat the extended column statistics has not been used. Otherwise we
would have observed, as shown in the first 10053 trace file above, the
following line

ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002

This in fact confirms that when one of the columns on which an extension has
been created has a histogram while the virtual column representing the
extension has no histogram then the extension will not be used by the CBO.
But why Oracle has not collected a frequency histogram on the extension when
it knows that one of the columns participating in the extension posses a
histogram? Does this means that a data skewness is not guaranteed for a
combination of two columns where one column is skewed and the other isn't?

I hope I will publish this article asap

Best regards

Mohamed Houri

Follow-Ups:
- Re: index columns
  - From: Mladen Gogala

References:
- index columns
  - From: Orlando L
- Re: index columns
  - From: Stefan Koehler
- Re: index columns
  - From: Tim Gorman
- RE: index columns
  - From: Jonathan Lewis
- Re: index columns
  - From: Mohamed Houri
- RE: index columns
  - From: Jonathan Lewis

RE: index columns

Other related posts: