Vishnu,
Two things -
First - don't expect Oracle to do the perfect thing just because you can see an
obvious improvement.
Second - if you spot an anomaly, see if you can reproduce it in the simplest
possible way.
So (1st) In this case the "generic case" that Oracle is using is that an IN
subquery (or EXISTS) can be unnested and used as an inline view with distinct
to drive a join. The fact that your example is guaranteed to give Oracle a
unique set is a little special and the optimizer could, in principle, derive
this fact by sufficiently close examination of the inline view - but that means
adding a piece of complexity to the optimizer that might be irrelevant in
almost all cases (generally if you run an IN subquery it's probably because you
expect there to be multiple rows for each input and you want to stop after 1).
(2nd) Isolate the principle of the inline view, try this:
select distinct a.student_id from students a, temp b where a.student_id in
(1,2,3,4,5,6,7,8) and b.student_id = a.student_id;
It's "obvious" that this is going to generate a list of unique values even if
the "distinct" wasn't there, so the optimizer ought to avoid a hash/sort unique.
Here's the execution plan (19.3)
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |
A-Time | Buffers |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 2 (100)| 8
|00:00:00.01 | 15 |
| 1 | SORT UNIQUE NOSORT | | 1 | 1 | 2 (50)| 8
|00:00:00.01 | 15 |
| 2 | NESTED LOOPS SEMI | | 1 | 8 | 1 (0)| 8
|00:00:00.01 | 15 |
| 3 | INLIST ITERATOR | | 1 | | | 8
|00:00:00.01 | 5 |
|* 4 | INDEX UNIQUE SCAN| T_PK | 8 | 8 | 1 (0)| 8
|00:00:00.01 | 5 |
|* 5 | INDEX UNIQUE SCAN | ST_PK | 8 | 8 | 0 (0)| 8
|00:00:00.01 | 10 |
-----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
4 - access(("B"."STUDENT_ID"=1 OR "B"."STUDENT_ID"=2 OR "B"."STUDENT_ID"=3 OR
"B"."STUDENT_ID"=4 OR "B"."STUDENT_ID"=5 OR "B"."STUDENT_ID"=6 OR
"B"."STUDENT_ID"=7 OR
"B"."STUDENT_ID"=8))
5 - access("B"."STUDENT_ID"="A"."STUDENT_ID")
filter(("A"."STUDENT_ID"=1 OR "A"."STUDENT_ID"=2 OR "A"."STUDENT_ID"=3 OR
"A"."STUDENT_ID"=4 OR "A"."STUDENT_ID"=5 OR "A"."STUDENT_ID"=6 OR
"A"."STUDENT_ID"=7 OR
"A"."STUDENT_ID"=8))
(Students is "create as select rownum student_id, ao.* from all_objects where
rownum <= 50000", temp is your definition.)
Note, by the way, that the optimizer has used transitive closure to copy the
inlist from students to temp. In general this is reasonable, in this specific
case (with the PK declared) it's pointless but another little enhancement that
would require a lot of careful coding and testing to avoid errors in all cases.
You've probably spotted that Oracle has used a sort unique here instead of a
hash unique - it's just a minor detail in the costing arithmetic. You'll also
notice that the cardinality estimate for the unique data is very silly. The
latter is a severe limitation of the optimizer in the general case (and another
indication that it hasn't tried to look for the uniqueness conditions that hold
in this query)
It's a sad fact that it's easy to find examples where the optimizer doesn't do
the best thing, or gets the numbers wrong - sometimes it's a case of the
historical momentum; sometimes it's due to using a generic approach for a
problem that has a few special cases; sometimes it's because the statistics
that have been gathered don't describe the data well; sometimes it's simply
because the problem is too hard. And sometimes it's a bug - in which case if
you can create a very simple reproducible demo it's nice to let Oracle Corp.
know.
If you ever need it in a production there's a
"no_use_hash_aggregation(@qb_name)" hint that you could use in your examples to
stop Oracle from using a hash unique when you want it to do a sort unique.
Regards
Jonathan Lewis
P.S. If you do come across other things that look like defects it's best to
see if you can create a model that uses pure SQL (or PL/SQL) because it's hard
enough to persuade first-line support to run SQL scripts, let alone do
something with Java. (And I've never bothered to acquaint myself with setting
up java on my sandbox, so I wouldn't be able to recreate you data without a lot
of hassle that I don't need.)
________________________________________
From: Vishnu Potukanuma <vishnupotukanuma@xxxxxxxxx>
Sent: 08 January 2020 10:51
To: Jonathan Lewis
Cc: Oracle L
Subject: Re: reverse key index -- Sorting expected?
Hi Jonathan,
I completely agree with you regarding the optimisation that oracle performs
when it comes to using hashing or sort unique or sometimes no hashing or sort
unique.. but the following still bugs me and cannot find a suitable explanation
as to why this is happening ----- if the query in the IN - clause uses join -
it is using hash unique.... and we cannot prevent it even though if it is not
required in some cases.....
considering the same example as above:
I created two additional tables temp and temp2 as follows:
create table temp as select student_id from students where student_id < 100;
create table temp2 as select student_id from students where student_id < 100;
alter table temp add constraint temp_pk primary key (student_id);
alter table temp2 add constraint temp2_pk primary key (student_id);
gathered statistics.... even histograms as well..
for all the below statements....
given every possible way....
select student_id from students where student_id in (select
a.student_id from temp a, temp2 b where a.student_id = b.student_id and
a.student_id in (1,2,3,4,5,6,7,8,9,10) and b.student_id in
(1,2,3,4,5,6,7,8,9,10))
Plan hash value: 1363404670
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)|
E-Time | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 2 (100)|
| 10 |00:00:00.01 | 12 | | | |
| 1 | NESTED LOOPS | | 1 | 1 | 16 | 2 (50)|
00:00:01 | 10 |00:00:00.01 | 12 | | | |
| 2 | VIEW | VW_NSO_1 | 1 | 1 | 13 | 1 (0)|
00:00:01 | 10 |00:00:00.01 | 8 | | | |
| 3 | HASH UNIQUE | | 1 | 1 | 6 | | |
10 |00:00:00.01 | 8 | 2294K| 2294K| 1041K (0)|
| 4 | NESTED LOOPS SEMI | | 1 | 1 | 6 | 1 (0)|
00:00:01 | 10 |00:00:00.01 | 8 | | | |
| 5 | INLIST ITERATOR | | 1 | | | | |
10 |00:00:00.01 | 4 | | | |
|* 6 | INDEX UNIQUE SCAN| TEMP_PK | 10 | 10 | 30 | 1
(0)| 00:00:01 | 10 |00:00:00.01 | 4 | | | |
|* 7 | INDEX UNIQUE SCAN | TEMP2_PK | 10 | 1 | 3 | 0
(0)| | 10 |00:00:00.01 | 4 | | | |
|* 8 | INDEX UNIQUE SCAN | STUDENT_IDX | 10 | 1 | 3 | 0
(0)| | 10 |00:00:00.01 | 4 | | | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
select student_id from students where student_id in (select
a.student_id from temp a, temp2 b where a.student_id = b.student_id and
a.student_id in (1,2,3,4,5,6,7,8,9,10))
Plan hash value: 1363404670
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)|
E-Time | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 2 (100)|
| 10 |00:00:00.01 | 17 | | | |
| 1 | NESTED LOOPS | | 1 | 1 | 18 | 2 (50)|
00:00:01 | 10 |00:00:00.01 | 17 | | | |
| 2 | VIEW | VW_NSO_1 | 1 | 1 | 13 | 1 (0)|
00:00:01 | 10 |00:00:00.01 | 8 | | | |
| 3 | HASH UNIQUE | | 1 | 1 | 6 | | |
10 |00:00:00.01 | 8 | 2294K| 2294K| 1044K (0)|
| 4 | NESTED LOOPS SEMI | | 1 | 1 | 6 | 1 (0)|
00:00:01 | 10 |00:00:00.01 | 8 | | | |
| 5 | INLIST ITERATOR | | 1 | | | | |
10 |00:00:00.01 | 4 | | | |
|* 6 | INDEX UNIQUE SCAN| TEMP_PK | 10 | 10 | 30 | 1
(0)| 00:00:01 | 10 |00:00:00.01 | 4 | | | |
|* 7 | INDEX UNIQUE SCAN | TEMP2_PK | 10 | 1 | 3 | 0
(0)| | 10 |00:00:00.01 | 4 | | | |
|* 8 | INDEX UNIQUE SCAN | STUDENT_IDX | 10 | 1 | 5 | 0
(0)| | 10 |00:00:00.01 | 9 | | | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
select student_id from students where student_id in (select
a.student_id from temp a, temp2 b where a.student_id = b.student_id and
a.student_id < 10)
Plan hash value: 2866610440
----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)|
E-Time | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 2 (100)| |
9 |00:00:00.01 | 14 | | | |
| 1 | NESTED LOOPS | | 1 | 8 | 144 | 2 (50)| 00:00:01 |
9 |00:00:00.01 | 14 | | | |
| 2 | VIEW | VW_NSO_1 | 1 | 8 | 104 | 1 (0)| 00:00:01 |
9 |00:00:00.01 | 5 | | | |
| 3 | HASH UNIQUE | | 1 | 8 | 48 | | | 9
|00:00:00.01 | 5 | 2294K| 2294K| 1047K (0)|
| 4 | NESTED LOOPS SEMI | | 1 | 8 | 48 | 1 (0)| 00:00:01 |
9 |00:00:00.01 | 5 | | | |
|* 5 | INDEX RANGE SCAN | TEMP_PK | 1 | 9 | 27 | 1 (0)|
00:00:01 | 9 |00:00:00.01 | 1 | | | |
|* 6 | INDEX UNIQUE SCAN| TEMP2_PK | 9 | 9 | 27 | 0 (0)|
| 9 |00:00:00.01 | 4 | | | |
|* 7 | INDEX UNIQUE SCAN | STUDENT_IDX | 9 | 1 | 5 | 0 (0)|
| 9 |00:00:00.01 | 9 | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------------
select student_id from students where student_id in ( select a.student_id from
temp a, temp2 b where a.student_id = b.student_id and a.student_id between 1
and 10 and b.student_id between 1 and 10)
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9 | 144 | 2 (50)| 00:00:01 |
| 1 | NESTED LOOPS | | 9 | 144 | 2 (50)| 00:00:01 |
| 2 | VIEW | VW_NSO_1 | 9 | 117 | 1 (0)| 00:00:01 |
| 3 | HASH UNIQUE | | 9 | 54 | | |
| 4 | NESTED LOOPS SEMI | | 9 | 54 | 1 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | TEMP_PK | 10 | 30 | 1 (0)| 00:00:01 |
|* 6 | INDEX UNIQUE SCAN| TEMP2_PK | 10 | 30 | 0 (0)| 00:00:01 |
|* 7 | INDEX UNIQUE SCAN | STUDENT_IDX | 1 | 3 | 0 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
we know that the values are unique.... when it comes to estimated rows....
NESTED LOOPS SEMI ---> it always thinks/estimates that there is 1 row less...
This is one aspect of it...
another aspect of what i was especially concerned is the use of SORT - ORDER
BY... just adding a ORDER BY STUDENT_ID to the main statement, we always end up
sorting the results which can be explained with use of HASH UNIQUE... it is
using HASH UNIQUE even for very small resultsets
----------------------------------------------------------
Plan hash value: 2349408382
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9 | 54 | 2 (50)| 00:00:01 |
| 1 | SORT ORDER BY | | 9 | 54 | 2 (50)| 00:00:01 |
| 2 | NESTED LOOPS | | 9 | 54 | 2 (50)| 00:00:01 |
| 3 | VIEW | VW_NSO_1 | 9 | 27 | 1 (0)| 00:00:01 |
| 4 | HASH UNIQUE | | 9 | 54 | | |
| 5 | NESTED LOOPS SEMI | | 9 | 54 | 1 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | TEMP_PK | 10 | 30 | 1 (0)| 00:00:01 |
|* 7 | INDEX UNIQUE SCAN| TEMP2_PK | 10 | 30 | 0 (0)| 00:00:01 |
|* 8 | INDEX UNIQUE SCAN | STUDENT_IDX | 1 | 3 | 0 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
if it had done SORT UNIQUE here instead of hash unique considering the cost..
we could have easily avoided the additional PGA related operation while sorting
the final results from the students table...
if we notice it uses pga both the cases... sort order by and hash unique....
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)|
E-Time | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 2 (100)|
| 10 |00:00:00.01 | 9 | | | |
| 1 | SORT ORDER BY | | 1 | 9 | 54 | 2 (50)|
00:00:01 | 10 |00:00:00.01 | 9 | 2048 | 2048 | 2048 (0)|
| 2 | NESTED LOOPS | | 1 | 9 | 54 | 2 (50)|
00:00:01 | 10 |00:00:00.01 | 9 | | | |
| 3 | VIEW | VW_NSO_1 | 1 | 9 | 27 | 1 (0)|
00:00:01 | 10 |00:00:00.01 | 5 | | | |
| 4 | HASH UNIQUE | | 1 | 9 | 54 | | |
10 |00:00:00.01 | 5 | 2294K| 2294K| 1064K (0)|
| 5 | NESTED LOOPS SEMI | | 1 | 9 | 54 | 1 (0)|
00:00:01 | 10 |00:00:00.01 | 5 | | | |
|* 6 | INDEX RANGE SCAN | TEMP_PK | 1 | 10 | 30 | 1
(0)| 00:00:01 | 10 |00:00:00.01 | 1 | | | |
|* 7 | INDEX UNIQUE SCAN| TEMP2_PK | 10 | 10 | 30 | 0
(0)| | 10 |00:00:00.01 | 4 | | | |
|* 8 | INDEX UNIQUE SCAN | STUDENT_IDX | 10 | 1 | 3 | 0
(0)| | 10 |00:00:00.01 | 4 | | | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Please do let me know if I missed something... regarding this entire thing...
Thanks,
Vishnu
On Tue, Jan 7, 2020 at 4:32 PM Jonathan Lewis
<jonathan@xxxxxxxxxxxxxxxxxx<mailto:jonathan@xxxxxxxxxxxxxxxxxx>> wrote:
example of hash function that kicks in during execution:
select roll from randomload where roll in (select a.roll from temp a, temp b
where a.roll = b.roll) // this case sorting doesn't happen - only hashing to
eliminate duplicates...
select roll from randomload where roll in (select a.roll from temp) // both
sorting and hashing to eliminate the duplicates .