Re: Periodic Stats Collection -- CBO Stats Myth?

  • From: Tim Gorman <tim@xxxxxxxxx>
  • To: Sriram Kumar <k.sriramkumar@xxxxxxxxx>
  • Date: Fri, 03 Oct 2008 13:12:44 -0600

Sriram,

If I had time and wanted to find out, I would SQL trace (level 12) a run of a DBMS_STATS.GATHER_TABLE_STATS job with ESTIMATE_PERCENT=>10 and METHOD_OPT=>'FOR COLUMNS <column-name>', and find the underlying SELECT statement that DBMS_STATS runs against the column in question.  Then I'd copy/paste the text of that SQL statement into SQL*Plus, run it, and see what data the query returns.  That would show the same information that DBMS_STATS uses to calculate the "high value" for the column.
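As a rough sketch of the kind of thing I mean (untested; T and TRADE_DATE are just placeholder table/column names):

ALTER SESSION SET tracefile_identifier = 'stats_trace';
ALTER SESSION SET events '10046 trace name context forever, level 12';

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'T',
    estimate_percent => 10,
    method_opt       => 'FOR COLUMNS TRADE_DATE');
END;
/

ALTER SESSION SET events '10046 trace name context off';

-- then look in the trace file (or its tkprof output) for the recursive
-- SELECT that DBMS_STATS ran against T, and re-run that statement in
-- SQL*Plus to see the sample it used to derive the column's high value.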

Although I don't have time for this right now, perhaps you do?  If so, would you post your results back to the list?

If I had to guess, I would say that sampling will result in inaccurate values for the column high-value.  Just how inaccurate is part of the game theory we all play when deciding between COMPUTE STATISTICS and ESTIMATE STATISTICS.  But that's just my guess...

Thanks!

-Tim



Sriram Kumar wrote:
Hi Tim,
 
Thanks for your inputs; much appreciated.
 
I had one more question. If we are sampling (let's say 10 percent), how does the stats gatherer estimate the high value, especially for date columns?
 
Best Regards
 
Sriram Kumar
 


 
On Wed, Oct 1, 2008 at 11:20 AM, Tim Gorman <tim@xxxxxxxxx> wrote:
For changing just the high-value column-level statistics, don't waste time and resources re-gathering stats on the table; instead, set them manually using DBMS_STATS.SET_COLUMN_STATS.  You can query the current SYSDATE value or the NEXTVAL of the sequence in question to find the current high value, so obtaining the values to set is cheap.

You'll first have to use the PREPARE_COLUMN_VALUES procedure to load the histogram endpoint(s) into PL/SQL collections for the SET_COLUMN_STATS call, so there is some moderately involved PL/SQL coding with collections and record types.
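Something along these lines might serve as a starting point (untested; T, TRADE_DATE and the DATE literal standing in for the real low value are placeholders; only the endpoints change, and the other column stats are read back and re-used):

DECLARE
  l_distcnt NUMBER;
  l_density NUMBER;
  l_nullcnt NUMBER;
  l_avgclen NUMBER;
  l_srec    DBMS_STATS.STATREC;
  l_dates   DBMS_STATS.DATEARRAY;
BEGIN
  -- read the existing column stats so only the endpoints are replaced
  DBMS_STATS.GET_COLUMN_STATS(
    ownname => USER, tabname => 'T', colname => 'TRADE_DATE',
    distcnt => l_distcnt, density => l_density, nullcnt => l_nullcnt,
    srec    => l_srec,    avgclen => l_avgclen);

  -- keep whatever the real low value is; set the high value to "now"
  l_dates       := DBMS_STATS.DATEARRAY(DATE '2008-01-01', SYSDATE);
  l_srec.epc    := 2;
  l_srec.bkvals := NULL;
  DBMS_STATS.PREPARE_COLUMN_VALUES(l_srec, l_dates);

  DBMS_STATS.SET_COLUMN_STATS(
    ownname => USER, tabname => 'T', colname => 'TRADE_DATE',
    distcnt => l_distcnt, density => l_density, nullcnt => l_nullcnt,
    srec    => l_srec,    avgclen => l_avgclen);
END;
/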

Hope this helps...



Sriram Kumar wrote:
Hi William/Andrew,
 
Thanks for your valuable inputs.
 
I totally agree with you, William, on the potential impact of outdated low_value and high_value on date columns and columns generated from sequences.
 
I have a follow-up question on how this data is populated. If I had gathered with an estimate percent of, say, 10%, how does the CBO arrive at the high value?
 
Best Regards
 
Sriram Kumar
 


 
On Wed, Oct 1, 2008 at 6:54 AM, William Robertson <william@xxxxxxxxxxxxxxxxxxxx> wrote:
I've found that dates are more likely to present a problem, because sequentially generated keys are usually highly selective anyway. Say a financial application loads three million trades a day. If the stats are not updated, the optimizer will think a given TRADE_ID has 0 occurrences (since it is above the column's known high value) and round that up to 1, which may well be correct anyway. However, when you run a query for this week's trades on a Friday and the optimizer thinks there are none when in fact there are 15 million, you have a much bigger problem. That's my experience anyway.
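A quick way to see the effect (TRADES and TRADE_DATE below are just illustrative names):

-- what the optimizer currently believes the column's range to be
SELECT low_value, high_value, last_analyzed
FROM   user_tab_col_statistics
WHERE  table_name = 'TRADES' AND column_name = 'TRADE_DATE';

-- cardinality estimate for a predicate beyond the stored high value
EXPLAIN PLAN FOR
  SELECT * FROM trades WHERE trade_date >= TRUNC(SYSDATE) - 7;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- if the stats are months old, the Rows estimate collapses towards 1
-- even though the table really holds millions of rows for last week.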


2008/10/1 Niall Litchfield <niall.litchfield@xxxxxxxxx>
Dates and sequence-driven key columns both exhibit this behaviour.


On 30/09/2008, William Robertson <william@xxxxxxxxxxxxxxxxxxxx> wrote:
> I'm not sure how using or not using histograms makes any fundamental
> difference to your approach.
>
> One thing that comes to mind is the use of date predicates. If the
> optimizer knows the highest sale date you have (for example) is 30 Sept
> 2008, you are fine today and maybe some of next week, but in a few
> months' time any queries for "last week's sales" may get unrealistically
> low cardinality estimates leading to inappropriate index/nested loop
> access paths, Cartesian joins and so on.
>
> If you use partitioning and you have date columns as
> partition/subpartition keys you could have even more fun in store.
>
>
> -----Original message-----
> From: Sriram Kumar
> Date: 30/9/08 22:41
>> Hi Folks,
>>
>>        We had migrated an OLTP/Batch hybrid 9i RBO application to 10g
>> and, after rounds of tuning, the application is now working at its best.
>> We had experimented with various sampling rates for statistics for
>> various tables and now I feel we are at an optimal point of performance
>> where the CBO is picking up the right indexes.
>>
>> 1) We do not use histograms
>> 2) From here on, the application volumes are bound to increase proportionately
>> 3) No new tables would be added in production.
>> 4) As of now, we have locked the schema stats for the application
>> schemas. The default scheduler job collects stats for all other schemas
>>
>> Since the increase in volume of data is quite proportional to the
>> existing data, what would be the compelling reason to periodically
>> recollect statistics for the tables that already have good execution
>> plans?
>>
>> e.g. let's say there is a table with a million records and a few
>> indexes. We have collected the stats for this table and its indexes at
>> a million records, and all the access paths to this table have been
>> validated to confirm that the CBO is picking up the correct, optimal
>> index in every case.
>>
>> Now the volume of the table increases to 2 million and, from the
>> application's point of view, the access paths that were valid for 1
>> million would be valid for 2 million as well.
>>
>> Given this scenario, is it required to collect statistics at 2
>> million as well? I believe it would not be required, but some of my
>> colleagues feel that we should collect statistics periodically, and I
>> am not able to get a valid reason for periodically recollecting the
>> stats. Is this one of the myths?
>>
>> I would agree to periodically recollect stats if we were using
>> histograms, but we are not. Is there any other reason that would
>> require periodic statistics gathering?



-- //www.freelists.org/webpage/oracle-l
