Re: sane number of the table partitions in DWH

  • From: "Milen Kulev" <makulev@xxxxxxx>
  • To: Greg Rahn <greg@xxxxxxxxxxxxxxxxxx>
  • Date: Wed, 01 Apr 2009 14:05:40 +0200

Hello Greg,
many thanks for your reply.
Due to a lack of time I really hadn't provided many details.
Now the details: this is an 8-node RAC cluster (Oracle version 10.2.0.4, OS = SuSE 9 SP4).
All nodes have between 64-128 GB RAM and 16-32 CPUs.

A typical SQL statement looks like this (a very simplified version; dimension
tables and many GROUP BY expressions are left out, since they add very little
to the total response time. All the tables (1 fact, 2 dimensions) are joined
together via HASH joins):

SELECT 
     t1.project_id project_1,
     t1.productgroup_id  productgroup_id_d, 
      t1.period_id period_id_d,
      t1.basechannel_id basechannel_id_d, 
      t1.outlet_id outlet_id_d, 
     MAX(t1.pd_distribution_factor) distributionfactor,
     SUM(t1.pd_sales_units*t1.pd_projection_factor*t1.pd_price_units_eur)  
salesvaluefactor
 FROM
     DWH_RB.rb_fact_pd_out_itm t1
 WHERE
      t1.period_id IN
     
(20070199999030,20070299999030,20070399999030,20070499999030,20070599999030,20070699999030,20070799999030,20070899999030,20070899999060,20070999999030,20071099999030,20071099999060,20071199999030,20071299999030)
   AND t1.country_id IN
     
(15,12,14,13,85,26,16,18,19,86,17,23,25,48,30,87,29,82,49,901,908,51,52,53,45)
   AND t1.domain_productgroup_id IN (32647,32672) 
   AND t1.project_id IN
     
(42175,42377,42495,42496,42497,42498,42500,42502,42514,42517,42519,42521,42523,42525,42531,42532,42550,42552,42556,42559,42644,42646,42647,42649,42651,42653,42667,42668,42753,42754,42836,42838,42853,42859,43030,43031,43943,43944,43964,43966,56248,56249,59669,59670,70615,70618,70620,70623,70625,70627,71220,71221)
   AND t1.productgroup_id IN
     
(15647,15672,12582,12672,14982,14672,20809,20803,24191,24194,26101,26172,16583,16617,18081,18672,19420,19672,17672,23291,23293,25883,25617,48647,48672,30491,30472,87101,87172,29391,29672,82172,82101,49672,49647,901101,901172,17283,908647,908672,51101,51172,52101,52172,53101,53172,45581,45585)
   AND t1.flg_status_id IN ('1','2','4') AND t1.project_type_id IN ('1','3')
 GROUP BY
     t1.project_id,
      t1.productgroup_id, 
      t1.period_id,
      t1.basechannel_id, 
      t1.outlet_id
       ; 

There are single-column bitmap indexes on the following columns:
domain_productgroup_id
country_id
period_id
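
For reference, the index definitions look roughly like this (a sketch only: the
index names are illustrative, not the real ones; bitmap indexes on a partitioned
table have to be LOCAL):

CREATE BITMAP INDEX dwh_rb.rb_fact_bix_period
  ON dwh_rb.rb_fact_pd_out_itm (period_id) LOCAL NOLOGGING;

CREATE BITMAP INDEX dwh_rb.rb_fact_bix_country
  ON dwh_rb.rb_fact_pd_out_itm (country_id) LOCAL NOLOGGING;

CREATE BITMAP INDEX dwh_rb.rb_fact_bix_dompg
  ON dwh_rb.rb_fact_pd_out_itm (domain_productgroup_id) LOCAL NOLOGGING;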

The table DWH_RB.rb_fact_pd_out_itm is the fact table (~800-900 GB at the moment).
This fact table is partitioned as follows (some parts of the definition are left
out, since they are irrelevant to the problem):

 CREATE TABLE "DWH_RB"."RB_FACT_PD_OUT_ITM"
   (    "PROJECT_ID" NUMBER NOT NULL ENABLE,
        "PROJECT_TYPE_ID" NUMBER NOT NULL ENABLE,
        "PERIOD_ID" NUMBER NOT NULL ENABLE,
        "PERIODICITY_ID" NUMBER NOT NULL ENABLE,
        "OUTLET_ID" NUMBER NOT NULL ENABLE,
        "BASECHANNEL_ID" NUMBER NOT NULL ENABLE,
        "COUNTRY_CHANNEL_ID" NUMBER NOT NULL ENABLE,
        "CHANNEL_ID" NUMBER NOT NULL ENABLE,
        "COUNTRY_ID" NUMBER NOT NULL ENABLE,
        "ITEM_ID" NUMBER NOT NULL ENABLE,
        "ITEM_VERSION_ID" NUMBER NOT NULL ENABLE,
        "PRODUCTGROUP_ID" NUMBER NOT NULL ENABLE,
        "DOMAIN_PRODUCTGROUP_ID" NUMBER NOT NULL ENABLE,
                ......
   ) 
  PARTITION BY RANGE ("PERIOD_ID")
  SUBPARTITION BY HASH ("PROJECT_ID")
  SUBPARTITION TEMPLATE (
    SUBPARTITION "S1",
    SUBPARTITION "S2",
    SUBPARTITION "S3",
    ....
    ....
So the partitioning scheme is range -> hash.
PERIOD_ID is the time dimension.
PROJECT_ID is a kind of artificial key, a grouping of DOMAIN_PRODUCTGROUP_ID and product_type_id.
The data volume is expected to grow by 30-40% each year.
The biggest problems (in my opinion) are:
1) Customer expectations:
   The database design is that of a typical DWH application (star schema).
   The customers of my customer expect response times in the 10-20 second range (at the moment
   the response time for such a typical SQL statement (like the one above) is ~200-1000 seconds,
   depending on the amount of data processed).
2) Data volatility:
   This is an operational DWH database. Data is loaded almost continuously (mainly
   INSERTs + UPDATEs).
   Bitmap indexes are getting bloated/corrupt and must be rebuilt often (a rebuild
   sketch follows after this list).
   Furthermore, data in the fact tables cannot be pre-aggregated and/or compressed
   (because of the continuous data changes).
3) Many customers are accessing this resource base and different customers need
   different levels of aggregation (the SQL statement above is a kind of template;
   sometimes even more predicates are generated, sometimes the IN lists are
   longer/shorter, but the core structure of the statement is the same).
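
As mentioned under 2) above, the index maintenance boils down to periodic rebuilds;
a minimal sketch (using the illustrative index name from above and a hypothetical
partition name, not the real ones):

ALTER INDEX dwh_rb.rb_fact_bix_period
  REBUILD PARTITION p_200712 NOLOGGING;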

Here is an excerpt of the 10046 event trace file (processed with the OraSRP profiler):

Session Flat Profile
==================================================
                                                                             
                                                               --------- Time Per Call ---------
Event Name                            % Time       Seconds      Calls        Avg        Min        Max
------------------------------------  ------  ------------  ---------  ---------  ---------  ---------
db file sequential read                47.1%     412.9400s    240,657    0.0017s    0.0000s    0.3119s
gc cr grant 2-way                      27.1%     237.9957s    225,795    0.0010s    0.0000s    0.1638s
FETCH calls [CPU]                      11.2%      98.3720s     42,273    0.0023s    0.0000s   96.9783s
SQL*Net message from client             7.6%      66.4010s     42,190    0.0015s    0.0005s   11.0692s
unaccounted-for time                    4.1%      35.6414s
SQL*Net message from client [idle]      1.9%      16.7582s          3    5.5861s    2.7191s    9.9124s
gc current block 3-way                  0.6%       5.5963s      2,438    0.0022s    0.0002s    0.0223s
gc cr grant congested                   0.2%       1.5404s        382    0.0040s    0.0013s    0.0237s
  
The average access times "db file sequential read" and "gc cr grant 2-way" are 
~ 1msm so the performance (in term of 
latencies) is OK. 

The idea is to have many partitions, so that more granular partition pruning can occur.
Each partition should then be read via FTS using multiblock reads (MBRC = 16).
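
To verify that the pruning really kicks in, the plan of the template statement can
be checked for the Pstart/Pstop columns; a sketch (the statement is a trimmed-down
version of the one above):

EXPLAIN PLAN FOR
SELECT /*+ FULL(t1) */ SUM(t1.pd_sales_units)
  FROM dwh_rb.rb_fact_pd_out_itm t1
 WHERE t1.period_id IN (20070199999030, 20070299999030)
   AND t1.project_id IN (42175, 42377);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- Pstart/Pstop show which range partitions and hash subpartitions are touched.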

The biggest concern is the parse time. At the moment the fact table has ~390 partitions.
My customer claims that sometimes the SQL statements hang in the parse phase. I couldn't
verify this (it never happened while I was on site at the customer), but the client
suspects it *could* have something to do with the number of partitions.
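
Two things that could be checked to pin this down (a sketch; the first part run
as a DBA, the second from the session that issues the report SQL):

-- Current number of range partitions and hash subpartitions of the fact table:
SELECT COUNT(*) AS range_partitions
  FROM dba_tab_partitions
 WHERE table_owner = 'DWH_RB' AND table_name = 'RB_FACT_PD_OUT_ITM';

SELECT COUNT(*) AS hash_subpartitions
  FROM dba_tab_subpartitions
 WHERE table_owner = 'DWH_RB' AND table_name = 'RB_FACT_PD_OUT_ITM';

-- Time spent in (hard) parsing for the current session (VALUE is in microseconds):
SELECT stat_name, ROUND(value / 1e6, 2) AS seconds
  FROM v$sess_time_model
 WHERE sid = SYS_CONTEXT('USERENV', 'SID')
   AND stat_name IN ('parse time elapsed', 'hard parse elapsed time');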

Now my comments on your comments:

"
 Multi-column partitioning can work, however it behaves different when
> it comes to partition elimination based on the type of partitioning
> and how many of the partitioning keys are present in the predicate in
> the query.  T
"

It is no problem to supply all the partitioning keys as predicates. The application
accessing the database is written in-house and can be changed accordingly. What cannot
be changed is the data volume and the business customers' expectations (max. 20 sec
response time).
 

"
> Based on these numbers I don't really see the benefit for the large
> number of partitions.  A 4TB table with 200K partitions means 20MB
> partitions (assuming even distribution).  That is quite small for a
> partition size.  I would say too small to warrant significant benefit.
>  And since you mention the fact tables are up to 1TB, well, that means
> single MB or smaller for each partition segment.  That is probably too
> small to be overly useful.
"

That is exactly the point! Each partition should be small, so that after partition pruning
each partition read is 20-100 MB. Note that to satisfy a query, many (hopefully not too many)
partitions (each ~20-100 MB) must be read, preferably using PQ. At the moment no PQ occurs.
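
For what it is worth, forcing PQ for a test run could look like this (a sketch;
the DOP of 8 is arbitrary and would have to be tested, and the predicates are
shortened from the template above):

SELECT /*+ FULL(t1) PARALLEL(t1, 8) */
       t1.project_id,
       SUM(t1.pd_sales_units * t1.pd_projection_factor * t1.pd_price_units_eur) salesvaluefactor
  FROM dwh_rb.rb_fact_pd_out_itm t1
 WHERE t1.period_id IN (20070199999030, 20070299999030)
   AND t1.country_id IN (15, 12, 14)
 GROUP BY t1.project_id;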

"
> I would suggest range(date) with a granule size of 1 day because that
> tends to be the unit that business reports frequently use.  Unless you
> are either 1) doing fact-fact joins and want to leverage
> partition-wise joins or 2) single key access via location_id and not
> using indexes, then I would see little benefit from the hash
> subpartitioning at that scale.
"

The minimum time granule for reports is 1 week. The typical time granule is 3 months.
There are users who need 1-month time granules. Many customers are accessing this
resource base and each customer has its own requirements in terms of time aggregations,
region aggregations, and product type aggregations.

At the moment we are discussing the following design:
1) Each country gets its own fact table, e.g. FACT_USA, FACT_UK etc. There will be
   ~60 such fact tables.
2) Each fact table will be range->list partitioned.
   Range partitioning (level 1) -> on the time dimension (monthly), for the last 4 years
   + some special time partitions -> 280 partitions.
   List partitioning (level 2) -> on the product type dimension. There are about 300
   different product type groups.

   Thus, a single country-specific fact table will have 280 x 300 = 84,000 (sub)partitions.
3) Using country-specific services:
   If a user wants aggregated UK data, he/she will connect to service UK, which is
   bound to instance1, for example.
   The hope is to exploit cache locality effects, e.g. table FACT_UK will be cached
   only in the buffer cache of instance1.
   The idea is to minimize interconnect traffic (see the profile above).
   Data loading is always country-specific, so there should be little "cache pollution"
   on each instance (the cache of instance1 should contain solely data from the FACT_UK table).

This should minimize the "gc cr %" interconnect traffic. 
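
A rough sketch of what one such country fact table could look like (all names,
boundary values, and list values are illustrative only; the real template would
enumerate all ~300 product type groups and all ~280 monthly range partitions):

CREATE TABLE dwh_rb.fact_uk
(  period_id        NUMBER NOT NULL,
   product_type_id  NUMBER NOT NULL,
   project_id       NUMBER NOT NULL
   -- ... remaining fact columns ...
)
PARTITION BY RANGE (period_id)
SUBPARTITION BY LIST (product_type_id)
SUBPARTITION TEMPLATE
(  SUBPARTITION sp_pt_001 VALUES (1),
   SUBPARTITION sp_pt_002 VALUES (2),
   SUBPARTITION sp_pt_oth VALUES (DEFAULT)  -- ... ~300 list subpartitions in total
)
(  PARTITION p_200501 VALUES LESS THAN (20050299999030),
   PARTITION p_200502 VALUES LESS THAN (20050399999030)
   -- ... one partition per month for the last 4 years + special partitions ...
);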


Best Regards. Milen 

-------- Original Message --------
> Date: Tue, 31 Mar 2009 09:47:42 -0700
> From: Greg Rahn <greg@xxxxxxxxxxxxxxxxxx>
> To: makulev@xxxxxxx
> CC: oracle-l@xxxxxxxxxxx
> Subject: Re: sane number of the table partitions in DWH

> My first comments would be if you are going to have an extreme design,
> then your testing should also be extreme, in both scale and depth.
> 
> > The number of calculated/estimated partitions is 200 000 - 300 000 (Yep,
> this is NOT a typing error !).
> 
> Is this total for the db or per table?
> 
> > 1) What is the "manageable" number of partitions (from performance point
> of view == parse times) that I you have seen by clients.
> 
> The number of partitions does affect parsing, but it also has an
> impact on the buffer cache.  If the segment header is not in the
> buffer cache the execution will take longer than if it was (slightly
> obvious).  The not obvious part of this is that if you are using
> parallel query, the query coordinator reads the extent maps serially
> so there can be a long time between start of query and when the PQ
> slaves start their work, unless you are on 11.1.0.7 or have bug
> 6525904 fix as a one-off.  To put this in perspective, for a FTS of a
> table of 92,160 segments I have seen this "stall" take 30 minutes
> prior to the bug fix.
> 
> > 2) Any parameters (also underscore ...) to reduce the hard-parse times
> (because of the enormous amount of partitions) ?
> 
> Set your buffer cache large enough to keep all the segment headers in it.
> 
> > 3) If I have so many partitions , what is the influence on the
> fragmentation of the shared pool? Any experiences ?
> 
> I don't think this is really an issue.  Warehouses usually hard parse
> everything, use it once, and it ages out fast, so there is likely
> little need for more than a few GB of shared pool memory.
> 
> > 4) Is multi-column partitioning a safe bet (bugs)?
> 
> Multi-column partitioning can work, however it behaves differently when
> it comes to partition elimination based on the type of partitioning
> and how many of the partitioning keys are present in the predicate in
> the query.  Test this well noting partition elimination before
> deciding on a design.
> 
> > Fact tables have sizes ~ 700 - 1000 GB.
> > Application type = DWH ( ~ 4.5 TB)
> > At the moment the partitioning scheme is range(date)/hash(location_id)
> 
> Based on these numbers I don't really see the benefit for the large
> number of partitions.  A 4TB table with 200K partitions means 20MB
> partitions (assuming even distribution).  That is quite small for a
> partition size.  I would say too small to warrant significant benefit.
>  And since you mention the fact tables are up to 1TB, well, that means
> single MB or smaller for each partition segment.  That is probably too
> small to be overly useful.
> 
> I would suggest range(date) with a granule size of 1 day because that
> tends to be the unit that business reports frequently use.  Unless you
> are either 1) doing fact-fact joins and want to leverage
> partition-wise joins or 2) single key access via location_id and not
> using indexes, then I would see little benefit from the hash
> subpartitioning at that scale.  Of course, this is based off very
> little provided information, so take it with a grain of salt.
> 
> -- 
> Regards,
> Greg Rahn
> http://structureddata.org
