Re: sane number of the table partitions in DWH

  • From: Milen Kulev <makulev@xxxxxxx>
  • To: greg@xxxxxxxxxxxxxxxxxx
  • Date: Sun, 05 Apr 2009 20:52:51 +0200

Hello Greg,
I agree, the "bitmap conversion filter ratio" of 10M -> ~800 is really bad.
My customer has decided to implement a new partitioning scheme with that many partitions (~80k).
This should solve the inefficient "bitmap conversion filter ratio" problem.
Many thanks to all who contributed to this thread.

Best Regards.
Milen

Greg Rahn wrote:
I would agree with Riyaj's comments.  Simply put, 10 million rows through
BITMAP CONVERSION TO ROWIDS is far too many to perform well.
Personally, I would probably draw the line at 250K as a reasonable
number before going back to revisit the physical design options.

I would also comment that parse time is probably the least of your
worries.  Personally I have never seen it be a significant issue in a
warehouse.  Do be aware of the segment header caching issue, but given
a frequently used set of partitions and a large enough buffer cache
and the bug fix, that should be a non-issue as well.
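For reference, one way to see how many rows actually flow through the
BITMAP CONVERSION TO ROWIDS step is to capture rowsource statistics.
A minimal sketch, assuming 10g; fact_sales and the literal value are
only placeholders, period_id is the time key discussed below:

  -- re-run the slow statement once with rowsource statistics enabled
  SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*)
  FROM   fact_sales
  WHERE  period_id = 200914;

  -- A-Rows vs. E-Rows per plan line, including the bitmap conversion step
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));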

Let's talk some requirements and design:

 Data volatility:
 This is an operational DWH database. Data is loaded almost continuously (mainly INSERTs + UPDATEs).
 Bitmap indexes get bloated/corrupted and must be rebuilt often.
 Furthermore, data in the fact tables cannot be preaggregated and/or compressed (because of the steady data changes).


The minimum time granule for reports is 1 week. The typical time granule is 3 months.
There are users who need 1-month time granules. Many customers access this resource base,
and each customer has its own requirements in terms of time aggregations, region aggregations,
and product type aggregations.

These two requirements seem to conflict somewhat in my mind.  If the
minimum time granule is 1 week, why not bulk load daily?  What is the
requirement for trickle feed if there is not up-to-the-minute
reporting?  I bring this up for a few reasons.  First, compression.  I
consider the use of table compression a requirement for data
warehousing (the other two being partitioning and parallel query).
The exception being if the compression ratio is too low to reap gains
from it (say below 2:1).  Since you are not on 11g, Advanced
Compression is not an option.  Compression is not about space savings
per se, it is about reduction in data transfer.  Think: why do people
compress/zip files for email or downloads?  It is not to save space.
It reduces the bandwidth requirement for the file transfer, which is
frequently the problem resource in data warehouses.
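On 10g, basic table compression only takes effect for direct-path
operations (CTAS, INSERT /*+ APPEND */, direct-path SQL*Loader), which
is exactly why daily bulk loads and compression go together.  A minimal
sketch, with fact_sales and stage_sales_daily as placeholder names:

  -- empty compressed copy of the fact table (basic 10g compression)
  CREATE TABLE fact_sales_c COMPRESS
  AS SELECT * FROM fact_sales WHERE 1 = 0;

  -- the daily bulk load: direct-path insert keeps the new blocks compressed,
  -- whereas conventional trickle INSERTs/UPDATEs would not be compressed
  INSERT /*+ APPEND */ INTO fact_sales_c
  SELECT * FROM stage_sales_daily;
  COMMIT;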

The idea is to have many partitions, so that more granular partition pruning can occur.
Each partition should be read via FTS using multiblock reads (MBRC=16).

I'm all for good partitioning and full partition scans using parallel
query, but you must also consider that, if you have inserts/updates
going on, a query will force all the dirty blocks to be flushed to disk
so that PQ can read them back via a direct (physical) read.  This may
be problematic depending on how frequently queries are being executed
and brings me back to the comments about the requirement for trickle
feed.  Also, I would leave your MBRC unset and let it default.  This
will give you the optimal I/O size (1MB) for FTS.
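A rough sketch of that advice (parameter syntax as of 10g; the table
name fact_sales, the amount column, and the literal period range are
placeholders):

  -- remove any explicit MBRC setting so it defaults to ~1MB reads
  -- (e.g. 128 blocks x 8KB); takes effect at the next instance restart
  ALTER SYSTEM RESET db_file_multiblock_read_count SCOPE = SPFILE SID = '*';

  -- analytic query relying on partition pruning plus parallel full scans
  SELECT /*+ FULL(f) PARALLEL(f, 8) */
         f.domain_product_id, SUM(f.amount)
  FROM   fact_sales f
  WHERE  f.period_id BETWEEN 200901 AND 200913
  GROUP  BY f.domain_product_id;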

My customer plans to use a RANGE-LIST partitioning scheme:
Range on time (period_id), 280 partitions (slowly changing over time);
List on product type (domain_product_id), 300 distinct values (fixed).

Thus we should get 280 x 300 = 84,000 partitions.  I *suppose* we should get
reasonable parse times.

This sounds reasonable, but again, I don't think your worry should be
parse times.  Especially since I have seen zero evidence that this is
actually an issue.
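For illustration, a heavily trimmed sketch of such a composite
RANGE-LIST definition (only two of the 280 range partitions and three
of the 300 list values are shown; everything except period_id and
domain_product_id is a placeholder):

  CREATE TABLE fact_sales (
    period_id          NUMBER NOT NULL,
    domain_product_id  NUMBER NOT NULL,
    amount             NUMBER
  )
  PARTITION BY RANGE (period_id)
  SUBPARTITION BY LIST (domain_product_id)
  SUBPARTITION TEMPLATE (
    SUBPARTITION sp_prod_001   VALUES (1),
    SUBPARTITION sp_prod_002   VALUES (2),
    SUBPARTITION sp_prod_other VALUES (DEFAULT)
  )
  (
    PARTITION p_200901 VALUES LESS THAN (200902),
    PARTITION p_200902 VALUES LESS THAN (200903)
  );

With a subpartition template, every new range partition gets the same
set of list subpartitions, which is how the 280 x 300 segment count
comes about.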

When it comes to query acceleration for data warehousing, it basically
comes down to a few things.  First, for keyed access (single record
look-ups) indexes work great.  For analytic queries (big aggregations)
it's all about partitioning and choosing your partition and
subpartition keys wisely so as to maximize data elimination.  After
that, toss in some parallel query (PQ) to scan those partitions in
parallel.  This will take hardware resources, mainly CPU and I/O
bandwidth.  If you need more acceleration yet, then work smarter: look
into building aggregations using materialized views and query rewrite,
or buy more hardware.
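As a sketch of the "work smarter" option (again with placeholder
names), a summary materialized view the optimizer can rewrite against
might look like:

  CREATE MATERIALIZED VIEW mv_sales_by_period_product
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    ENABLE QUERY REWRITE
  AS
  SELECT period_id, domain_product_id,
         SUM(amount) AS amount_sum, COUNT(*) AS row_cnt
  FROM   fact_sales
  GROUP  BY period_id, domain_product_id;

  -- with QUERY_REWRITE_ENABLED = TRUE, matching aggregations over
  -- fact_sales can be answered from the MV transparently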

You can use brute force (parallel query and full partition scans) or
brains (mat views and query rewrite), or a mix of both.  It all depends
on the requirements.

