RE: HASH Partitioning question

  • From: "Mark W. Farnham" <mwf@xxxxxxxx>
  • To: <dmarc-noreply@xxxxxxxxxxxxx>, <tim@xxxxxxxxx>, "'ORACLE-L'" <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 13 Feb 2015 09:46:26 -0500

First, +42 on Tim’s answer. That is as complete an answer as you’re going to 
get for the generic case.

 

Now that we know you’re actually trying to avoid this change, that you’re 
likely using a rolling exchange-partition method based on date, and that you 
are apparently doing a two-table join, we can be a bit more specific:

 

1)    IF we know the particular reason the person making the suggestion 
thinks a move to hash partitioning would be better (probably some specific 
query), then we could possibly help you determine whether the theory holds 
enough water to warrant a test.

2)    IF your major demand query is a union over range pairs of the tables, 
matching the partitioning of both tables, then it is almost certain that you 
can get a major performance improvement by generating the set of 
range-specific joins strung together with UNION ALLs (or possibly even by 
concatenating the piecewise-generated, single-partition, single-range join 
results into the report, if the major order of the report is date and there 
is no overall totaling); see the sketch after this list. In this fortunate 
event you could use manual piecewise parallelism on each partition range, 
and only the final filewise concatenation would be serial. When such 
semantic manual parallelism is possible, no PX overhead is required, because 
the pieces are separate non-parallel operations from Oracle’s viewpoint. If 
you’re producing data in a format for archiving or for loading into another 
system, then you don’t even need to jam the pieces together; you just need a 
reasonable output-filename nomenclature and possibly a plan for making sure 
the pieces are loaded in the desired order or in parallel. Please be certain 
you understand why UNION ALL is guaranteed to generate no inter-partition 
duplicates for matching partition ranges (not only for equijoins but also 
for range matches aligned with the partition boundaries), and why the ALL is 
required to avoid the massive overhead of slapping the pieces together with 
de-duplication.

3)    IF (same condition as 2), it is possible that an additional secondary 
partitioning would dramatically improve your join mechanics, if your major 
reports are date first and something else second, with that “something else” 
dominant enough in requests to justify the extra work of subpartitioning.
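
By way of illustration, here is a rough sketch of the piecewise UNION ALL 
approach from 2), assuming hypothetical tables FACT_A and FACT_B, both 
range-partitioned by month on DATE_KEY with partitions named P2015_01, 
P2015_02, and so on (all names invented):

    -- Each branch joins exactly one matching partition pair. Oracle
    -- treats each branch as an ordinary serial operation, so no PX
    -- overhead is incurred.
    SELECT a.date_key, a.cust_id, b.amount
      FROM fact_a PARTITION (p2015_01) a,
           fact_b PARTITION (p2015_01) b
     WHERE a.cust_id = b.cust_id
    UNION ALL
    SELECT a.date_key, a.cust_id, b.amount
      FROM fact_a PARTITION (p2015_02) a,
           fact_b PARTITION (p2015_02) b
     WHERE a.cust_id = b.cust_id;
    -- ...one branch per partition range. Because each branch touches a
    -- disjoint date range in both tables, the branches cannot produce
    -- inter-partition duplicates, which is why UNION ALL (no sort for
    -- de-duplication) is safe and far cheaper than UNION.

For the manual-parallelism variant, each branch would instead run as its own 
serial session spooling to its own output file, named so the pieces 
concatenate (or load) in the desired order.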

 

good luck,

 

mwf

 

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
Behalf Of Deepak Sharma (Redacted sender "sharmakdeep_oracle@xxxxxxxxx" for 
DMARC)
Sent: Friday, February 13, 2015 12:55 AM
To: tim@xxxxxxxxx; ORACLE-L
Subject: Re: HASH Partitioning question

 

Hi Tim,

 

Some very good tips and explanations. Thanks.

 

The 6TB of archive logs is for the whole DB. We've got some tables that are 
5-7TB in size, and several others that are smaller, in the 100-300GB range.

 

We do use direct-path loads (picked up from your seminar in 2005), along 
with compression on the target tables; otherwise the DB size would be even 
higher (the average compression ratio we get is 60-70%).
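
Roughly like this, with made-up names. Basic table compression only takes 
effect for direct-path operations, which is why the APPEND hint matters:

    CREATE TABLE sales_stage NOLOGGING COMPRESS
    AS SELECT * FROM sales WHERE 1 = 0;     -- empty compressed staging table

    INSERT /*+ APPEND */ INTO sales_stage   -- direct-path insert
    SELECT * FROM ext_sales_feed;           -- e.g. an external-table source
    COMMIT;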

 

Let me try to re-frame the question.

 

Say we have 2 tables (5TB each):

The first is RANGE partitioned and the other HASH partitioned (no 
sub-partitioning), both on the same DATE_KEY column.

 

When a query uses a specific DATE_KEY value or values, then in the RANGE 
case it would prune to the relevant partition(s).

 

However (and here's the doubt), in the HASH case the same query would not 
prune to specific partition(s), and it would need to use parallel execution 
to access all the partitions.
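
One way to check that empirically would be the Pstart/Pstop columns of the 
execution plan (a sketch, with a hypothetical table T hash-partitioned on 
DATE_KEY):

    EXPLAIN PLAN FOR
    SELECT * FROM t WHERE date_key = DATE '2015-02-13';

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
    -- Pstart/Pstop show which partitions will be touched: a single
    -- partition number means pruning occurred; 1 / <total> means a scan
    -- of every partition. Note that equality and IN-list predicates on a
    -- hash partition key can prune, while range predicates (BETWEEN, <, >)
    -- on a hash key cannot.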

 

Personally, I want to keep the table as-is, RANGE partitioned, but I need to 
give an explanation of why changing it to HASH would cause higher resource 
consumption (by enabling parallel execution for all the users).

 

 

-Deepak

On Thursday, February 12, 2015 10:09 PM, Tim Gorman <tim@xxxxxxxxx> wrote:

 

Deepak,

One more question:  when you say "changing it to HASH partition" from the 
present RANGE partitioning, do you mean changing it from RANGE to RANGE-HASH 
sub-partitioned, or just from RANGE to HASH with no sub-partitioning?

I think that the latter choice, changing from RANGE to HASH with no 
sub-partitioning, is going to ruin your data-loading scheme if you're using 
the EXCHANGE PARTITION load technique.  If you're not using the EXCHANGE 
PARTITION load technique (and I suspect that might be the case, given that 
you state "no batch window for loads"), then that may explain the 6 TB of 
archive logs per day, because you're unable to use direct-path loads.  But 
that's a whole other matter, isn't it?

Anyway, hopefully you're planning to change from RANGE to RANGE-HASH 
sub-partitioned, so that you can continue to load by date range and purge by 
date range.
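
For concreteness, a composite layout might look like this (all names and the 
subpartition count invented for illustration):

    CREATE TABLE sales (
      date_key  DATE,
      cust_id   NUMBER,
      amount    NUMBER
    )
    PARTITION BY RANGE (date_key)
    SUBPARTITION BY HASH (cust_id) SUBPARTITIONS 16 (
      PARTITION p2015_01 VALUES LESS THAN (DATE '2015-02-01'),
      PARTITION p2015_02 VALUES LESS THAN (DATE '2015-03-01')
    );

    -- Rolling-window loads and purges still work on the range partitions;
    -- note the staging table must itself be hash-partitioned (same key,
    -- same subpartition count) for the exchange to succeed.
    ALTER TABLE sales EXCHANGE PARTITION p2015_02 WITH TABLE sales_stage;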

At any rate, the architecture of PX is relevant to this question.  There is a 
single query-coordinator (QC) session and "N" parallel-execution worker 
sessions.  The QC session serves as a collation point for all the data returned 
from the worker sessions.  So, parallel-execution queries can work well when 
you're scanning a large volume of data, but only returning a small number of 
rows.  That is their primary use-case: throwing lots of CPU and memory at 
the problem at once.
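
For example, something like this (names invented): each of the eight PX 
workers scans its own slice of the table, and only a handful of aggregated 
rows flows back through the QC:

    SELECT /*+ FULL(s) PARALLEL(s 8) */
           s.date_key, COUNT(*), SUM(s.amount)
      FROM sales s
     GROUP BY s.date_key;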

Now what will happen if you have a parallel query that is scanning a large 
volume of data and returning a large number of rows?  The single QC session 
will be overwhelmed by the huge volumes of data returned from all of the PX 
worker sessions, and queuing will result.  In other words, the QC becomes 
the bottleneck, and the total elapsed time of the parallel query is limited 
by the capability of that single QC session to return rows; that is, not 
very fast.  In this situation, it is probably faster to dispense with 
parallel execution altogether and just run a serial query.

So, if your proposed query is returning a large number (i.e. millions or 
billions) of rows, then you're just plain doomed.  PX won't help.  Indexes and 
partitioning won't help.  At that point, it is probable that it isn't your 
query that needs to be tuned, but your application logic.  After all, what 
use are millions or billions of rows unless they're all going to another table, in 
which case you should use INSERT ... SELECT.  If you're just SELECTing them and 
then displaying them on a screen, nobody is ever going to look at them all, so 
what's the point?  Likewise if you're going to print them in a report.

However, if your proposed query is returning a small number of rows after 
scanning a huge number of rows, then you're either filtering or aggregating or 
both.  If you're filtering, then enabling partition-pruning or indexing can be 
your best bet.  If you end up using indexing for filtering, then parallel 
execution is not likely to work well.  If you enable partition-pruning (either 
by the RANGE partition-key or the HASH sub-partition-key), then you can still 
do FULL table scan with parallel execution, but now against a smaller volume of 
data, which will be faster.
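
Combining the two might look like this (same invented SALES table, 
range-partitioned on DATE_KEY): the predicate prunes to a single month's 
partition, and PX then scans only that smaller slice:

    SELECT /*+ FULL(s) PARALLEL(s 8) */
           s.cust_id, SUM(s.amount)
      FROM sales s
     WHERE s.date_key >= DATE '2015-01-01'
       AND s.date_key <  DATE '2015-02-01'  -- prunes to January only
     GROUP BY s.cust_id;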

Hopefully that helps?

Thanks!

-Tim

On 2/12/15 19:07, Deepak Sharma wrote:

It's a DW environment close to 150TB, with almost 90% of tables partitioned, 
generating 6TB of archive logs a day (24x7 loading with no batch window for 
loads).

 

On Thursday, February 12, 2015 8:02 PM, Deepak Sharma 
<dmarc-noreply@xxxxxxxxxxxxx> wrote:

 

Sure Tim (btw, I attended your DW seminar several years ago in Minneapolis and 
have implemented a few ideas that have really helped).

 

A 600GB table with 14 billion rows. It's currently partitioned by range on a 
loading-date column, and has a rolling-window retention. Someone suggested 
changing it to HASH partitioning on another column (let's say ABC) which is 
more frequently used, but which may not have an even distribution (so RANGE 
on it would give skewed data). My guess is that in order for queries to use 
that new ABC hash-partitioned column (it could be an equi-join, IN-list, 
etc.), PX is needed.

 

Let me know if you need more details.

 

-Deepak

 

On Thursday, February 12, 2015 3:19 PM, Tim Gorman <tim@xxxxxxxxx> wrote:

 

C'mon, you need to give us more than that, please?


Query for one row?  Query for 10 million rows?  Aggregating?  Not aggregating?  
Exadata or laptop or something in between?  Oracle7?  Oracle8i?  Windows?  
Linux?  Android?

On 2/12/15 13:55, Deepak Sharma (Redacted sender sharmakdeep_oracle@xxxxxxxxx 
for DMARC) wrote:

Is it true that if a table (say 1 billion rows) is HASH partitioned, then 
the most efficient way to query it needs to use Oracle parallel threads?
