Alternative to V$PQ_TQSTAT for troubleshooting unbalanced HJs?

  • From: "Tornblad, John" <JTornblad@xxxxxxxxxx>
  • To: <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 10 Aug 2012 13:12:43 -0500

I have a large INSERT ... SELECT parallel query involving dozens of hash
joins, hundreds of millions of rows and strongly suspect we are getting
bitten by an imbalance of workload distributed among the parallel
servers for certain HJs, but not all.  The suspicion has been sparked by
watching our query in OEM's Activity window on the query... there are
"good" periods of time where clearly the parallel slaves are working
jointly and equally on a join... and other "bad" periods where a single
parallel slave is chewing away by itself for an hour.
 

Looking at the outline data on the query, I can see dozens of these
three distribution modes

 

PQ_DISTRIBUTE(... BROADCAST NONE)

PQ_DISTRIBUTE(... NONE BROADCAST)

PQ_DISTRIBUTE(... HASH HASH)

 

Although I suspect the CBO is making good decisions for most of these
(we have skewed data and good stats including histograms for skewed
columns) I am wondering if on a few of those joins we'd be better served
by a BROADCAST vs. a HASH... perhaps dramatically so.

 

What I don't know is a lot...

 

First, I am not assuming that all parallel slaves should necessarily be
just as busy as each other all the time... our query includes WINDOW
SORTs and plenty of other goodies and expressions that I have no idea if
they (individually or when composited into our query) can be
parallelized or not in 11.2.0.3 or how they might appear in what I am
watching in OEM or in various V$ views on the running query.

 

Second, I read with excitement these posts...

 

http://oracledoug.com/px6.html

http://jonathanlewis.wordpress.com/2007/03/14/how-parallel/

 

which turned me on to V$PQ_TQSTAT but was deflated by Jonathan's caveat
"unfortunately, v$pq_tqstat tends to go wrong at exactly the moment when
the queries get tough".  Confirmed.  18 hours later (an interminable
wait!) a look into V$PQ_TQSTAT from the same session directly after a
run of the query completed showed it's busted... one row (expected this
to be hundreds).  As I write this up, I am wondering if I needed to look
at V$PQ_TQSTAT before I COMMITed the results but I'm not sure if I can
survive the capricious let down of a potential heisenbug again.

 

I've looked at _PX_TRACE but from what I've scanned in this
lightly-documented area there doesn't appear to be a variation of
options that gives equivalent output to V$PQ_TQSTAT showing the rows
delivered to each parallel server process.  Our should I say... an
easy-to-read equivalent.  There is plenty of output... but no docs on
how to interpret it or where to look for what.

 

To top it off there appears to be a bug in 11.2.0.3 / OEM 12c SQL
Monitoring such that something about our query makes it not show up at
all in OEM SQL Monitor (or in V$SQL_MONITOR) so I can't easily look when
the slow operation is running and just "see" what operation it's on and
try to work that back to what specific join might be causing an
imbalance of workload.

 

 

I'm not out of options... I think the most promising at this point is to
try a trivial join test case with some version of _PX_TRACE on, and try
to interpret the trace output and see where the numbers regarding TQ row
counts might be.

 

I'd appreciate anyone's comments, advice or even some commiseration at
this point.

 

-john

 



</pre>This message is confidential, intended only for the named 
recipient(s) and may contain information that is privileged or 
exempt from disclosure under applicable law.  If you are not 
the intended recipient(s), you are notified that the 
dissemination, distribution, or copying of this message is 
strictly prohibited.  If you receive this message in error or 
are not the named recipient(s), please notify the sender by 
return email and delete this message. Thank you.
--
//www.freelists.org/webpage/oracle-l


Other related posts: