Re: Sorting Large Amounts of Data- Oracle to SAS, Part Deux

  • From: Randolf Geist <info@xxxxxxxxxxxxxxxxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Fri, 14 May 2010 16:24:02 +0200 (CEST)

> Thank you, Greg, you have just confirmed the fears from all my research... sniff,
> sniff..
>  
> This is a simple select with a pass-through query from SAS, over the wire.  It is
> 1.4 million rows, using a sorted hash cluster to sort the data for them so they
> won't need a group by and an order by in their query.
>  
> The statement with both the group by and the order by, without the sorted hash
> cluster, takes an unknown amount of time - they just couldn't get it to
> complete.  The sorted hash cluster version completes in 4 hrs.
>  
> Of course, now the boss wants to know - can I incorporate parallel?  So much
> for that idea...  We are pretty far into this design at this point, but any
> other ideas are always welcome!
> Thanks!

Kellyn,

I know this is a bit late, but if you think about how sorted hash clusters seem to work under the covers, which Jonathan Lewis has done a great job of explaining here:

http://jonathanlewis.wordpress.com/2009/10/05/hash-clusters-2/ (follow the links in this summary post)

the big difference between using a conventional heap table and a sorted hash cluster is that for the sorted hash cluster Oracle internally uses a two-pass approach:

1. First it retrieves a minimum of information from the cluster (basically the ROWID and the sort key) and sorts this information (so there is usually a sort operation involved that is not exposed in the plan), and then

2. In a second pass it retrieves the full rows using the sorted result, including the ROWIDs, generated in the first pass

The point here is that this minimizes the amount of data to sort but potentially drives up the I/O, since every row is then retrieved by single-block ROWID access. The effectiveness of this operation therefore depends on how well the sort order correlates with the physical order of the data in the cluster (which is *not* stored in sorted order) - in other words, on the clustering factor.

It should also be noted that the sorted hash cluster comes at the price of some overhead (and potential space wastage), including an automatically created index that has to be maintained - see Jonathan's description for more details.
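For reference, the DDL for such a sorted hash cluster looks something like the following - a minimal sketch based on the syntax in the Oracle documentation, with made-up object names:

    create cluster call_cluster (
        call_number    number,
        call_timestamp number sort   -- the SORT keyword defines the sort key
    )
    hashkeys 10000                   -- expected number of distinct hash keys
    hash is call_number              -- the hash (lookup) key
    size 256;                        -- space reserved per cluster key

    create table calls (
        call_number    number,
        call_timestamp number sort,
        call_duration  number,
        other_info     varchar2(30)
    )
    cluster call_cluster (call_number, call_timestamp);

Rows for a given CALL_NUMBER are then returned ordered by CALL_TIMESTAMP without an explicit sort showing up in the plan.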

Jonathan then shows here: http://jonathanlewis.wordpress.com/2008/04/27/manual-optimisation/

that you can attempt to mimic something like this yourself. In your case, using this manual approach, you might be able to parallelize the first pass that gets and sorts the ROWIDs. I'm unsure, however, whether you can parallelize the second pass, since it theoretically depends on accessing the rows in the order defined by the first pass. Doing this in parallel should screw up the intended order, but Jonathan mentions in one of the comments to his posts that, to his surprise, it still worked when he tried to run it in parallel - so you might want to give it a try.
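To illustrate the idea, here is a minimal sketch of that kind of manual two-pass rewrite, following the pattern from Jonathan's post - table T1 and column SORT_COL are hypothetical, and whether the hints hold up in your version and in parallel needs testing:

    select
        /*+ leading(v) use_nl(t) rowid(t) */
        t.*
    from
        (
          -- first pass: read only the ROWIDs, sorted by the sort key;
          -- a PARALLEL hint could be attempted on this query block
          select
              /*+ no_merge no_eliminate_oby */
              rowid as rid
          from
              t1
          order by
              sort_col
        ) v,
        t1 t
    where
        -- second pass: fetch the full rows by ROWID in the sorted order
        t.rowid = v.rid;

The NO_ELIMINATE_OBY hint is there to stop the optimizer from throwing away the seemingly redundant ORDER BY in the inline view.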

In your case it looks like you additionally need a GROUP BY clause (presumably to pivot some data?), and I think this is where the manual approach described by Jonathan might hit current limitations of the optimizer: you are probably not able to tell the optimizer that the GROUP BY is supposed to be a NOSORT operation. This is something that the sorted hash cluster can do and the optimizer knows about, but which the manual optimisation probably can't express. Still, if the data arrives pre-sorted at the GROUP BY operation, it might turn out to be quite efficient.
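For comparison, against a sorted hash cluster like the sketch above, a query along these lines is where the optimizer can come up with a SORT GROUP BY NOSORT operation, because it knows the rows arrive ordered by the sort key (again, hypothetical names):

    select
        call_number,
        call_timestamp,
        count(*)
    from
        calls
    where
        call_number = :n
    group by
        call_number,
        call_timestamp
    order by
        call_timestamp;       -- can also become a NOSORT operation here

The manual rewrite has no comparable way to promise the optimizer that its input is already sorted, so a GROUP BY on top of it will most likely still sort.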

So in your particular case, where you say that it takes 4 hours to complete, you would need to understand where these 4 hours are spent (an extended SQL trace, sketched after this list, can tell you):

- If the majority of the time is spent on the network, transferring the data to the SAS client, then there is probably no point in further optimisation (as outlined by Greg)
- If the majority of the time is spent performing the first pass (the sort of the minimal data set), then the manual approach might give you some room for improvement
- If the majority of the time is spent performing the second pass (the row lookup by ROWID), then it depends on the clustering effects outlined above
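Here is a minimal sketch of enabling such a trace with DBMS_MONITOR - the SID and SERIAL# of the SAS session need to be looked up in V$SESSION first:

    begin
        -- trace the SAS session including wait events
        dbms_monitor.session_trace_enable(
            session_id => :sid,
            serial_num => :serial,
            waits      => true,
            binds      => false
        );
    end;
    /

    -- ... let the query run, then:

    begin
        dbms_monitor.session_trace_disable(
            session_id => :sid,
            serial_num => :serial
        );
    end;
    /

In the resulting trace/tkprof output, time on 'SQL*Net message from client' and 'SQL*Net more data to client' points at the network/SAS side, 'db file sequential read' at the second-pass ROWID lookups, and 'direct path read temp'/'direct path write temp' at the sort spilling to disk.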

Regards,
Randolf

Oracle related stuff blog:
http://oracle-randolf.blogspot.com/

Co-author of the "OakTable Expert Oracle Practices" book:
http://www.apress.com/book/view/1430226684
http://www.amazon.com/Expert-Oracle-Practices-Database-Administration/dp/1430226684
  

