Re: Sorting Large Amounts of Data- Oracle to SAS, Part Deux

  • From: Kellyn Pedersen <kjped1313@xxxxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx, info@xxxxxxxxxxxxxxxxxxxxx
  • Date: Fri, 14 May 2010 08:06:45 -0700 (PDT)

Hello Randolf!
I actually used Jonathan's example to help my lead mart developer work through 
the initial concepts of the sorted hash cluster - I agree, excellent example!
 
I will also admit - I was at the point of trying most anything in an attempt to 
get SAS to stop performing an order by after the group by (yes, you are right 
again about the pivot of the data! :))  I also see a mistake in my explanation - 
it's not 1.4 million rows; try 1.4 billion rows pivoted and then ordered... :(
 
My heart pretty much sank when the request was then made of me to find a way to 
get parallel out of this, even though my lead mart developer and I had told 
them we knew of no such way to do this.  The ideas you have given here are 
excellent ones and I'll keep them on the side table, just in case, but it looks 
like we are onto Cosort at this point to pull the data to Linux and let it sort 
it, (why oh why must we work hard instead of working smart... anyone?  Anyone?  
Bueller? :))
 
Thank you again, this is excellent advice!


Kellyn Pedersen
Sr. Database Administrator
I-Behavior Inc.
http://www.linkedin.com/in/kellynpedersen
www.dbakevlar.blogspot.com
 
"Go away before I replace you with a very small and efficient shell script..."

--- On Fri, 5/14/10, Randolf Geist <info@xxxxxxxxxxxxxxxxxxxxx> wrote:


From: Randolf Geist <info@xxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Sorting Large Amounts of Data- Oracle to SAS, Part Deux
To: oracle-l@xxxxxxxxxxxxx
Date: Friday, May 14, 2010, 8:24 AM



> Thank you Greg, you have just verified my fears of all my research...sniff, 
> sniff..
>  
> This is a simple select with a pass thru query from SAS, over the wire.  It is 
> 1.4 million rows, using a sorted_hash_cluster to sort the data for them so 
> they won't do a group by and an order by in their query.
>  
> The statement with both, without the sorted hash cluster takes an unknown 
> amount of time-  they just couldn't get it to complete.  The sorted hash 
> cluster completes in 4 hrs.
>  
> Of course, now the boss wants to know-  can I incorporate parallel?  So much 
> for that idea...  We are pretty far into this design at this point, but any 
> other ideas are always welcome!
> Thanks!

Kellyn,

I know this is a bit late, but if you think about how sorted hash clusters seem 
to work under the covers, which Jonathan Lewis has done a great job of 
explaining here:

http://jonathanlewis.wordpress.com/2009/10/05/hash-clusters-2/ (follow the 
links in this summary post)

the big difference between using a conventional heap table and the sorted hash 
cluster is that Oracle under the covers uses a two-pass approach for the 
sorted hash cluster:

1. First it retrieves a minimum of information from the cluster (basically the 
ROWID and the sort key), sorts this information (so there is usually a sort 
operation involved that is not exposed in the plan), and then 

2. In a second pass it retrieves the rows using the sorted result (including the 
ROWIDs) generated in the first pass

The point here is that this minimizes the amount of data to sort but potentially 
drives up the I/O, since you retrieve every row by single block ROWID access. The 
effectiveness of this operation then depends on the clustering factor of the 
sort key with respect to the data in the cluster (which is *not* stored in a 
sorted order).

It should also be noted that the sorted hash cluster comes at the price of some 
overhead (and potential space wastage), including an automatically created 
index that has to be maintained; see Jonathan's description for more details.
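
For reference, this is roughly what such a single-table sorted hash cluster 
definition looks like - a minimal sketch with made-up object and column names 
(not your actual schema):

create cluster orders_cluster (
    cust_id   number,
    order_seq number sort       -- SORT column defines the retrieval order per hash key
)
single table
hashkeys 100000
hash is cust_id
size 8192;

create table orders (
    cust_id   number,
    order_seq number sort,
    amount    number,
    payload   varchar2(100)
)
cluster orders_cluster (cust_id, order_seq);

Accessing a single hash key (here cust_id) should then return the rows ordered 
by order_seq without a separate sort operation showing up in the plan.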

Jonathan then shows here: 
http://jonathanlewis.wordpress.com/2008/04/27/manual-optimisation/

that you can attempt to mimic something like that yourself. In your case, using 
this manual approach, you might be able to parallelize the first pass that gets 
and sorts the ROWIDs. I'm unsure, however, whether you can parallelize the second 
pass, since it theoretically depends on accessing the rows in the order defined 
by the first pass. Doing this in parallel should screw up this intended order, 
but Jonathan mentions in one of the comments to his posts that to his surprise 
it still worked when he tried to run it in parallel - so you might want to give 
it a try.
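
Just to illustrate the shape of such a manual rewrite, here is a minimal sketch 
(table, index and column names are made up, this is not your actual statement, 
and the hints may need adjusting for your version):

-- assumes a hypothetical heap table big_t(cust_id, sort_key, ...)
-- with an index big_t_ix on (cust_id, sort_key)
select /*+ leading(v) use_nl(t) */
       t.*
from   (
         select /*+ index(s big_t_ix) no_merge */
                rowid as rid
         from   big_t s
         where  cust_id = :b1
         order  by sort_key          -- first pass: sort only ROWID + sort key
       ) v,
       big_t t
where  t.rowid = v.rid;              -- second pass: single block ROWID lookups in the sorted order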

In your case it looks like you additionally need a GROUP BY clause (presumably 
to pivot some data?), and I think this is where the manual approach described 
by Jonathan might hit current limitations of the optimizer: I don't think you 
can tell the optimizer that the GROUP BY is supposed to be a NOSORT operation. 
This is something that the sorted hash cluster can do and the optimizer knows 
about, but the manual optimisation probably can't. Still, if the data already 
goes pre-sorted into a GROUP BY operation it might turn out to be quite 
efficient.
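
To make the difference concrete, again using the made-up names from the cluster 
sketch above: a grouped query against the sorted hash cluster can show up with a 
SORT GROUP BY NOSORT operation, while the same aggregation wrapped around the 
manual rewrite will, as far as I can tell, still be costed as a regular SORT 
GROUP BY even though its input arrives pre-sorted:

-- against the hypothetical orders cluster the plan can use SORT GROUP BY NOSORT,
-- and the ORDER BY should not need a separate sort step either
select cust_id, order_seq, sum(amount)
from   orders
where  cust_id = :b1
group  by cust_id, order_seq
order  by order_seq;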

So in your particular case, where you say it takes 4 hours to complete, you 
would need to understand where those 4 hours are spent (one way to measure this 
is sketched below, after the list):

- If the majority of the time is spent on the network transferring the data to 
the SAS client, there is probably no point in further optimisation (as outlined 
by Greg)
- If the majority of the time is spent on performing the first pass (the sort 
of the minimum data set) then the manual approach might give you some room for 
improvement
- If the majority of the time is spent on performing the second pass (the row 
lookup by ROWID) then it depends on what I've outlined above
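
One straightforward way to get that breakdown is an extended SQL trace (event 
10046, level 8) of the session running the pass-through query - for example via 
DBMS_MONITOR, assuming you can identify the SID and SERIAL# of the SAS session 
in V$SESSION. The tkprof output of the trace file then separates the db file 
sequential read time (the ROWID lookups), the sort work, and the SQL*Net waits 
(the transfer to the client):

-- 123 / 4567 are placeholders for the SAS session's SID and SERIAL# from V$SESSION
exec dbms_monitor.session_trace_enable(session_id => 123, serial_num => 4567, waits => true, binds => false)

-- ... let the pass-through query run ...

exec dbms_monitor.session_trace_disable(session_id => 123, serial_num => 4567)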

Regards,
Randolf

Oracle related stuff blog:
http://oracle-randolf.blogspot.com/

Co-author of the "OakTable Expert Oracle Practices" book:
http://www.apress.com/book/view/1430226684
http://www.amazon.com/Expert-Oracle-Practices-Database-Administration/dp/1430226684






