RE: Solaris T5220 server problem

  • From: Wolfson Larry - lwolfs <lawrence.wolfson@xxxxxxxxxx>
  • To: "oracle-l@xxxxxxxxxxxxx" <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 27 May 2011 23:31:33 +0000

An update on this.  The problem is the way Solaris tries to build large pages 
for new processes.
When the server has been up for a long time, memory is severely fragmented.
When a new process is started, the OS tries to provide large pages; since it 
can't find any, it keeps trying and trying until it finally coalesces a bunch 
of small ones.  Then it does the same for the next process, and the next.
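If you want to see what a freshly started process actually ended up with, pmap 
shows it (a rough sketch; any new PID will do):

   # page sizes the hardware supports
   pagesize -a

   # the Pgsz column shows the page size actually backing each segment
   pmap -sx <pid>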
  We found another client with their server up over 2 years and experiencing 
the same problem.  Fortunately they needed some patching, and after the reboot 
they were stunned at the improved performance.  We've talked both of them 
into quarterly reboots, and the first one has page coalescing turned off.

It isn't in any Solaris doc that we could see; if anyone else has more 
information, let me know.
Although it's not documented, you can see the setting with the mdb command.
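The tunable name isn't in anything I can quote, so treat pg_contig_disable 
below purely as an illustration -- it's the one most often cited for turning 
coalescing off, but confirm the exact name with support before touching 
anything.  Reading the current value from a running kernel looks like:

   echo "pg_contig_disable/D" | mdb -k

and the permanent form is a one-liner in /etc/system, picked up at the next 
reboot:

   * disable coalescing of small pages into large ones
   set pg_contig_disable = 1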

   As for changing it dynamically, we tried it twice and both times the servers 
crashed.  Fortunately everything had been shut down beforehand for a reboot 
and we weren't impacted.
Just watch out for that.

It's easy to spot, especially if you have more than one server.  Just do a 
truss -c on sqlplus on both with a quick script and you'll see the difference.
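Something like this on each box is all I mean (the connect string and script 
are placeholders -- any trivial script that connects and exits will do):

   # -c prints a system call count/time summary instead of the full trace
   truss -c sqlplus -S user/password @quick_test.sql

Same work on both servers, but the elapsed time on the bad one gives it away.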

  Larry

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
Behalf Of Wolfson Larry - lwolfs
Sent: Wednesday, April 27, 2011 7:15 PM
To: oracle-l@xxxxxxxxxxxxx
Subject: Solaris T5220 server problem

Hello!
            Finally convinced the client that the long-running code wasn't a 
database, application, or network problem.

Noticed that one of my queries, which usually runs in a tenth of a second of 
elapsed time, was taking about 8 seconds on the production server:
8G, 32 CPUs, with both 10.2.0.4 prod & test (separate ORACLE_HOMEs) on the 
same server.

Wanted the Unix admin to run some type of DTrace; I had already run truss a 
number of times.
Didn't get that, but the SA found echo was taking about 30-60 times longer on 
this server than on dozens of others we manage (most not T5220s).
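I don't know exactly what the SA timed, but something as simple as this shows 
it, since every run is a fresh fork/exec:

   # real/user/sys for a single run
   ptime /usr/bin/echo hello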
They ran GUDS, which didn't help, and then the support person came up with 
this from a buddy he reached out to.


He suggested turning page coalescing off, which we found to be beneficial in 
many performance escalations.  This is something you can do on the fly, and if 
it's found to have a desirable effect, it can be set permanently in 
/etc/system.  There are no known downsides to doing this in the real world.



Once this is enabled, could your DBAs run some test jobs which can be 
compared against timings for the same jobs when the test DB is down?



Here are the dirty details from previous communications on the topic:

quote --->

Large pages are not a problem; it is finding or coalescing them when none are 
available that needs improvement.  The LPOOB (large pages out of the box) 
feature is designed to improve application out-of-box performance.  A number 
of LPOOB fixes have already been integrated in Sol10 U4, and more are planned 
for U5 and U6.



It is wiser to disable coalescing than to disable LPOOB.  If you don't want 
page coalescing, then set the following tunables dynamically or in the 
/etc/system file.

And
What I didn't mention before is that the page coalescing issue is specifically 
mentioned for the Niagara family of CPUs, which is what this T5220 runs, on 
systems running Java applications and Oracle databases (the Oracle part 
being pertinent here).  Still not saying that it's definitely going to 
resolve the problems, but it's worth trying based on the system type, Oracle, 
and the symptoms.

This is a dynamic change.  The support person says we can easily toggle it 
back with no service interruption.
The client is not buying that, and I was just wondering what experience anyone 
else has had with T5220s?

Support said they did this mostly for SAP, and while we run a number of SAP 
systems, none are on this server, which I would categorize as relatively 
lightly loaded.
Prod is far busier during the nightly batch window.  Scheduled stats run well 
prior to that, for 3-13 minutes.

The server and database have been up close to 2 years, and they just noticed 
these processes running longer about 6 weeks ago.
They put a new release in TEST, but claim the problem started just prior to 
that.  Not refuting that.

Thanks for any ideas, suggestions, experiences.

  Larry
