RAC Cluster - 100% cpu on all nodes
- From: Steve Perry <sperry@xxxxxxxxxxx>
- To: oracle-l <oracle-l@xxxxxxxxxxxxx>
- Date: Mon, 12 Jun 2006 19:11:48 -0500
I got called today about one of our RAC clusters (RHEL 4, 2 CPUs, 8GB
RAM, 10.2.0.1, 32-bit, ASM 2, EMC CLARiiON CX700 storage, dual QLogic
HBAs) that was locked up.
The CPU on both nodes was at 100%. It took several minutes to log in,
and I could never get into sqlplus (10-15 minutes waiting).
I also tried to shut it down with srvctl, but it didn't respond.
IO was near zero, which makes sense: the CPU starved all other resources.
No errors in the alert.logs (both nodes) for either ASM or the
instances, just a gap in the entries from 10am to 2pm (reboot).
No new trace files.
Nothing significant in the ka-zillion logs in the clusterware home.
No errors in /var/log/messages.
While doing a ps -ef, I saw 20+ processes of:
Some were owned by root and some by oracle, and each one took about 5%
They didn't want to wait for a diagnosis, so they said to reboot them.
It came up fine, but after the reboot there was only one of the
processes mentioned above.
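
For anyone wanting to capture that kind of picture (many same-named processes, split between owners, each taking a slice of CPU) before a reboot wipes it out, here is a minimal sketch in Python. It assumes the text comes from `ps -eo user,pcpu,comm` on Linux, saved to a file or pasted in. The function name `summarize` is made up for illustration, and it needs a Python 3 interpreter, which may have to live on another box.

```python
from collections import defaultdict


def summarize(ps_text):
    """Group `ps -eo user,pcpu,comm` output by (user, command).

    Returns a list of (user, command, process_count, total_cpu_percent)
    tuples, sorted by total CPU, highest first.
    """
    totals = defaultdict(lambda: [0, 0.0])  # (user, comm) -> [count, cpu]
    for line in ps_text.strip().splitlines():
        parts = line.split(None, 2)  # user, %cpu, rest of line = command
        if len(parts) < 3:
            continue
        user, pcpu, comm = parts
        try:
            cpu = float(pcpu)
        except ValueError:
            continue  # skips the "USER %CPU COMMAND" header line
        entry = totals[(user, comm.strip())]
        entry[0] += 1
        entry[1] += cpu
    rows = [(u, c, n, round(cpu, 1)) for (u, c), (n, cpu) in totals.items()]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

Calling `summarize(open("ps.out").read())` on a saved capture (the file name is only an example) gives one row per owner and command, so 20+ copies of one process show up as a line or two instead of a screenful.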
I ran cluvfy and it passed all the tests.
I ran the AWR reports afterward for 10am to 2pm but haven't analyzed
Has anyone else experienced this with RAC? Is there a quick hit list
of things you check when things go south?
I'm pretty methodical and started checking the standard things, but
that wasn't fast enough for these folks.
What do you check when all nodes of a RAC cluster are locked up like
I contacted support, but I don't have much hope based on my recent
P.S. I forgot to grab the sar data to see what it shows. I'll do