Re: Weird database hanging

  • From: "Rajeev Prabhakar" <rprabha01@xxxxxxxxx>
  • To: don@xxxxxxxxx
  • Date: Fri, 21 Sep 2007 20:30:58 -0400

Don

That's interesting. I want to share our experience in case it
helps anyone.

While conducting stress tests against a two-node 10.2.0.3 RAC/ASM/
SAN-based database, we were seeing near freezes/hangs in addition to
ORA-3136 errors and IPC timeouts, followed by node evictions.

So we tried all the recommended things: we bumped up the sqlnet/listener
timeouts, sessions/processes/pga_aggregate_target, shared pool
size, etc., without any luck. The near freeze/hang still occurred once
we went beyond a particular number of concurrent database sessions. We
double-checked our OS parameters just in case, but that didn't help.

Later, we decided to increase swap space (we had observed low available
swap during these tests even when memory was available). After the
increase, the database hangs and node evictions no longer occurred, and
the load tests ran for the full allotted window. Concurrency continued
to be the #1 wait during these windows, but all our instances (db/asm)
survived the load test.
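
For anyone who wants to watch for the same symptom (low free swap while
memory is still available), here is a minimal sketch. It assumes a Linux
host that exposes /proc/meminfo. The function names and both thresholds
are hypothetical choices for illustration, not values we tuned or used.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Key:   <number> [kB]" lines.
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_pressure(info, swap_floor=0.20, mem_floor=0.10):
    """Return True when free swap is low even though memory is available.

    swap_floor and mem_floor are illustrative assumptions, not tuned values.
    """
    swap_total = info.get("SwapTotal", 0)
    mem_total = info.get("MemTotal", 0)
    if swap_total == 0 or mem_total == 0:
        # No swap configured or incomplete data: nothing to report.
        return False
    swap_free_ratio = info.get("SwapFree", 0) / swap_total
    # MemAvailable is missing on older kernels; fall back to MemFree.
    mem_available = info.get("MemAvailable", info.get("MemFree", 0))
    mem_available_ratio = mem_available / mem_total
    return swap_free_ratio < swap_floor and mem_available_ratio > mem_floor
```

To use it, read /proc/meminfo every few seconds during the load test,
pass the contents through parse_meminfo(), and log whenever
swap_pressure() returns True.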

Now, it is quite possible that we haven't fixed the root cause and
this is just a distraction giving us a temporary breather.

Anyway, if we find something later (e.g. a bug), I'll let everyone
know.

-Rajeev

On 9/21/07, Don Seiler <don@xxxxxxxxx> wrote:
>
> We *think* we have found the issue, and it isn't quite Oracle-related
> (of course).
>
> The SA had been doing a Veritas online relayout on the disk partition
> that is our archivelog destination.  He aborted it, but rather than
> aborting, Veritas left it in a "paused" state.  This happened 20
> minutes before the bulk load that caused our first instance hang.
> Note that we *were* able to archive logs, it just seemed to have
> caused some more waiting than normal.  This was compounded during bulk
> loads, and in the end caused a crush of shared pool and library cache
> latches.
>
> This situation was discovered yesterday and the times seemed all too
> coincidental.  The state was corrected and we've been happily bulk
> loading anything and everything since then.
>
> In the end, we recognize there is plenty of room for improvement in
> the application code (and horrible inefficiencies in the app database
> design), but we're quite certain that wasn't the root cause of this
> problem.  I'm still pretty upset with Oracle support over their
> blinders, their insistence that the problem was "properly diagnosed",
> and the way they ignored all of my input and feedback.
>
> Don.
>
> --
> Don Seiler
> oracle: http://ora.seiler.us
> ultimate: http://www.mufc.us
> --
> //www.freelists.org/webpage/oracle-l
