Re: Is RAC DOA?

  • From: Ray Stell <stellr@xxxxxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Tue, 17 Aug 2004 13:48:22 -0400

I tried to create as disturbing a subject line as I could to get
attention.  Thank you for all your responses, very interesting.  

There were some replies on my first query topic that led me to believe
the waters had been muddied here.  I originally posted about a public
network failure and the responses on this topic were about the private
net.  I have posted privately to those who showed an interest in this.
I thought I'd try to stir up some trouble here in hopes that a clue 
would float up to the surface for me. 

So, there are two parts here: 
I. my techie problem 
II. my Oracle Corp. TAR experience 


I. techie problem:
------------------
environment:
1. Oracle 10.1.0.2 (10.1.0.3 came out this week, I'll go there next,
   but my TAR consultant doesn't think it will be fruitful)
2. 2 - Red Hat 3.0 Intel x86 servers
3. 1 - shared disk on NetApp filer
4. Oracle clustering, CRS
5. 1 - client running the same OS and Oracle versions
6. cluster/client configured to failover and load balance across 
   the two servers
7. note that in this configuration, an instance failure (shutdown
   abort) fails over perfectly: long-running queries fail over
   immediately, and sometimes I could not even detect from the output
   display that it had happened (very cool, btw)
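
For readers unfamiliar with item 6, here is a minimal sketch of what a
client-side entry with connect-time load balancing, connect-time
failover, and TAF can look like.  It is an illustration only, not the
configuration used in this test: the alias RACDB, the hosts rac1/rac2,
the service name racdb, and the RETRIES/DELAY values are all
hypothetical.  The helper just renders tnsnames.ora-style text.

```python
def taf_descriptor(alias, hosts, service, port=1521):
    """Render a tnsnames.ora-style entry with connect-time load
    balancing, connect-time failover, and TAF for SELECT statements.
    All names passed in are placeholders supplied by the caller."""
    addresses = "\n".join(
        f"      (ADDRESS = (PROTOCOL = TCP)(HOST = {h})(PORT = {port}))"
        for h in hosts
    )
    return (
        f"{alias} =\n"
        "  (DESCRIPTION =\n"
        "    (ADDRESS_LIST =\n"
        "      (LOAD_BALANCE = ON)\n"
        "      (FAILOVER = ON)\n"
        f"{addresses}\n"
        "    )\n"
        "    (CONNECT_DATA =\n"
        f"      (SERVICE_NAME = {service})\n"
        # RETRIES and DELAY are illustrative values, not recommendations.
        "      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)"
        "(RETRIES = 20)(DELAY = 5))\n"
        "    )\n"
        "  )\n"
    )


if __name__ == "__main__":
    # Hypothetical two-node cluster.
    print(taf_descriptor("RACDB", ["rac1", "rac2"], "racdb"))
```

TYPE = SELECT is what lets an in-flight query resume on the surviving
instance, which matches the behavior described in item 7.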

test:
1. sqlplus connect to cluster
2. determine which node I get connected to
3. run long running query on that node
4. inject a network failure on server interface the client is
   connected to 
5. sessions to other node are locked up.
6. long-running query fails, ends in ORA-03113
7. watch both nodes stay locked up until network is restored

Can anyone verify that you have seen RAC recover from this sort of
failure?  Thanks for your help.


II. my Oracle Corp. experience:
-------------------------------
I opened a TAR to review the issue.  Everyone (me, the Oracle TAR
consultant, the developer he is consulting with) thought from the start
that we had a config problem and assumed it could be worked around.  This
implies to me that they believed RAC can survive public network
failures.  I can't for the life of me figure out now why we all thought
this.  I've gone back to the docs and searched.  There seems to be little
detail on what high availability means in practice.

I spent a month working the TAR over the effect of a public network
failure on the cluster.  We worked on fixing my config, since it had to
be the source of the problem.  A month later, they came back to me and
said this is a "bug" that will not be fixed, as fixing it is considered
a new feature request.  What's the bug number?  "2791912".  Not found.

Now, my TAR is not closed yet.  I have posted these questions to the
consultant:

1. Can I get the real bug number?  Can I read any of the detail?
2. Why did he think this should work when we started?
3. Why is this considered a "bug"?  Do internal Oracle people think
   this should work under some configuration?  
4. Are there other hardware/software combinations that I might try to 
   test that are not subject to this mystery bug? 
5. Does this make Linux not a viable/supported platform for HA?

No response since last Friday.  Maybe they're on vacation before the 
kids go back to school.  As one person put it, "we started
(our) company because we saw what a disaster Oracle was making out of a
pretty cool technology."  ;)  
===============================================================
Ray Stell   stellr@xxxxxx     (540) 231-4109     KE4TJC    28^D