RE: RAC on OCFS2 acceptance testing

  • From: "Kevin Closson" <kevinc@xxxxxxxxxxxxx>
  • To: "freelists" <oracle-l@xxxxxxxxxxxxx>
  • Date: Thu, 28 Dec 2006 08:55:27 -0800

 >>>
>>>>>> 4. FO: Cascading failures [?]
>>>>> yes
>>>
>>>Could you elaborate? What kind of realistic cascading 
>>>failure scenarios would you recommend?

Tail the ocssd logs and, while CRS is dealing with one failure,
manually inject another on a different node. Take your pick.
For instance, inject the loss of a connectivity path, wait until
CRS is dealing with that, then sever the interconnect from the
server that is becoming the CRS master, and so on. Remember,
CRS is a master-slave architecture. Be creative. Be ugly. Save
yourself future headaches by finding issues now.
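
A minimal sketch of that "watch the log, then inject the next failure"
loop, in Python. Everything named here is an assumption for
illustration: the trigger pattern is hypothetical (use whatever your
ocssd log actually prints while CRS is reconfiguring), and the inject
callback stands in for whatever fault you choose to cause on the other
node.

```python
import re
from typing import Callable, Iterable


def inject_on_trigger(
    lines: Iterable[str],
    trigger: str,
    inject: Callable[[str], None],
) -> bool:
    """Call inject(line) once, on the first line matching the trigger regex.

    `lines` is any iterable of log lines (hypothetical source: the stdout of
    a `tail -F` on the ocssd log). Returns True if the second fault was
    injected, False if the input ended without a match.
    """
    pattern = re.compile(trigger)
    for line in lines:
        if pattern.search(line):
            # CRS is busy with the first failure: fire the second one now.
            inject(line)
            return True
    return False
```

In a real test, lines would be streamed from the live log and inject
would run the command that severs the interconnect (or whatever you
picked) on the second node. Both are left to you; nothing here is a
CRS interface.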

>>>
>>>---
>>>
>>>Given limited timeframe I'll stick to just the functional 
>>>tests, but I can submit a proposal 

...usually the case. As I routinely point out, there are
very few Oracle shops with enough manpower to actually do
clustered Oracle right. The self-managed database thing and
the clustered thing are not really complementary.
>>>
>>>The question that will obviously be asked is how relevant 
>>>all these wild scenarios (e.g. "dd(1) loop to /dev/null 
>>>using absurdly large values assigned to the ibs argument") 
>>>to the application at hand? Fencing, split-brain and other 
>>>fascinating problems might be quite real in some cases, but 
>>>are they for this specific app? 

... the wild scenarios I describe simulate the situations
servers can get into when things go wrong. Unless you have
those application bugs sitting around as a stimulus, how else
do you create the kind of memory starvation that can happen
with simple bugs like stack recursion? The point is that RAC
is supposed to help you survive a node being overloaded to the
point of being "ill". Prove it.
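
If you do not have a handy application bug, a deliberate memory hog is
one stand-in stimulus. The following is a sketch in Python, not the
dd(1) approach quoted above: it allocates and fills memory in chunks up
to a cap that you choose. The function name and parameters are made up
for this example.

```python
def hog_memory(total_mb: int, chunk_mb: int = 64) -> list:
    """Allocate about total_mb MiB in chunk_mb-sized pieces and hold them.

    Each chunk is filled with non-zero bytes so the pages are actually
    written, not just reserved. Keep the returned list alive for as long
    as the node should stay under memory pressure.
    """
    chunks = []
    allocated = 0
    while allocated < total_mb:
        size = min(chunk_mb, total_mb - allocated)
        # bytearray repetition writes every byte of the new buffer.
        chunks.append(bytearray(b"\x01") * (size * 1024 * 1024))
        allocated += size
    return chunks
```

Pick total_mb relative to the RAM and swap on the node under test, and
run it only on a box you are prepared to see evicted or rebooted.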

>>>Earlier this year, somebody mentioned (if I understood
>>>correctly) that there are problems managing a 2-node RAC 
>>>deployed on OCFS2 hosted by SLES9 boxes due to the lack of 
>>>quorum and the quality of the OCFS2 code.
>>>What's the likelihood of this happening though?

It is not theoretical. It is fact. Only you can tell us
whether it is OK to lose an entire cluster to an OCFS2
split brain just because you lost one of the 2 nodes. Sort
of the antithesis of what you paid for, isn't it?
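
To make the "lack of quorum" point concrete, here is the arithmetic for
generic strict-majority voting. This is only an illustration of why two
nodes are awkward; it is an assumption for the example, not a
description of what OCFS2 or CRS actually implements.

```python
def has_quorum(alive: int, total: int) -> bool:
    """Generic strict-majority rule: more than half of all nodes must be up."""
    return alive > total // 2


# With 2 nodes, one survivor is exactly half, which is not a majority.
two_node_survivor = has_quorum(1, 2)
# With 3 nodes, two survivors still form a majority.
three_node_survivors = has_quorum(2, 3)
```

Under that rule a 2-node cluster cannot lose either node and still hold
a majority, which is why a tie-breaker of some kind is needed.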


>>>BTW, is your "Database Utility for RAC" available only to 
>>>the Polyserve customers? Does it work with OCFS2?

It is the PolyServe database product. It predates OCFS
and is a super, super-set of OCFS, so no, it doesn't
work with it.

There are people who are happy with OCFS. I also think there
are a significant number of people who have never been through
an Oracle license audit. Those two points are joined at the hip.
--
//www.freelists.org/webpage/oracle-l

