RE: Data Mirroring on two data centers -- How to use ASM ?

  • From: "Kevin Closson" <kevinc@xxxxxxxxxxxxx>
  • To: <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 19 May 2006 10:23:06 -0700

>>>> when network failures occur no "third party" can choose which node 
>>>> should survive. So a manual failover is the only solution. Only a 
>>>> third site will give you enough "quorum" to provide an 

This represents a very rudimentary understanding of clusters, or
more likely a very deep understanding of very rudimentary clusters.

Two node clusters can work out proper membership and split-brain
resolution, but doing so requires sophisticated membership and fencing
mechanisms. The simple "who's got more" sort of quorum stuff is just
not robust enough. In fact, it is for this reason that SuSE has said that
2 node clusters with OCFS2 are not possible. You must have a
minimum of 3 nodes...as was the case for quite some time with
GPFS on AIX. In case anyone thinks I'm making up this bit about
quorum and fencing:

http://lists.suse.com/archive/suse-oracle/2006-Apr/0061.html
http://lists.suse.com/archive/suse-oracle/2006-Apr/0071.html
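
To make the "who's got more" point concrete, here is a minimal sketch of
a strict-majority vote count. It is only an illustration of the
arithmetic, not how OCFS2, GPFS, or any other product implements
membership. The function names (has_majority, survivors) and the
one-vote-per-node assumption are mine.

```python
from typing import List


def has_majority(votes_visible: int, cluster_size: int) -> bool:
    """Strict majority: a partition must see more than half of all votes.

    Assumes one vote per node and no tie-breaker (no quorum disk,
    no third site).
    """
    return 2 * votes_visible > cluster_size


def survivors(partition_sizes: List[int], cluster_size: int) -> List[int]:
    """Return the sizes of the partitions that are allowed to keep running."""
    return [size for size in partition_sizes
            if has_majority(size, cluster_size)]


# Two-node cluster split 1|1: neither side holds a strict majority,
# so a pure vote count leaves no survivor at all.
print(survivors([1, 1], 2))

# Three-node cluster split 2|1: the two-node side wins.
print(survivors([2, 1], 3))
```

Under these assumptions a 1|1 split yields no winner, which is why a
simple majority scheme ends up demanding a third vote. Resolving the
two-node case takes something beyond counting.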

It is a fact that most cluster membership schemes available out
there are architected poorly for the sake of first-to-market needs. Or
they carry age-old legacy implementation choices. The OCFS2 problems
cited in these suse-oracle email archives do indeed reflect bugs.
However, the architecture itself will continue to breed bugs.
Architecture choices can be "Bug Factories". Consider shared-nothing
cluster database approaches. They are bug factories too, really.
Any of the following clustering approaches is a bug factory. They are
bug factories because the architecture is not solid enough to "just
work", so layers and layers and layers of workarounds in the form of
bug fixes ensue. I won't name the products that implement the
following Achilles Heel cluster architectures, but they are out there:

        1. Persistent reservation schemes
        2. Self fencing (e.g., node has been informed it's supposed
           to die so it tries to execute the reboot command)
        3. Simple majority quorum 
        4. Central lock managers/metadata managers (SPOF, bottleneck)
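
As a toy illustration of item 2, the sketch below models why a node
that is asked to kill itself may not manage it. This is my own
simplified model, not any product's code: the Node class, the
responsive flag (standing in for a hung or wedged node), and the
external_fence function (standing in for some out-of-band action that
needs no cooperation from the victim) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    responsive: bool = True  # False models a hung or wedged node
    running: bool = True


def self_fence(node: Node) -> bool:
    """Tell the node to reboot itself.

    This only works if the node is still healthy enough to act on the
    request. Returns True if the node actually stopped.
    """
    if node.responsive:
        node.running = False
    return not node.running


def external_fence(node: Node) -> bool:
    """Hypothetical out-of-band fencing that does not rely on the victim.

    In this toy model it always succeeds.
    """
    node.running = False
    return True


hung = Node("node-b", responsive=False)
print(self_fence(hung), hung.running)      # the hung node is still up
print(external_fence(hung), hung.running)  # stopped without its help
```

In this model the hung node ignores the request and keeps running,
while the out-of-band action stops it regardless.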