Re: RAC Vs Standby Database between Primary and Secondary Data Centers

  • From: Andrey Kriushin <Andrey.Kriushin@xxxxxxxx>
  • To: dannorris@xxxxxxxxxxxxx
  • Date: Tue, 22 Jan 2008 22:53:55 +0300


Hi,
comments inline

--Andrey

Dan Norris wrote:
Dick,

Here's where I think we need to make clear what defines "high availability" versus what becomes "disaster recovery". Many sites want/need both. In my dictionary, I define high availability as a system that can tolerate a failure of a single component without affecting the application availability. There's also "fault tolerance", but that starts to get into a whole other world, so let's put that out of scope for now.
IMHO, mentioning "fault tolerance" (FT) is very appropriate, because of a widespread misconception among those who are new to Oracle RAC, or not informed enough to resist the marketroid's push: HA is often read as "FT". That is, many believe that a long-running query, or a batch job which modifies data but doesn't make its own save points (not to be confused with the SAVEPOINT SQL command :-)), would just continue its work from the point of failure after the failover to the surviving node. It won't - the failed session's uncommitted work is rolled back, and the application has to reconnect and redo or resume it itself.
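To make that concrete: the batch job has to implement its own restartability, e.g. by committing in chunks and recording its progress, so that after a failover it can simply be re-run and pick up where it left off. A minimal PL/SQL sketch (the orders and batch_progress tables, and all names in it, are made up for illustration):

  DECLARE
    v_last_id  batch_progress.last_id%TYPE;
    v_max_id   orders.order_id%TYPE;
    c_chunk    CONSTANT PLS_INTEGER := 10000;
  BEGIN
    -- where did the previous (possibly killed) run stop?
    SELECT last_id INTO v_last_id
      FROM batch_progress
     WHERE job_name = 'MY_BATCH';

    SELECT MAX(order_id) INTO v_max_id FROM orders;

    WHILE v_last_id < v_max_id LOOP
      UPDATE orders
         SET status = 'PROCESSED'
       WHERE order_id >  v_last_id
         AND order_id <= v_last_id + c_chunk;

      v_last_id := v_last_id + c_chunk;

      -- record the progress in the same transaction as the work...
      UPDATE batch_progress
         SET last_id = v_last_id
       WHERE job_name = 'MY_BATCH';

      COMMIT;  -- ...and this commit is the application-level "save point"
    END LOOP;
  END;
  /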

... skipped
As another poster mentioned, RAC does have some support for "stretch clusters", but they are not widely used and the MAA still recommends a standby database in combination with RAC (at least the last time I read it).

The terminology is not very stable here ("stretch clusters"). What is definite is that the nodes of the cluster are distributed among several (at least two) data centres and each data centre has its own storage. Usually people consider two configurations:

1. (Most commonly used) There is synchronous replication of disk blocks between the storages via the hardware capabilities of the storage arrays, and all the nodes of a particular site work with the local storage (see the sketch after this list). In this case there is one cluster-critical point - the quorum disk (if it is used by the underlying clusterware). When an entire data centre fails, or just its storage does, then either the nodes of that centre are considered dead until the failure is resolved, or they switch to the storage of the other data centre as just another "local" storage (if the proper capability exists).

2. All nodes of the stretch cluster use only one storage, at a chosen data centre; to the nodes at any site that storage looks "local". The other storages contain the standby(-ies), maintained by Data Guard, by synchronous replication in the storage HW, or by a combination of synchronous (for critical files) and asynchronous (for datafiles) replication in the storage HW. I've also heard of the H.A.R.D. initiative, though I have no practical experience and no good docs on it. I would be interested if experienced colleagues could point me to the right docs.
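To give a flavour of "each site reads from its local storage" in configuration 1: one possible implementation (host-based ASM mirroring in 11g, instead of array-based replication; the diskgroup, failure group and path names are made up) could look like

  -- Diskgroup DATA is mirrored across two failure groups, one per site
  CREATE DISKGROUP data NORMAL REDUNDANCY
    FAILGROUP site_a DISK '/dev/mapper/siteA_*'
    FAILGROUP site_b DISK '/dev/mapper/siteB_*';

  -- Each ASM instance prefers to read the copy in its own data centre
  ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.SITE_A'
    SID = '+ASM1';  -- instance at site A
  ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.SITE_B'
    SID = '+ASM2';  -- instance at site B

Writes still go to both failure groups (that is the synchronous-replication cost); only the reads are localized.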

The first configuration of "stretch cluster" gives higher availability, as the second one requires some manual steps on failover (redefinition of symbolic links etc.). The second is usually cheaper. The first configuration might also provide better throughput - more storage arrays running in parallel - especially when the modification rate is moderate.

Jared also mentioned human error... Well... Uhgg... That can be better tolerated with Data Guard, thanks to its ability to delay the applying of archived logs on the standby.
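For reference, a minimal sketch of such a delay (the standby service name stby_db is made up; DELAY is in minutes):

  -- Primary: ship redo, but have the standby wait 60 minutes before applying it
  ALTER SYSTEM SET log_archive_dest_2 = 'SERVICE=stby_db DELAY=60' SCOPE=BOTH;

  -- Standby: managed recovery honours the delay by default...
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
  -- ...and can bypass it in an emergency:
  -- ALTER DATABASE RECOVER MANAGED STANDBY DATABASE NODELAY;

So an erroneous DROP or mass UPDATE that slips through on the primary has not yet been applied on the standby, and there is a window in which to stop the apply and rescue the data.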
