Re: A few questions regarding Dataguard Faststart Failover

From: Craig Hagan <hagan@xxxxxxx>
To: zhuchao@xxxxxxxxx
Date: Thu, 30 Sep 2010 10:39:04 -0400
2010/9/30 Zhu,Chao <zhuchao@xxxxxxxxx>

>
> So we have a few questions regarding this:
> 1. We already have dataguard configured for most of our database (
> 10.2.0.3/4); Now we want to use dataguard FSFO; Is this part of the
> dataguard license and do we need to pay extra for that?
>
>
I'm not sure how the licensing works; this would be a question for your
Oracle sales rep.


> 2. Is the production mature already(it come out in 10.2 i believe); We plan
> to use it on 11g database only (11.2 and 11.1.0.7);  Clustering is something
> typical DBA not familiar with(compared with VSC type of HA  for Unix guys)
>
>

I've been using fast start failover in production at a name site with large
volumes of traffic since 10.2.0.2. As long as you configure it correctly and
have the latest DG megapatch, you should be fine.


> 3 . How does it work in real-life production? Any company widely using it?
> I saw notes from a Amazon DBA on
> http://www.nocoug.org/download/2009-05/DBA%27s_Guide_to_Physical_Dataguard_II.pptx
> talking about FSFO; Not sure about their real-life experience running that
> kind of solution;
>
>
I know Ahbid, and run systems similar to his.

First off, some background on how I've seen it run:

1) primary/standby are in different datacenters, but close enough
geographically that speed of light/network latency/bandwidth isn't a
concern.

2) primary/standby do not share storage with each other

3) observer systems are deliberately run in a 3rd site/datacenter, and are
explicitly not located in the same datacenter as either the primary or
standby


Given that, the single largest issue that I've seen with fast start (10.2,
11.1) is misconfiguration. Even subtle errors that still allow the
primary/standby to be configured and fsf to be enabled can cause
reinstatement to fail after an event. I ended up building a tool to emit
configurations that we were happy with in production to eliminate this form
of error.
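
As a rough illustration of emitting configurations rather than typing them
by hand, here is a minimal Python sketch that builds a DGMGRL script for one
primary/standby pair. It is not the tool mentioned above: the class and
function names, the database names, and the choice of SYNC transport with
MaxAvailability are all assumptions for illustration, and the statements and
their ordering should be checked against the broker documentation for your
release before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    name: str                # hypothetical DB_UNIQUE_NAME known to the broker
    connect_identifier: str  # hypothetical tnsnames alias used by the broker


def emit_dgmgrl_script(config_name, primary, standby, threshold_seconds=30):
    """Return DGMGRL statements for a two-database fast-start failover setup.

    Both databases get the same transport setting and name each other as
    failover target, so the settings still make sense after a role change.
    """
    if primary.name == standby.name:
        raise ValueError("primary and standby must be different databases")
    if threshold_seconds <= 0:
        raise ValueError("threshold_seconds must be positive")

    lines = [
        f"CREATE CONFIGURATION '{config_name}' AS PRIMARY DATABASE IS "
        f"'{primary.name}' CONNECT IDENTIFIER IS {primary.connect_identifier};",
        f"ADD DATABASE '{standby.name}' AS CONNECT IDENTIFIER IS "
        f"{standby.connect_identifier} MAINTAINED AS PHYSICAL;",
    ]
    # Apply identical properties in both directions (assumed policy).
    for db, target in ((primary, standby), (standby, primary)):
        lines.append(
            f"EDIT DATABASE '{db.name}' SET PROPERTY LogXptMode='SYNC';"
        )
        lines.append(
            f"EDIT DATABASE '{db.name}' SET PROPERTY "
            f"FastStartFailoverTarget='{target.name}';"
        )
    lines += [
        "EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;",
        "EDIT CONFIGURATION SET PROPERTY "
        f"FastStartFailoverThreshold={threshold_seconds};",
        "ENABLE CONFIGURATION;",
        "ENABLE FAST_START FAILOVER;",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_dgmgrl_script(
        "demo_cfg",
        Database("prod_a", "prod_a_dgb"),
        Database("prod_b", "prod_b_dgb"),
    ))
```

Generating both directions from one description is the point of the sketch:
a setting present on only one side is the kind of subtle error that stays
hidden until the roles flip.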

A few odds and ends from several years of use. NB: don't be scared by some
of these, as a lot of things have been patched/fixed by Oracle.

* If your system generates a lot of redo, you're going to want to pay
attention to things like # of log archive processes and the parameter
max_connections (default of 1 is a bit low).

* After a failover/reinstatement, I've occasionally had to re-register log
sequence 1 of the new thread on the "new" standby and/or bystanders. Make
sure you do this at the right time (when the standby is asking for the
nonexistent/next sequence from the old resetlogsid).

* In 10.2.0.2 (there is a patch; I believe it is also in the DG
megapatch), I've seen quirks with flashback where it would claim to be on,
but not actually be generating much/any flashback logs. It's pretty obvious
if you run into this: if your recovery area should be 10G, and you see two
files of a few kilobytes and the db has been up for a few months, it
probably is a concern.

* for an unplanned flip, fsf will only fail over if the primary/standby
can't talk to each other and the standby is synchronized and can talk with
the observer. This means that if your primary hits an event (memory
pressure, certain types of hardware/os faults) that freezes/messes up the
db, but leaves it just sufficiently alive that the standby thinks it is up,
it won't fail over. The same can also result in desynchronization.

* I've seen issues where very odd/freak network events or hardware faults on
the standby result in lgwr terminating the primary. This was mostly in
10.2.0.2

* for 11.x, be careful of user sessions on the standby if you're also
running active dataguard as they may delay the transition from standby to
primary as oracle terminates those sessions.

* DO NOT use mts sessions for dataguard, and be careful with live
implementations of mts on a system using DG: you can really piss off the
broker, fast start, and DG. otoh, it is pretty easy to fix this on the fly,
too. It is much easier to explicitly specify dedicated sessions for the
tnsnames entries used for your broker sessions to prevent this sort of
silliness.

* if you run into odd things, you may want to seriously consider rebuilding
your broker configuration. Do make sure that all standby systems have been
reinstated before doing this.

* Don't play games with standby dbs -- by that, I mean rebuilding a broker
config and tossing in a new controlfile to work around a failed
reinstatement. Either rebuild the standby from backup, or work with support
to make sure that your actions truly are safe and won't result in an
ORA-03020 or worse later on.

* If you have a complicated network, make sure that the
FastStartFailoverThreshold is a bit longer than the time it takes spanning
tree to recompute (work with your network engineers on this). You probably
don't want a switch reconfiguration that will resolve itself in 5-45 seconds
to trip a failover, which will take that time plus additional time for the
other side to finish the failover.

* failed/aborted failovers can be annoying to clean up :)

* user initiated failovers in 11.x are cool; just remember to restart and
reinstate the old primary.
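
The unplanned-flip condition and the threshold advice above can be put
together in a small Python sketch. This is a simplified model of the
behaviour described in this message, not Oracle's implementation; the class,
field, and function names are hypothetical, and the real observer/standby
protocol has more states than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyView:
    """What the standby can see at one moment (hypothetical model)."""
    reaches_primary: bool          # standby can still talk to the primary
    synchronized: bool             # standby had all redo when contact was lost
    reaches_observer: bool         # standby can talk to the observer
    seconds_without_primary: float # how long the primary has been unreachable


def should_fail_over(view, threshold_seconds):
    """Return True only when every condition for an unplanned flip holds."""
    if view.reaches_primary:
        # A hung primary that still answers lands here: no failover.
        return False
    if not view.synchronized:
        return False
    if not view.reaches_observer:
        return False
    # Ride out short outages such as a spanning tree recompute.
    return view.seconds_without_primary >= threshold_seconds


if __name__ == "__main__":
    outage = StandbyView(False, True, True, 60.0)
    hung_primary = StandbyView(True, True, True, 0.0)
    print(should_fail_over(outage, 50))        # real outage past the threshold
    print(should_fail_over(hung_primary, 50))  # primary frozen but reachable
```

The first early return is the case to keep in mind: a primary that is frozen
but still reachable never satisfies the condition, so it needs separate
monitoring.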



-- craig
          .-    ... . -.-. .-. . -    -- . ... ... .- --. .

                            Craig I. Hagan
                           hagan(at)cih.com

    "Tout ce qui est exagéré est insignifiant.": ("All that is exaggerated
is insignificant.")

                            Talleyrand