Re: A few questions regarding Dataguard Faststart Failover

  • From: "Zhu,Chao" <zhuchao@xxxxxxxxx>
  • To: Craig Hagan <hagan@xxxxxxx>
  • Date: Sat, 2 Oct 2010 00:24:15 +0800

Thanks Hagan;
We did a quick test and found one open issue:
The database failed over as planned (we ran shutdown abort on the old
primary), and pretty quickly indeed; that's good!
However:
*Unless we shut down the listener on the old primary, the application server
still talks to the old primary database and gets stuck;* once we shut down
the old primary listener, it connects correctly;
(we were simulating an Oracle crash with the host still up; if the host is
down then it should work fine;)

client TNSNAMES.ORA:
x86=
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = qadb121)(PORT = 1999))
      (CONNECT_DATA =
        (SERVICE_NAME = xfan)
        (SERVER = DEDICATED)
      )
    )
    (DESCRIPTION =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = qadb120)(PORT = 1999))
      (CONNECT_DATA =
        (SERVICE_NAME = xfan)
        (SERVER = DEDICATED)
      )
    )
  )


Server A: qadb121;  old primary -->new standby
Server B: qadb120;  old standby -->new primary
 Dataguard internal communication using port 1600;
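
To confirm the roles above after a failover test, something like the
following could be run on each host. This is only a sketch: it assumes a
privileged session (for example SYSDBA) and a release of 10.2 or later, where
the FS_FAILOVER_* columns exist in v$database.

```sql
-- Run on both qadb121 and qadb120 after the failover.
-- Shows the current role and the fast-start failover state as seen locally.
select db_unique_name,
       database_role,
       open_mode,
       fs_failover_status,
       fs_failover_current_target,
       fs_failover_observer_present
from   v$database;

-- Lists the services this instance is currently offering; only works once
-- the instance is at least mounted.
select name
from   v$active_services
order  by name;
```

If xfan still shows up as an active service on the side that is no longer
primary, that would be one thing to look at, though it may not be the whole
story.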

We do not plan to switch IP/DNS (based on best practise from various
whitepapers);

Any experience on how to work around the problem? I believe this is also a
typical case we have to go through before deploying this in production;

I tried using a trigger, but it didn't work;
-- Start the XFAN service on the new primary and stop it elsewhere
-- whenever the database role changes.
create or replace trigger FSFO
after db_role_change on database
declare
  v_role varchar2(30);
begin
  select database_role into v_role from v$database;
  if v_role = 'PRIMARY' then
    DBMS_SERVICE.START_SERVICE('XFAN');
  else
    DBMS_SERVICE.STOP_SERVICE('XFAN');
  end if;
end;
/
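
One gap with a trigger like this is that db_role_change does not fire on an
ordinary restart, so a restarted database may still need its service state
set at startup. Below is a minimal sketch of a companion startup trigger. The
trigger name FSFO_STARTUP is hypothetical, and it assumes XFAN was created
with DBMS_SERVICE.CREATE_SERVICE rather than listed in the service_names
parameter. It has not been tested against the setup above, and it may not by
itself explain why clients hang on the old primary's listener.

```sql
-- Hypothetical companion trigger (name FSFO_STARTUP is an assumption).
-- Fires when the database is opened and starts XFAN only if this
-- database currently holds the primary role.
create or replace trigger FSFO_STARTUP
after startup on database
declare
  v_role varchar2(30);
begin
  select database_role into v_role from v$database;
  if v_role = 'PRIMARY' then
    DBMS_SERVICE.START_SERVICE('XFAN');
  end if;
end;
/
```

Both triggers would need to exist on the primary so that they reach the
standby through redo apply.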
Thx much!

On Fri, Oct 1, 2010 at 12:19 AM, Zhu,Chao <zhuchao@xxxxxxxxx> wrote:

> This is very good and detailed production experience, Really appreciate
> your comments/sharing!!!
>
> I will read it carefully and discuss with team member and come back later
> on this topic;
>
> We have several hundred databases running dataguard without FSFO, some of
> them are very busy as well; If this turns out to be a good case we can start
> with smaller systems and gain experience slowly;
>
> Thx
>
>
> On Thu, Sep 30, 2010 at 7:39 AM, Craig Hagan <hagan@xxxxxxx> wrote:
>
>>
>>
>> 2010/9/30 Zhu,Chao <zhuchao@xxxxxxxxx>
>>
>>>
>>> So we have a few questions regarding this:
>>> 1. We already have dataguard configured for most of our database (
>>> 10.2.0.3/4); Now we want to use dataguard FSFO; Is this part of the
>>> dataguard license and do we need to pay extra for that?
>>>
>>>
>> I'm not sure how the licensing works, this would be a question for your
>> oracle sales rep.
>>
>>
>>> 2. Is the product mature already (it came out in 10.2, I believe)? We
>>> plan to use it on 11g databases only (11.2 and 11.1.0.7);  Clustering is
>>> something the typical DBA is not familiar with (compared with VSC type of
>>> HA for Unix guys)
>>>
>>>
>>
>> I've been using fast start failover in production at a name site with
>> large volumes of traffic since 10.2.0.2. As long as you configure it
>> correctly and have the latest DG megapatch, you should be fine.
>>
>>
>>> 3 . How does it work in real-life production? Any company widely using
>>> it? I saw notes from an Amazon DBA on
>>> http://www.nocoug.org/download/2009-05/DBA%27s_Guide_to_Physical_Dataguard_II.pptx
>>> talking about FSFO; Not sure about their real-life experience running that
>>> kind of solution;
>>>
>>>
>> I know Ahbid, and run systems similar to his.
>>
>> First off some background as to how I've seen it run:
>>
>> 1) primary/standby are physically distant (different datacenters, but
>> fairly close geographically, speed of light/network latency/bandwidth isn't
>> a concern).
>>
>> 2) primary/standby do not share storage with each other
>>
>> 3) observer systems are deliberately run in a 3rd site/datacenter, and are
>> explicitly not located in the same datacenter as either the primary or
>> standby
>>
>>
>> Given that, the single largest issue that I've seen with fast start (10.2,
>> 11.1) is misconfiguration. Even subtle errors which still allow the
>> primary/standby to be configured and fsf enabled can cause reinstatement
>> to fail after an event. I ended up building a tool to emit configurations
>> that we were happy with in production to eliminate this form of error.
>>
>> A few odds and ends from several years of use, nb: don't be scared by some
>> of these as a lot of things have been patched/fixed by oracle.
>>
>> * If your system generates a lot of redo, you're going to want to pay
>> attention to things like # of log archive processes and the parameter
>> max_connections (default of 1 is a bit low).
>>
>> * I've seen after a failover/reinstatement that I've occasionally had to
>> re-register log sequence 1 of the new thread on the "new" standby and/or
>> bystanders, make sure you do this at the right time (when the standby is
>> asking for the nonexistent/next sequence from the old resetlogsid).
>>
>> * In 10.2.0.2 (there is a patch, I believe it is also in the DG
>> megapatch), I've seen quirks with flashback where it would claim to be on,
>> but not actually be generating much/any flashback logs. It's pretty obvious
>> if you run into this: if your recovery area should be 10G, and you see two
>> files for a few kilobytes and the db has been up for a few months, it
>> probably is a concern.
>>
>> * for an unplanned flip, fsf will only fail over if the primary/standby
>> can't talk to each other and the standby is synchronized and can talk with
>> the observer. this means that if your primary hits an event (memory
>> pressure, certain types of hardware/os faults) that freezes/messes up the
>> db, but leaves it just sufficiently alive that the standby thinks it is up,
>> it won't fail over. The same can also result in desynchronization.
>>
>> * I've seen issues where very odd/freak network events or hardware faults
>> on the standby result in lgwr terminating the primary. This was mostly in
>> 10.2.0.2
>>
>> * for 11.x, be careful of user sessions on the standby if you're also
>> running active dataguard as they may delay the transition from standby to
>> primary as oracle terminates those sessions.
>>
>> * DO NOT use mts sessions for dataguard, and be careful with live
>> implementations of mts on a system using DG, you can really piss off the
>> broker/fast start/and DG. otoh, it is pretty easy to fix this on the fly,
>> too. much easier to explicitly specify dedicated sessions for the tnsnames
>> entries used for your broker sessions to prevent this sort of silliness.
>>
>> * if you run into odd things, you may want to seriously consider
>> rebuilding your broker configuration, do make sure that all standby systems
>> have been reinstated before doing this.
>>
>> * Don't play games with standby dbs -- by that, I mean rebuilding a broker
>> config and tossing in a new controlfile to work around a failed
>> re-instatement. Either rebuild the standby from backup, or work with support
>> to make sure that your actions truly are safe and won't result in a
>> ORA-03020 or worse later on.
>>
>> * If you have a complicated network, make sure that the
>> FastStartFailoverThreshold is a bit longer than the time it takes spanning
>> tree to recompute (work with your network engineers on this). You probably
>> don't want a switch reconfiguration which will resolve itself in 5-45 seconds
>> to trip a failover which will take that time plus additional time for the
>> other side to finish the failover.
>>
>> * failed/aborted failovers can be annoying to clean up :)
>>
>> * user initiated failovers in 11.x are cool; just remember to restart and
>> reinstate the old primary.
>>
>>
>>
>> -- craig
>>           .-    ... . -.-. .-. . -    -- . ... ... .- --. .
>>
>>                             Craig I. Hagan
>>                            hagan(at)cih.com
>>
>>     "Tout ce qui est exagéré est insignifiant.": ("All that is exaggerated
>> is insignificant.")
>>
>>                             Talleyrand
>>
>>
>
>
> --
> Regards
> Zhu Chao
>
>
>


-- 
Regards
Zhu Chao
