Thanks Hagan;

We did some quick tests and found one open issue:

The database failed over as planned (by shutdown abort on the old primary),
pretty quick indeed; that's good! However:

*Without shutting down the old primary database listener, the application
server still talks to the old primary database and gets stuck;*

Once we shut down the old primary listener, it connects correctly.
(We were simulating an Oracle crash with the host still up; if the host is
down, it should work fine.)

Client TNSNAMES.ORA:

x86 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = qadb121)(PORT = 1999))
      (CONNECT_DATA =
        (SERVICE_NAME = xfan)
        (SERVER = DEDICATED)
      )
    )
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = qadb120)(PORT = 1999))
      (CONNECT_DATA =
        (SERVICE_NAME = xfan)
        (SERVER = DEDICATED)
      )
    )
  )

Server A: qadb121; old primary --> new standby
Server B: qadb120; old standby --> new primary
Dataguard internal communication uses port 1600;

We do not plan to switch IP/DNS (based on best practise from various
whitepapers);

Any experience on how to work around the problem? I believe this is also a
typical case we have to go through to deploy this in production;

I tried using a trigger, but it didn't work:

create trigger FSFO after db_role_change on database
declare
  v_role varchar(30);
begin
  select database_role into v_role from v$database;
  if v_role = 'PRIMARY' then
    DBMS_SERVICE.START_SERVICE('XFAN');
  else
    DBMS_SERVICE.STOP_SERVICE('XFAN');
  end if;
end;
/

Thx much!

On Fri, Oct 1, 2010 at 12:19 AM, Zhu,Chao <zhuchao@xxxxxxxxx> wrote:

> This is very good and detailed production experience. Really appreciate
> your comments/sharing!!!
>
> I will read it carefully, discuss it with team members, and come back later
> on this topic;
>
> We have several hundred databases running dataguard without FSFO, and some
> of them are very busy as well; if this turns out to be a good case, we can
> start with smaller systems and gain experience slowly;
>
> Thx
>
>
> On Thu, Sep 30, 2010 at 7:39 AM, Craig Hagan <hagan@xxxxxxx> wrote:
>
>>
>> 2010/9/30 Zhu,Chao <zhuchao@xxxxxxxxx>
>>
>>>
>>> So we have a few questions regarding this:
>>> 1. We already have dataguard configured for most of our databases
>>> (10.2.0.3/4); now we want to use dataguard FSFO. Is this part of the
>>> dataguard license, or do we need to pay extra for it?
>>>
>>
>> I'm not sure how the licensing works; this would be a question for your
>> Oracle sales rep.
>>
>>
>>> 2. Is it production-mature already (it came out in 10.2, I believe)? We
>>> plan to use it on 11g databases only (11.2 and 11.1.0.7); clustering is
>>> something the typical DBA is not familiar with (compared with VSC type of
>>> HA for Unix guys)
>>>
>>
>> I've been using fast start failover in production at a name site with
>> large volumes of traffic since 10.2.0.2. As long as you configure it
>> correctly and have the latest DG megapatch, you should be fine.
>>
>>
>>> 3. How does it work in real-life production? Is any company widely using
>>> it? I saw notes from an Amazon DBA at
>>> http://www.nocoug.org/download/2009-05/DBA%27s_Guide_to_Physical_Dataguard_II.pptx
>>> talking about FSFO; not sure about their real-life experience running
>>> that kind of solution;
>>>
>>
>> I know Ahbid, and run systems similar to his.
>>
>> First off, some background as to how I've seen it run:
>>
>> 1) primary/standby are physically distant (different datacenters, but
>> fairly close geographically; speed of light/network latency/bandwidth isn't
>> a concern).
>>
>> 2) primary/standby do not share storage with each other
>>
>> 3) observer systems are deliberately run in a 3rd site/datacenter, and are
>> explicitly not located in the same datacenter as either the primary or
>> standby
>>
>>
>> Given that, the single largest issue that I've seen with fast start (10.2,
>> 11.1) is misconfiguration. Even subtle errors which still allow the
>> primary/standby to be configured and fsf enabled can cause reinstatement
>> to fail after an event. I ended up building a tool to emit configurations
>> that we were happy with in production to eliminate this form of error.
>>
>> A few odds and ends from several years of use. NB: don't be scared by some
>> of these, as a lot of things have been patched/fixed by Oracle.
>>
>> * If your system generates a lot of redo, you're going to want to pay
>> attention to things like # of log archive processes and the parameter
>> max_connections (default of 1 is a bit low).
>>
>> * I've seen after a failover/reinstatement that I've occasionally had to
>> re-register log sequence 1 of the new thread on the "new" standby and/or
>> bystanders. Make sure you do this at the right time (when the standby is
>> asking for the nonexistent/next sequence from the old resetlogsid).
>>
>> * In 10.2.0.2 (there is a patch, I believe it is also in the DG
>> megapatch), I've seen quirks with flashback where it would claim to be on,
>> but not actually be generating much/any flashback logs. It's pretty obvious
>> if you run into this: if your recovery area should be 10G, and you see two
>> files of a few kilobytes and the db has been up for a few months, it
>> probably is a concern.
>>
>> * for an unplanned flip, fsf will only fail over if the primary/standby
>> can't talk to each other and the standby is synchronized and can talk with
>> the observer.
>> This means that if your primary hits an event (memory
>> pressure, certain types of hardware/os faults) that freezes/messes up the
>> db, but leaves it just sufficiently alive that the standby thinks it is up,
>> it won't fail over. The same can also result in desynchronization.
>>
>> * I've seen issues where very odd/freak network events or hardware faults
>> on the standby result in lgwr terminating the primary. This was mostly in
>> 10.2.0.2
>>
>> * for 11.x, be careful of user sessions on the standby if you're also
>> running active dataguard, as they may delay the transition from standby to
>> primary while oracle terminates those sessions.
>>
>> * DO NOT use mts sessions for dataguard, and be careful with live
>> implementations of mts on a system using DG; you can really piss off the
>> broker, fast start, and DG. otoh, it is pretty easy to fix this on the fly,
>> too. It is much easier to explicitly specify dedicated sessions for the
>> tnsnames entries used for your broker sessions to prevent this sort of
>> silliness.
>>
>> * if you run into odd things, you may want to seriously consider
>> rebuilding your broker configuration. Do make sure that all standby systems
>> have been reinstated before doing this.
>>
>> * Don't play games with standby dbs -- by that, I mean rebuilding a broker
>> config and tossing in a new controlfile to work around a failed
>> reinstatement. Either rebuild the standby from backup, or work with support
>> to make sure that your actions truly are safe and won't result in an
>> ORA-03020 or worse later on.
>>
>> * If you have a complicated network, make sure that the
>> FastStartFailoverThreshold is a bit longer than the time it takes spanning
>> tree to recompute (work with your network engineers on this). You probably
>> don't want a switch reconfiguration which will resolve itself in 5-45
>> seconds to trip a failover which will take that time plus additional time
>> for the other side to finish the failover.
>>
>> * failed/aborted failovers can be annoying to clean up :)
>>
>> * user initiated failovers in 11.x are cool; just remember to restart and
>> reinstate the old primary.
>>
>>
>> -- craig
>> .- ... . -.-. .-. . - -- . ... ... .- --. .
>>
>> Craig I. Hagan
>> hagan(at)cih.com
>>
>> "Tout ce qui est exagéré est insignifiant.": ("All that is exaggerated
>> is insignificant.")
>>
>> Talleyrand
>>
>
>
> --
> Regards
> Zhu Chao
>

--
Regards
Zhu Chao