RE: recover standby database failure

  • From: Carel-Jan Engel <careljan@xxxxxxxxxx>
  • To: mwf@xxxxxxxx
  • Date: Thu, 15 May 2008 23:03:29 +0200

At a customer site, with a Standard Edition 'scripted' archive log
shipping standby, they want a true DR test. This means: reverse roles,
run the business on the DR system for a couple of hours, and reverse
roles again. Every site should do this, but most of them don't dare.
Why spend money on a DR site if you don't trust it?

Both databases and instances have the same name. Disk layout and naming
of mountpoints are identical at primary and standby.
'Awareness' of being a primary or standby is in the control file.
Datafiles are identical at primary and standby, if recovery is
successful.
You can test the standby by opening it read only, but that doesn't allow
the business to use it.

'Activating' the standby would require re-instantiating the primary.
Given the size of the database and the available bandwidth, that is not
an option just for testing purposes.

The DR test goes along the following path:


     1. Shut down the primary (shutdown normal).
     2. Ship the last archived redo logfiles to DR.
     3. Make sure the last archived redo log files have been applied at
        the standby
     4. Make backups of all control files, parameter files, online redo
        log files (yes I wrote online redo log files) at both primary
        and standby. Maybe you can skip the ORLFs, but I haven't tested
        that.
     5. 'Swap' control files, parameter files and ORLFs between primary
        and standby. This limits the amount of data exchanged through
        the WAN to a minimum.
     6. Start the instance at the standby as if it were the primary.
        Actually, because its control file is now that of the primary,
        it is the primary.
     7. Start the instance at the primary site as if it were the
        standby. Same story for the control file.
     8. Start the listener and applications, and let the users do what
        they do when they use the system.
     9. Run the archived redo log copy scripts in the reversed direction,
        from DR to primary.
    10. After the test, go to step 1 to get back to normal.
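
Steps 4 and 5 above can be sketched roughly as follows. This is only an
illustration, not the script the local DBA wrote: it is in Python, the
two sites are represented by two local directory trees (in reality they
are separate hosts and the copy would go over the WAN), and the function
names and file paths are made up for the example. It assumes both
instances are already shut down and relies on the file layout being
identical at both sites.

```python
import shutil
from pathlib import Path


def backup_files(site_root: Path, rel_paths: list, backup_dir: Path) -> None:
    """Step 4: copy each file into backup_dir, keeping its relative path."""
    for rel in rel_paths:
        dest = backup_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(site_root / rel, dest)


def swap_identity_files(primary_root: Path, standby_root: Path,
                        rel_paths: list, backup_root: Path) -> None:
    """Steps 4 and 5: back up control files, parameter files and online
    redo logs at both sites, then exchange them between the sites.

    rel_paths are hypothetical paths relative to each site's root; the
    same relative path is used at both sites because the layout is
    identical.
    """
    primary_backup = backup_root / "primary"
    standby_backup = backup_root / "standby"

    # Step 4: take the backups first, so nothing is lost if the swap fails.
    backup_files(primary_root, rel_paths, primary_backup)
    backup_files(standby_root, rel_paths, standby_backup)

    # Step 5: restore crosswise from the backups.
    for rel in rel_paths:
        shutil.copy2(primary_backup / rel, standby_root / rel)
        shutil.copy2(standby_backup / rel, primary_root / rel)
```

Running the same swap a second time puts the original files back, which
corresponds to step 10. The datafiles are never copied; only the small
files that carry the primary or standby 'awareness' are exchanged.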


After testing the whole thing with a test database, it was scripted by
the local DBA. Now the customer has an SE archive log shipping standby
with switchover capabilities. No cloning necessary.

About this test database: I always keep a test database, as small as
possible, at every production system with an HA setup, just to be able
to test all infrastructure components involved. This test database has
a standby as well. It is useful for training, testing firewall settings
and other LAN/WAN issues, testing anything else regarding the HA setup,
and gaining experience and self-confidence.

Best regards,

Carel-Jan Engel

===
If you think education is expensive, try ignorance. (Derek Bok)
===

On Thu, 2008-05-15 at 11:24 -0400, Mark W. Farnham wrote:

> Most likely the operation of opening a standby manually managed as described
> is destructive unless you cancel recovery, shut down, copy clone and do a
> startup rename resetlogs on the clone to test if you have in fact correctly
> manually managed a "roll your own" standby. Then if the open is successful
> you probably need to run a lot of reports to make sure the recovery test was
> actually successful rather than only apparently successful. Why I might not
> be satisfied unless all the weekly and month end reports appeared to be
> perfect! And who better to evaluate whether the reports look correct than
> the folks who otherwise might be running those reports on the production
> primary database?
> 
> Since manually managing a recovery standby is error prone, I do recommend
> executing this copy clone open frequently. The renamed database can be used
> as a "frozen reporting database" while recovery is resumed on the standby of
> the original name. If you're rolling your own rather than using Oracle's
> software to manage a standby, there are several things you can do to destroy the
> validity of the standby (such as unlogged actions on the "primary"). Of
> course if you regularly test your standby by doing a clone/rename/open
> resetlogs, that normally will decouple you from a simultaneous actual
> problem with your "primary." Over time you can do the math of frequency of
> errors detected with the standby process versus frequency of problems of the
> "primary" to determine your risk and ask management whether they want to
> spend more money to reduce that risk.

<snip>
