Re: Upgrade from 9.2 to 10.2 (NON-Rac) - Basic Steps

  • From: "Mark Strickland" <strickland.mark@xxxxxxxxx>
  • To: davidsharples@xxxxxxxxx
  • Date: Tue, 17 Oct 2006 11:51:03 -0700

I can't address the 9.2 to 10.2 upgrade steps other than to say TEST
TEST TEST and document the steps carefully down to the keystroke.
What I CAN talk about is the Upgrade From Hell that my co-DBA and I
did this past weekend.  We upgraded production from 10.1.0.3 to
10.1.0.5.  Just a patchset, so a rather minor upgrade, really.  In
theory, at least.  This is on Solaris 9, Veritas, Hitachi SAN, 3-node
RAC, Data Guard with physical and logical standbys, and an RMAN
catalog.  We had tested and documented everything very
carefully and were expecting a 3-hour cakewalk (but ready for
anything, of course).  Well.  It took from noon Saturday until 9:00
Sunday night to get stable again.  The upgrade itself took close to 4
hours.  However, for some so-far inexplicable reason, Oracle decided
to switch the VIPs to a different network interface on each RAC
server.  We re-booted the 3 servers, then Veritas couldn't mount all
the file systems.  My co-DBA knows Veritas well and got that cleaned
up and after another re-boot, the servers couldn't NFS-mount the file
system that is used for DB_FILE_RECOVERY_DEST.  That required a
static-IP fix from our network engineer.  So, once we got everything
restarted, the instances started crashing after 20-40 minutes.  The
rest of the weekend was spent on the phone with Oracle Support.  We
went through three staff shifts at Oracle Support and each handoff
required the new support engineer to go through the logs and trace
files and get up to speed on the issue.  We were about to punt and
switch to single-node and turn on more CPUs when an engineer in either
India or Australia (can't remember which...the engineer is Sandeep
Singla...BRILLIANT!) was able to identify the cause of the problem in
the VIP trace file.  The VIP was occasionally timing out while
checking the default gateway.  The timeout threshold was 2 seconds and
the engineer had us change that to 10 seconds.  The timeouts were
causing the instances on
the node to crash.  After 36 hours with 1 hour of sleep on a company
sofa for each of us and working with three shifts at Oracle Support,
our Production environment was stable again, just in the nick of time.
I'm almost caught up on sleep and I'm starting to unclench.
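
To make the timeout point concrete, here is a minimal sketch of why a
tight threshold can turn an occasional slow gateway response into a
"failure".  This is NOT Oracle's VIP check code; the function name and
the sample response times are made up purely for illustration.

```python
from typing import Iterable, List


def failed_checks(response_times_s: Iterable[float], timeout_s: float) -> List[int]:
    """Return the indexes of checks whose response time exceeded timeout_s."""
    return [i for i, t in enumerate(response_times_s) if t > timeout_s]


# Hypothetical gateway response times in seconds (not measurements from
# our system): mostly fast, with an occasional slow reply.
samples = [0.01, 0.02, 3.5, 0.01, 2.4, 0.03]

# With a 2-second threshold the two slow replies count as failures.
print(failed_checks(samples, timeout_s=2))

# With a 10-second threshold none of them do.
print(failed_checks(samples, timeout_s=10))
```

The trade-off is that a longer threshold also means a genuinely dead
gateway takes longer to detect.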

As you might guess, I'm now even more motivated to understand RAC
inside and out.

Other than using this forum to b***h about our upgrade experience, I
hope to have provided useful information.

Vivek, if you'd like a copy of our upgrade plan, I'd be happy to send
it.  It won't be directly applicable to your upgrade, of course, but
it might be useful.

Regards,
Mark Strickland
Seattle, WA
--
//www.freelists.org/webpage/oracle-l

