How I save Cingular Wireless USD 30M

  • From: "Tom Pall" <oracle.list@xxxxxxxxx>
  • To: oracle-l <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 24 Aug 2007 14:13:56 -0500

I was asked to look into a database which was deadlocking.  This was one of
dozens of databases which handled switch data.  Nothing in the telecoms is
small.  These were up to 17 TB databases.  And they were on 8i and 9i and
dictionary managed.

I looked and looked.  I could find a reason why the database was deadlocking
except that it was running more slowly than before (vendor application which
had not been changed).  I opened up an iTAR with Oracle and of course got
back the usual "How to tune a database" and "What causes deadlocks"
citations.  I was infuriated. I've been with Oracle since Oracle 2, so I
didn't need this.

I thought and thought.  I ruminated.  I searched metalink.  In a dream one
night I had the thought "data dictionary corruption".  No kidding.  In a
dream.  So I searched Metalink and found script to report data dictionary
corruption.  I ran it and found that the database that was suffering from
deadlocks had an amazing number of orphaned index segments.  I called up
Oracle, got eventually able to pull rank with Cingular's rep and got to talk
to the head of Oracle Support and some internal gurus.  Yes, they said,
we've checked with developers and most likely internal errors (not one of
the ora-xxxx ones) was probably the culprit.  They told me to export the
database.  I told them to shove it, as I timed what an export/import would
take on this 24 X 7 realtime database:  12 months.  Oracle support said that
they could NOT work with me to fix these orphaned segments and what later
became 7 other types of corruption I discovered.  So now we're at the VP -
VP level.  Finally Oracle gives in and allows me to fix the data dictionary
/and keep the database and Oracle's support of it/.  I fixed it by getting
everyone off, putting it into restricted mode, running DML against that data
dictionary, committing than ABORTING the database.  Abort is necessary
because the data dictionary doesn't follow the SCNs the way normal DML
does.  So I restarted the database and SMON pegged at 90%.  After a couple
of hours, it was over.  I unrestricted the database, let the application in
and the database was running 4X faster than it had previously run.  We had
one database that was so highly corrupted that Oracle Support and I gave it
4 months to live.  So we cloned the database and kept feeds from the
previous database.  I spent Christmas and New Years two weeks a couple of
years ago, with the flu and a fever of 102 F working 14 hours home, getting
it ready to un-corrupt.  When I ran the DML, committed and bounced the
database, SMON spun at 90% for 6 weeks straight.  Eventually it stopped
spinning, everything was fixed, I got the old feed into the database and got
it ready, two months before the 4 month death sentence the then new
department of Oracle Support Internals I caused to come into being gave the
parent database to live.  I announced that the new clone was ready and open
for business, but of course no one paid attention until the day the parent
database closed down and refused to open.  There was a mad rush to certify
the new database.

All in all, I saved Cingular approximately USD 30M in not losing these very
vital databases and not having to continue to buy yet more hardware as they
had been doing for years before this DBA cane on the scene.  I achieved
increase in speed of the Oracle Engine from 2-10X what it was previously.

What exactly did I do to fix the corruption.  Sorry, but that is Oracle
Confidential.  If you have this problem I can give you names within Oracle
but go no further.   Yes, we eventually were able to convert to DMT (after
even more corruption fixing) and now the problem is gone.

Tom Pall
Oracle DBA

Other related posts: