Re: How I save Cingular Wireless USD 30M

  • From: "Tom Pall" <oracle.list@xxxxxxxxx>
  • To: "Jeremiah Wilton" <jeremiah@xxxxxxxxxxx>
  • Date: Sun, 26 Aug 2007 00:02:22 -0500

Let me state that this is the last challenge I will answer.

The database was spending extra CPU time.

The particular database which went belly up (but I had cloned and fixed and
fed the backlogged data to) was unusable. It would not open up.  So no steps
could have been taken.  I could quote iTARS from Oracle Support on this but
that is Oracle and Cingular confidential.

The $30 million figure was my boss's.  He told me how much they were
planning to budget to get the databases working at a proper speed.  Not to
prevent loss of data, just to upgrade hardware.

I'm sorry to say that yes, my solution would have required psychic abilities
or perhaps a somewhat talented DBA, because Oracle Support had the problem
of the slowness and deadlocks elevated and elevated.  They couldn't go any
higher and Oracle did not have a solution until I dreamed about data
dictionary corruption and came upon hcheck.sql and deduced what that problem
was.  It took a while for Oracle Support to work with developers to verify
that truly this corruption was the cause of the increasing slowness.  Then
it took a couple weeks of negotiations (threats from Cingular to go to DB2)
before they agreed to allow me to fix the data dictionary and keep the
database instead of their memorized method of fixing the data dictionary so
you could export the data and import it into another database.

How many times do I have to tell you?  I ran Statspack reports at the
highest level of detail until I was blue in the face.  I ran traces.  I set
events.  But I am also intuitive by nature and often use intuition to solve
a problem, with facts to back up my intuitive conclusion.  So after I
provided all of this to Oracle Support and they were still at a loss, they
were very eager to look at corruption as a cause, because they didn't have
another solution.

Yes, the problem was solved, over the duration of my stint with Cingular.
(For one database, Oracle and I had to work up DML against the data
dictionary for a couple of months, then apply it to a clone, which resulted
in the clone pegging the CPU with SMON running for 6 weeks straight.)  And I
had many of these databases.  The problem got cleared up when finally all of
the 5 types of data dictionary corruption were fixed with a total of 12
techniques, which not only sped up the databases (saving $30 million in
hardware upgrades and perhaps having to go to RAC), and then converting to

So yes, I started on the problem during my first week at Cingular and
converted the last database to LMT during my last week at Cingular, working
on this problem (and the usual development/production DBA work) for the
duration of my tenure there.  The databases now have 10X as much data as
they had when they were built but run as fast as they did when they were
built years before.

I am hereby ending my participation in this thread.  Flame me all you want,
I will just hit the delete key.

Tom in Austin

On 8/25/07, Jeremiah Wilton <jeremiah@xxxxxxxxxxx> wrote:
> Tom,
> You say that the 'orphaned segments' caused a performance problem.  What
> was the database spending time doing to cause this performance problem?
> If you had done nothing about the orphaned segments, what would have
> prevented someone from taking the same steps to manually update the data
> dictionary at the point that the database became so slow as to be
> unusable?
> Your assertion that you saved Cingular $30MM seems to imply that had you
> not taken action that there would have been complete loss of data.  Can
> you characterize how that data loss would have occurred?
> This response actually is not very technical.  My chief gripe is that it
> doesn't say how a person like myself with no apparent psychic abilities
> vis-a-vis Oracle databases might have detected and resolved the problem.
> Most people on this list (hopefully) use wait events, preferably via
> ASH, to detect the root cause of performance problems.  How was the time
> being accounted for in the wait event interface?  DD reads are
> accounted in that interface just as normal index and heap segment reads
> are.  So you can see why some people here who approach problems in an
> empirical manner might have questions about the character of the problem.
> My questions in no way are meant to invalidate the way that you solved
> the problem.  After all, if you solved it, regardless of how you
> obtained the solution, wasn't the problem solved?
> Thanks
> Jeremiah Wilton
> ORA-600 Consulting
> Tom Pall wrote:
> > I did the traces, ran Statspack till I was blue in the face, set the
> > events to trap deadlocks.  I did all of the things a DBA would do but
> > decided that there was something deeper than just two applications
> > colliding, because as I worked the problem over a two week period, I
> > noticed the database slowing down.  Not waits slowing down, not I/O
> > slowing down, just throughput slowing down.  Slowing down in ways
> > neither I nor Oracle Support could explain before my dream, research in
> > Metalink and discovery of hcheck.sql in Metalink.
> >
> > Is this technical enough?
