RE: How I save Cingular Wireless USD 30M

  • From: "Kerber, Andrew W." <Andrew.Kerber@xxxxxxx>
  • To: oracle.list@xxxxxxxxx, "Jeremiah Wilton" <jeremiah@xxxxxxxxxxx>
  • Date: Mon, 27 Aug 2007 08:38:58 -0500

I'm curious.  Had Cingular done all their version upgrades using the
upgrade process as opposed to an export/import or some other method that
left a clean data dictionary?  It occurs to me that multiple Oracle
version upgrades on the same system could theoretically cause this
problem.

 

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx
[mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of Tom Pall
Sent: Sunday, August 26, 2007 12:02 AM
To: Jeremiah Wilton
Cc: Niall Litchfield; oracle-l
Subject: Re: How I save Cingular Wireless USD 30M

 

Let me state that this is the last challenge I will answer.

The database was spending extra CPU time.

The particular database which went belly up (but I had cloned and fixed
and fed the backlogged data to) was unusable. It would not open up.  So
no steps could have been taken.  I could quote iTARS from Oracle Support
on this but that is Oracle and Cingular confidential. 

The $30 million figure was my boss's.  He told me how much they were
planning to budget to get the databases working at a proper speed.  Not
prevent loss of data, just to upgrade hardware.

I'm sorry to say that yes, my solution would have required psychic
abilities or perhaps a somewhat talented DBA, because Oracle Support had
the problem of the slowness and deadlocks elevated and elevated.  They
couldn't go any higher and Oracle did not have a solution until I
dreamed about data dictionary corruption and came upon hcheck.sql and
deduced what that problem was.  It took a while for Oracle Support to
work with developers to verify that truly this corruption was the cause
of the increasing slowness.  Then it took a couple weeks of negotiations
(threats from Cingular to go to DB2) before they agreed to allow me to
fix the data dictionary and keep the database instead of their memorized
method of fixing the data dictionary so you could export the data and
import it into another database. 

How many times do I have to tell you?  I ran Statspack reports at the
highest level of detail until I was blue in the face.  I ran traces.  I
set events.  But I also am by nature intuitive and tend often to use
intuition to solve a problem with facts to back up my intuitive
conclusion.  So after providing all of this stuff to Oracle Support,
they were at a loss, well, they were very eager to look at corruption as
a cause, because they didn't have another solution. 

Yes, the problem was solved, over the duration of my stint with
Cingular.  (I had one database for which Oracle and I had to work up DML
against the data dictionary for a couple of months, then apply it to a
clone, which resulted in the clone pegging the CPU with SMON running for
6 weeks straight.)  And I had many of these databases.  The problem got
cleared up when finally all of the 5 types of data dictionary corruption
were fixed with a total of 12 techniques, which speeded up the databases
(saving $30 million in hardware upgrades and perhaps having to go to
RAC), and the databases were then converted to LMT.

So yes, I started on the problem during my first week at Cingular and
converted the last database to LMT during my last week at Cingular,
working on this problem (and the usual development/production DBA work)
for the duration of my tenure there.  The databases now have 10X as
much data as they had when they were built but run as fast as they did
when they were built years before.

I am hereby ending my participation in this thread.  Flame me all you
want, I will just hit the delete key.

Tom in Austin

On 8/25/07, Jeremiah Wilton <jeremiah@xxxxxxxxxxx> wrote:

Tom, 

You say that the 'orphaned segments' caused a performance problem.  What
was the database spending time doing to cause this performance problem?
If you had done nothing about the orphaned segments, what would have
prevented someone from taking the same steps to manually update the data
dictionary at the point that the database became so slow as to be
unusable?

Your assertion that you saved Cingular $30MM seems to imply that, had
you not taken action, there would have been complete loss of data.  Can
you characterize how that data loss would have occurred?

This response actually is not very technical.  My chief gripe is that it
doesn't say how a person like myself with no apparent psychic abilities 
vis-a-vis Oracle databases might have detected and resolved the problem.

Most people on this list (hopefully) use wait events, preferably via
ASH, to detect the root cause of performance problems.  How was the time
being accounted for in the wait event interface?  DD reads are
accounted in that interface just as normal index and heap segment reads
are.  So you can see why some people here who approach problems in an
empirical manner might have questions about the character of the
problem. 

My questions in no way are meant to invalidate the way that you solved
the problem.  After all, if you solved it, regardless of how you
obtained the solution, wasn't the problem solved?

Thanks

Jeremiah Wilton
ORA-600 Consulting
http://www.ora-600.net

Tom Pall wrote:
> I did the traces, ran Statspack till I was blue in the face, set the
> events to trap deadlocks.  I did all of the things a DBA would do but 
> decided that there was something deeper than just two applications
> colliding, because as I worked the problem over a two week period, I
> noticed the database slowing down.  Not waits slowing down, not I/O 
> slowing down, just throughput slowing down.  Slowing down in ways
> neither I nor Oracle Support could explain before my dream, research in
> Metalink and discovery of hcheck.sql in Metalink.
>
> Is this technical enough?

 

