Re: How I save Cingular Wireless USD 30M

  • From: "Tom Pall" <oracle.list@xxxxxxxxx>
  • To: "Bill Ferguson" <wbfergus@xxxxxxxxx>
  • Date: Mon, 27 Aug 2007 15:47:24 -0500

Cingular had done its upgrades in place over the years, because the databases
were always gigantic for the hardware they ran on.

It did not occur to Oracle Support experts, whose names I prefer to not
mention, that this problem happens with upgrades.  It is a /known/ problem
with dropping partitioned IOTs that sometimes the recursive DDL does not
complete, leaving the data dictionary in a /mess/.

I told everyone, if you'd only read, how you can tell if you have a corrupt
data dictionary.  Run hcheck.sql.

There were no messages in the alert log besides deadlock messages.  I
repeat.  Oracle was throwing /internal/ errors, not the errors listed in the
message/event file, which contains all of the ORA-XXXX codes.

This is a known problem, this dropping IOT partitions.  But it does not have
a bug number that I know of because a bug implies something that's been
caught by Oracle.  They know it exists, but don't know any more about it.
Or at least they didn't know any more about it when I left Cingular.  I do
know that it was not fixed in 10gR1, according to the Oracle Internals gurus
I dealt with.

I just don't think everybody gets it.  Please try to think outside the box
just for a moment.  This was not your common, everyday, run-of-the-mill
problem which revealed itself in the alert log.  I did everything a DBA
could do: set events, ran traces, ran Statspack reports, iostat, vmstat.  I
was able to correlate the slowness of queries and batch jobs with the results
from hcheck.sql.  I had my boss, the SA, bring back 6-month-old and
1-year-old backups of the databases.  I checked the hcheck results in them,
checked the Statspack reports, and looked at the application's log to see
how long the batch jobs took to run.  The more data dictionary (dd)
corruption, the slower things ran; the less corruption, the faster things
ran.  And the Oracle Support Internals Group and I predicted the demise of
our biggest database.  We gave it 5 months; it died after 7 months, as the
hcheck output got bigger and bigger.
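
For anyone who wants to try the same kind of correlation on their own
systems, here is a minimal sketch in Python.  It is not what was run at
Cingular.  It assumes you have counted the problems hcheck.sql reported in
each restored backup and recorded the matching batch run time.  The function
name and the sample numbers are hypothetical placeholders.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder numbers, NOT measurements from any real system:
# one entry per restored backup, oldest first.
hcheck_findings = [10, 20, 30, 40]   # problems reported by hcheck.sql
batch_minutes = [60, 80, 100, 120]   # run time of the same batch job

r = pearson(hcheck_findings, batch_minutes)
print(f"correlation between hcheck findings and batch run time: {r:.2f}")
```

A coefficient near +1 would be consistent with the pattern described above
(more findings, longer runs).  On its own it does not prove cause.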

I feel that I've explained enough.  If you discover that your database seems
to be slowing down and deadlocks are appearing even though no one has changed
the code or the load, then run hcheck.sql, take the results and contact the
Oracle Support Internals Group.  They and their management will remember that
guy at Cingular Wireless with those dozens and dozens of iTARs.

Tom Pall
An Oracle DBA who's very sorry he shared this with Oracle-L:

On 8/27/07, Bill Ferguson <wbfergus@xxxxxxxxx> wrote:
>
> Well along with "liking to know" how to fix the problem, which evidently
> we won't know unless or until the exact same symptoms appear on our systems
> and Oracle Support divulges the information then, I'd also like to know what
> caused the problems and what exactly the symptoms were.
>
> Just a "slow" database is rather vague. Were there consistent messages in
> the alert log that pointed to something, or was everything acting like 10x
> more users than normal were accessing the system, or what?
>
> Knowing what caused the problem as well would be beneficial, in case the
> same sort of process (or processes) were taken here. I've only done an
> "upgrade" once, usually I prefer to always do a clean install and then
> export from the old version and import into the new version, just so
> everything stays as "clean" as possible, but if this was done at Cingular,
> did anybody have any ideas on how the corruption occurred in the first place?
>
>
> --
> -- Bill Ferguson
>
