RE: 11g fault diagnosability infrastructure and poor documentation

From: "Robert Freeman" <robertgfreeman@xxxxxxxxx>
To: <jeremiah@xxxxxxxxxxx>, "ORACLE-L" <oracle-l@xxxxxxxxxxxxx>
Date: Tue, 2 Oct 2007 20:40:44 -0600
The health checks are perhaps better documented, and you can get some
insights into them by using OEM which provides a window into them (the
"checkers" as they are called). They feed some of the new features like the
Data Recovery Advisor and so on.

The incident packages are probably easier to understand if you go through
OEM too and follow some of the workflow they have there. I noticed that when
I created a database with DBCA that there were two corrupted segments in the
data dictionary just waiting for me to package. Did anyone else notice that?

As for the rest..... yeah, there is some frustration there. I don't like
that I have to set up the config manager to use the automated SR packaging to
its fullest extent (I just want to put in my Metalink ID and have it go
from there).

As for automatic hang detection.... well, if I could simulate a hang
reliably... ;-)

There are a lot of new mysteries in 11g that can potentially slide up and
bite you.... SQL Plan Management is a big one IMHO. Beware. It's a good
idea, but can really cause you grief if you are tuning and don't realize
what is going on in the background.

I also think this is the tip of the iceberg ... 11gR2 and beyond will likely
build on these architectures. It's going to be important to understand these
new architectural features (particularly things like Automatic SQL Tuning
and SQL Plan Management)... perhaps more than ever since they have more
potential to jump up and kick us upside the head.

Of course, you can just turn them off too... ;-)

Much of this is covered in my 11g New Features book.... Coming soon!! Very
soon!!

RF

Robert G. Freeman
Oracle Consultant/DBA/Author
Principal Engineer/Team Manager
The Church of Jesus Christ of Latter-Day Saints
Father of Five, Husband of One,
Author of various geeky computer titles
from Osborne/McGraw Hill (Oracle Press)
Oracle Database 11g New Features Now Available for Pre-sales on Amazon.com!
BLOG: http://robertgfreeman.blogspot.com/
Sig V1.2

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx
[mailto:oracle-l-bounce@xxxxxxxxxxxxx]On Behalf Of Jeremiah Wilton
Sent: Tuesday, October 02, 2007 7:30 PM
To: ORACLE-L
Subject: 11g fault diagnosability infratructure and poor documentation


Am I the only one who has been unable to do much with this feature due
to the woefully absent documentation?  Three components of "fault
diagnosability" in particular seem very interesting:

- automatic hang detection
- automatic reactive "health checks"
- incident packages as a replacement for RDA

Hang detection seems like a great idea, but there is no information on
precisely what constitutes a "hang" according to DIAG and DIA0.  These
processes seem never to wake up, even in the most dire of hanging
situations.  I did find that by default in single-instance databases,
the _hang_resolution, _hm_analysis_output_disk and _hm_log_incidents
parameters are set to FALSE, which I take to mean the feature is turned
off.  Even turned on, long hangs involving chains of waiters visible in
hanganalyze output do not trigger any actions that I can discern. This
is slightly complicated by the fact that two components of "fault
diagnosability" share the initials HM, and packages, parameters and
views use HM interchangeably to mean "hang manager" and "health monitor".
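
For anyone who wants to check the same three settings on their own system,
here is a rough sketch of one way to read them. It assumes a connection as
SYS, since the undocumented x$ksppi and x$ksppcv fixed tables are not
normally visible otherwise, and it assumes a Python DB-API cursor that
accepts Oracle-style positional binds. The function name and the HANG_PARAMS
constant are made up for illustration; only the three parameter names come
from the text above.

```python
# Sketch only: x$ksppi / x$ksppcv are undocumented fixed tables and are
# normally readable only when connected as SYS.
HIDDEN_PARAM_SQL = """
SELECT i.ksppinm  AS name,
       v.ksppstvl AS value,
       i.ksppdesc AS description
  FROM x$ksppi i
  JOIN x$ksppcv v ON v.indx = i.indx
 WHERE i.ksppinm = :1
"""

# The three parameters named in the paragraph above.
HANG_PARAMS = ("_hang_resolution", "_hm_analysis_output_disk", "_hm_log_incidents")


def fetch_hidden_params(cursor, names=HANG_PARAMS):
    """Return {name: (value, description)} for each parameter that is found.

    `cursor` is any DB-API cursor supporting positional binds (:1).
    The caller owns the connection and is responsible for closing it.
    """
    found = {}
    for name in names:
        cursor.execute(HIDDEN_PARAM_SQL, [name])
        for pname, value, desc in cursor.fetchall():
            found[pname] = (value, desc)
    return found
```

Pass in a cursor from whatever Oracle driver you use, then print the
returned dictionary. A parameter that does not exist in your release is
simply missing from the result rather than raising an error.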

As for Health Checks, there is no documentation indicating what kinds of
events or incidents might result in a "reactive" health check.  The
existence of reactive health checks is repeatedly asserted in the
documentation, and there is even a parameter called _diag_hm_rc_enabled
with the description "Parameter to enable/disable Diag HM Reactive
Checks".  Set to FALSE by default, this parameter does nothing in the
event of a badly degraded and hanging system either.  We are left to
wonder what "reactive" health checks react to!

Finally, the incident packaging service works well enough, but is
predicated completely upon the notion that any and all problems will be
associated with a fatal error of some kind.  Anything that does not dump
ORA-600 or another fatal error will not result in an "incident" and thus
there is nothing to package.  There is apparently no provision for
problems that do not dump on an error. So, an on-demand incident package
apparently cannot be created.  Thus, despite the incident payloads
having many of the same contents as the horrid RDA of yore, you cannot
generate one on demand in a supported way.  You can shoot a server
process with a SIGSEGV, but I cannot imagine that is how Oracle intends
us to get diagnostic data for opening an SR.

You can probably detect that I am frustrated, but I have been playing
with this feature set for weeks, and it is a frustrating morass of
nonworking, undocumented wastes of server memory.  Remember, we are all
now running two extra background processes, DIAG and DIA0, just for this
feature.  They are up and running and using memory on all of our 11g
systems even if they do nothing and are turned off at the parameter
level by default.

I am ranting here in hopes that someone else has gotten further than I
have or knows someone on the inside who can shed some light on these
concerns.

Thanks,

Jeremiah Wilton
ORA-600 Consulting
--
//www.freelists.org/webpage/oracle-l

