Re: Monitoring emails

  • From: Mark Brinsmead <mark.brinsmead@xxxxxxxxx>
  • To: Jeremy Schneider <jeremy.schneider@xxxxxxxxxxxxxx>
  • Date: Fri, 28 Aug 2015 14:47:32 -0600

For starters here, I will do as others have in this thread and assume that
"email" is equivalent to "page" for the purpose of this discussion.

I have generally found that good preventive maintenance is the best way to
minimize pages. If your "non-critical" checks are designed well, they will
predict conditions that may eventually lead to a more critical (pager)
event, allowing you to deal with them more-or-less at your convenience,
within regular working hours.

If you find yourself being paged because a tablespace has run out of space
or an index has reached maxextents (does that *ever* happen any more?) the
fault is almost certainly your own, for having failed to predict the problem
days earlier and fix it *before* a runtime incident could occur.
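
As a rough illustration of "predicting the problem days earlier", here is a
minimal sketch that estimates how many days remain before a tablespace fills,
given one usage sample per day. The function names, the 14-day warning window,
and the assumption of roughly linear growth are all mine, for illustration
only; real growth is often burstier than this.

```python
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until full from one usage sample per day (oldest first).

    Uses the slope of a least-squares line through the samples as the daily
    growth rate, then divides the headroom left after the latest sample by
    that rate. Returns None with fewer than two samples, or when usage is
    flat or shrinking (no fill date can be projected).
    """
    n = len(used_gb)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    slope = sxy / sxx  # estimated growth in GB per day
    if slope <= 0:
        return None
    return max(0.0, (capacity_gb - used_gb[-1]) / slope)


def needs_attention(used_gb: Sequence[float], capacity_gb: float,
                    warn_days: float = 14.0) -> bool:
    """True if the projected fill date falls inside the warning window."""
    days = days_until_full(used_gb, capacity_gb)
    return days is not None and days <= warn_days
```

A check like this belongs in the non-critical tier: it should produce a line in
a daily report, not a page.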

The other half of minimizing pages, of course, is to ensure that your
non-critical health checks do not page. There is little point in having
health checks with multiple levels of criticality, if all alerts are
delivered by the same mechanism. For the non-critical tests, a once-daily
report of all (unresolved) issues is probably what you want -- it's even
better if this automatically becomes some sort of checklist or to-do list
for the junior members of the team.
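
To make the "different delivery mechanism" idea concrete, here is a small
sketch: critical alerts go straight to a pager callback, while everything else
is held until it is resolved and rendered once a day as a checklist. The class
and method names are hypothetical, and the pager is just a callable you supply.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Alert:
    check: str          # name of the health check that fired
    message: str
    critical: bool = False


@dataclass
class AlertRouter:
    page: Callable[[Alert], None]            # called immediately for critical alerts
    pending: List[Alert] = field(default_factory=list)

    def submit(self, alert: Alert) -> None:
        """Page on critical alerts; queue everything else for the daily report."""
        if alert.critical:
            self.page(alert)
        else:
            self.pending.append(alert)

    def resolve(self, check: str) -> None:
        """Drop queued alerts for a check once the issue has been fixed."""
        self.pending = [a for a in self.pending if a.check != check]

    def daily_report(self) -> str:
        """Render unresolved non-critical alerts as a plain-text checklist."""
        if not self.pending:
            return "No unresolved non-critical issues."
        lines = ["Unresolved non-critical issues:"]
        lines += [f"[ ] {a.check}: {a.message}" for a in self.pending]
        return "\n".join(lines)
```

Keeping unresolved items in the report day after day is deliberate: an issue
only disappears when somebody fixes it and calls resolve().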

Of course, simply stratifying your tests into "critical" and "noncritical"
is not enough. You need to carefully craft your non-critical tests with
the intent of identifying and resolving problems *before* they become
critical.

At the end of the day, you probably end up doing almost as much work -- but
you'll be doing less of it at 3AM with the CIO breathing down your neck.

Once you get the paging under control, you can then look at practices (like
better storage management or segment management, or...) to automate or
eliminate the most common tasks arising from your non-critical monitoring.
This can also include things like identifying unstable infrastructure
elements (flakey storage, unreliable backup servers, etc.) and giving
yourself the kind of "solid infrastructure" Jeremy refers to. [I recall a
very concrete example of this -- I once had a client with severe and
widespread performance and stability issues almost all of which were
attributed to using NFS storage over a 10 Mbit ethernet. Strangely, it
took *years* to get them to do it, but once they upgraded the network to
Gigabit, my workload for that customer dropped by about 99% -- both in
terms of paging events and day-to-day maintenance.]

DBAs will probably never run out of work. But the more "silly stuff" you
can automate or eliminate, the more of your time you can spend delivering
the really high-value services you (and probably everybody else) prefer you
to be doing.

On Fri, Aug 28, 2015 at 2:03 PM, Jeremy Schneider <
jeremy.schneider@xxxxxxxxxxxxxx> wrote:

i head up much of this in our dba group - and i think we're pretty
brutal when it comes to trimming down the pages that actually come out
to our DBA phones/pagers. we've had weeks go by that we don't get a
single page and then i'm relieved when i get a page and see that
everything is still working fine. :) this is also because we have a
great team and a solid infrastructure with infrequent major problems.

my DBA team is not global, so we generally work north american
business hours. when my phone beeps, i usually go look at it even if
i'm in the middle of dinner with the kids. i value my own evenings
and weekends - and i value the time and attention of other DBAs that i
work with. so we don't want our phones to beep for anything that
could have waited until the next morning when someone gets to the office
and checks their email.

obviously we get paged if an important system is unavailable. we do
have some "non-production" systems which would need off-hours
attention if they are unavailable - of course this is really worked
out with the business. but we've ruthlessly trimmed down the noise and
our management goes to bat for us when everybody wants their stuff to
be critical. honestly, over the past few years, i can't think of many
issues we had where the business really needed a DBA to interrupt
their dinner or weekend to look at it immediately. there were a few,
but not many. just today i added a custom OEM metric on percentage of
processes used, because we became aware of an issue where something
could exhaust the processes on a database -- and we will now get a
page if process usage goes above a certain threshold. but we've
already taken some other steps to address the problem and i don't expect
many pages - if any.
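
for illustration, the arithmetic behind a "percentage of processes used"
check might look like the sketch below. this is not the actual OEM metric
described above; the function names and the 85% default threshold are
assumptions. the two inputs could come from something like V$RESOURCE_LIMIT
(CURRENT_UTILIZATION and LIMIT_VALUE for the 'processes' resource), but that
source is also an assumption here.

```python
def process_usage_pct(current: int, limit: int) -> float:
    """Percentage of the configured process limit currently in use."""
    if limit <= 0:
        raise ValueError("process limit must be a positive integer")
    return 100.0 * current / limit


def should_page(current: int, limit: int, threshold_pct: float = 85.0) -> bool:
    """True when process usage is above the paging threshold."""
    return process_usage_pct(current, limit) > threshold_pct
```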

now that was all just about pages. going back to the original
question about automated emails, it's another subject entirely. i like
to get emails and i have lots of server-side filters that move them
into folders that my email client doesn't even look at until i click
there. our SAs don't trim down their alerts like we do - so i get a
decent amount of traffic from their monitoring system. but i like
that. the key here is that it's all informational, and on the DBA team
we don't expect ourselves to read them - just what we're interested
in. people can set up filters to get rid of stuff they don't care
about. i don't usually check my email when i'm not at work, so it
doesn't bother me to get extra noise emails.

so far today i've got 4 pages (which also come as emails), 32
backup-related emails (there was a minor issue) and 33 miscellaneous
emails from monitoring systems that i actually watch - that is, i look
in the email folder occasionally and skim the subject lines and mark
them as read. of course i would give attention to anything that
really needs it, but all the critical stuff comes to our phones
anyway.

so far today i've got 192 "junk" emails from various other monitoring
systems which i don't watch at all - on rare occasions i'll dig into
one of those folders to look for something specific but otherwise i
completely ignore it.

-Jeremy

--
http://about.me/jeremy_schneider


On Fri, Aug 28, 2015 at 3:22 PM, Mayen Shah <mshah@xxxxxxxxxxxxxxx> wrote:
Your main goal should be to identify and act upon critical issues in your
environment. Of course there will be informational alerts/emails.



Imagine a few hundred alerts (minimum 1 minute per alert) * n number of DBAs.
Is it really productive? And among all the noise, the likelihood of missing
critical alerts is very high. One can argue that he/she ignores or does not
act upon x% of alerts. I am of the opinion that if you ignore any alert, it
is not worth alerting on.



I have worked in an environment where we would categorize alerts into
informative, warning, critical and emergency. We set up rules so emails are
organized and emergency and critical alerts are not missed.



Thanks

Mayen



From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
On Behalf Of Alfredo Abate
Sent: Friday, August 28, 2015 3:02 PM
To: veeeraman@xxxxxxxxx
Cc: ORACLE-L
Subject: Re: Monitoring emails



Ram,



The important question is how many are Critical alerts vs Warning alerts? I
would think if you are getting 100s of Critical alerts there is a problem.
:) I can see getting Warning alerts frequently but perhaps you filter those
to go to a different folder, etc. that can be reviewed a few times per day.



For us we get Warning alerts throughout the day (maybe 25 - 100) and
Critical alerts very few if any per day. It all depends what is important
to you and the team managing the databases. This is where trying to find a
good balance between being proactive towards preventing something and
general "noise" can become an art as much as it is a science.





Alfredo



On Fri, Aug 28, 2015 at 1:10 PM, Ram Raman <veeeraman@xxxxxxxxx> wrote:

List,



How many automated emails do listers get from all the databases that are
being monitored on a daily basis?



We get a few hundred emails a day (<100 DBs), but some new members here feel
that is too many and want us to cut down on that. I personally feel that
most of the messages are relevant to us.



Thanks

Ram.

--
//www.freelists.org/webpage/oracle-l


