RE: A Challenge - My Answer

Thanks for your answers. 
 
Some of you use software packages for this. That is nice when one is
available, but there is the time to install, configure, and deploy it to
multiple servers/environments (for me anyway). Others have written
scripts in the past but have since moved to other tools, and others had
scripts they are currently using.
 
My script is below. I took my load monitoring script, decided it was
pretty ugly, and thought I should instead write a script that monitors
any number, which would make it more useful for other things.
 
Some ways I could use the script below (the load average example is in
the script's usage text):
 
# if >= 200 oracle processes for > 30 minutes then alert and sleep
# 60 minutes
watchnum.ksh -t 30 -s 60 smon $(ps -ef | grep oracle | wc -l) 200 \
   epost1@xxxxxxxxx
 
# alert if a process has accumulated more than N cpu time and does not
# go away after N time.
Should be easy, sorry no example.
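 
For the CPU-time case, one possible sketch is to count the processes
whose accumulated CPU time is at or above a limit and feed that count to
watchnum.ksh with a threshold of 1. This is only an illustration: the
function name count_cpu_hogs and the watch ID cpuhog are made up here,
and it assumes ps prints TIME as [dd-]hh:mm:ss (or mm:ss on some
systems), which is worth checking on your platform.

```shell
#!/bin/sh
# Sketch only (hypothetical helper, not part of watchnum.ksh).
# Reads ps TIME values, one per line, on stdin and prints how many of
# them are at or above LIMIT minutes of accumulated CPU time.
# Assumed TIME formats: [dd-]hh:mm:ss, or mm:ss.
count_cpu_hogs() {
   awk -v limit="$1" '
      {
         days = 0; t = $1
         # Split off an optional leading "dd-" day count.
         if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
         n = split(t, p, ":")
         # Three fields means hh:mm:ss, two fields means mm:ss.
         mins = (n == 3) ? p[1] * 60 + p[2] : p[1]
         if (days * 1440 + mins >= limit) hogs++
      }
      END { print hogs + 0 }'
}

# Possible use with the script below (alert if at least one process has
# had 60+ minutes of CPU time for 30 minutes, then sleep 60 minutes):
# watchnum.ksh -t 30 -s 60 cpuhog \
#    "$(ps -e -o time= | count_cpu_hogs 60)" 1 epost1@xxxxxxxxx
```

Note that this watches the count, not a particular process, so one
process finishing while another crosses the limit would not reset the
timer.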
 
You get the idea. I am also going to port the script below to PL/SQL
and make a few enhancements.
 
- Ethan
 
========================================================================
BEGIN KSH SCRIPT
========================================================================
 
#!/usr/bin/ksh
 
typeset -i SLEEP_UNTIL_TIME OVER_THRESHOLD_TIME MIN_SLEEP_TIME \
   CURRENT_TIME BEGIN_TIME
typeset -u WATCH_ID
 
MAX_ALLOWED_TIME=0
MIN_SLEEP_TIME=0
SEND_OK_ON_RESET=N
LOG_FILE=/tmp/watchnum.log
TMP_DIRECTORY=/tmp
HEADLINE="watchnum.ksh"
 
function uhoh {
if (( ${1} )); then
   echo "uhoh: ${2}"
   exit 1
fi
}
 
function current_minutes {
   # This very UGLY function calculates the # of minutes since the
   # year 2000.
   MIN_YEAR=$( date +"%Y" )
   MIN_YEAR=$( expr ${MIN_YEAR} - 2000 )
   MIN_YEAR=$( expr ${MIN_YEAR} \* 525600 )
   MIN_DAYS=$( date +"%j" )
   MIN_DAYS=$( expr "${MIN_DAYS}" - 1 )
   MIN_DAYS=$( expr "${MIN_DAYS}" \* 1440 )
   MIN_HOURS=$( date +"%H" )
   MIN_HOURS=$( expr "${MIN_HOURS}" \* 60 )
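   MIN_MINS=$( date +"%M" )
   # Normalize through expr like the fields above, so that "08" and "09"
   # cannot be read as invalid octal numbers by the shell arithmetic.
   MIN_MINS=$( expr "${MIN_MINS}" + 0 )
   MIN_TOTAL=$(( ${MIN_YEAR} + ${MIN_DAYS} + ${MIN_HOURS} + ${MIN_MINS} ))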
   echo ${MIN_TOTAL}
}
 
CURRENT_TIME=$(current_minutes)
 
while getopts :t:s:a:l:oh: options
do
   case $options in
      t) MAX_ALLOWED_TIME=${OPTARG} ;;
      s) MIN_SLEEP_TIME=${OPTARG} ;;
      l) LOG_FILE="${OPTARG}" ;;
      o) SEND_OK_ON_RESET=Y ;;
      h) HEADLINE="${OPTARG}" ;;
     \?) print ${OPTARG} is not a valid argument. ;;
   esac
done
 
shift $(expr $OPTIND - 1)
 
usage() {
cat <<USAGE
 
Script:
watchnum.ksh
 
Options:
 
-o Sends an everything is OK message when the monitored value
   falls below the defined threshold.
-t Sets MAX_ALLOWED_TIME. The number of minutes the monitored
   value is allowed to exceed the threshold before triggering an alert.
-s Sets MIN_SLEEP_TIME. The number of minutes to ignore alerts
   for after an alert has been triggered. This helps cut down the
   number of emails and pages when you already know there is a problem.
-h Sets HEADLINE. This is the string that will appear in the subject
   of the email or page.
-l Sets LOG_FILE. This defaults to /tmp/watchnum.log unless specified.
 
Parameters (1-3 are required):
 
\$1 WATCH_ID - User specified ID for this alert, no spaces no silly
    characters.
\$2 CURRENT_VALUE - Current value of the number related to this alert.
\$3 THRESHOLD - The threshold that will trigger the alert.
\$4 EMAILS/PAGERS - List of emails with commas between them.
 
Examples:
 
# If server load average is over 8 for 2 hours send email.
watchnum.ksh -o -t 120 -s 180 -h "Server Load Warning" \\
   -l /home/oracle/log/watchnum.log loadavg \\
   \$(uptime | awk '{ print substr(\$(NF-2),1,4) }') \\
   8 epost1@xxxxxxxxx
 

USAGE
exit 1
}
 
if (( $# == 0 )); then
   usage;
fi
 

# Exit if these parameters are not supplied.
[[ -z "${1}" || -z ${2} || -z ${3} ]] && usage
 
WATCH_ID=${1}
CURRENT_NUMBER=${2}
THRESHOLD=${3}
EMAILS="${4}"
ALERT_OR_OK=
 
if [[ -n ${LOG_FILE} ]]; then
   touch ${LOG_FILE} || uhoh $? "Cannot create ${LOG_FILE}."
fi
 
TINY="${TMP_DIRECTORY}/watchval_${WATCH_ID}.dat"
[[ -f "${TINY}" ]] || echo "${WATCH_ID}:0:0" > ${TINY} ||
   uhoh $? "Could not create ${TINY}."
 
SLEEP_UNTIL_TIME=$(cat ${TINY} | awk -F":" '{ print $2}')
BEGIN_TIME=$(cat ${TINY} | awk -F":" '{ print $3}')
 
if (( ${CURRENT_NUMBER} >= ${THRESHOLD} )); then
 
   # When over threshold and begin is still zero, then this is the first
   # time over the threshold and we will set begin to current time.
   if (( ${BEGIN_TIME} == 0 )); then
      BEGIN_TIME=${CURRENT_TIME}
      echo "${WATCH_ID}:${SLEEP_UNTIL_TIME}:${BEGIN_TIME}" > ${TINY}
   fi
 
   # If we are not currently in a sleep cycle.
   if (( ${CURRENT_TIME} >= ${SLEEP_UNTIL_TIME} )); then
      # Get the # of minutes we have been over threshold.
      OVER_THRESHOLD_TIME=$( echo "${CURRENT_TIME} - ${BEGIN_TIME}" |
         bc -l )
      # If # of minutes is more than allowed trigger alert.
      if (( ${OVER_THRESHOLD_TIME} >= ${MAX_ALLOWED_TIME} )); then
         # We will sleep until stated, this will require an update to
         # the record.
         SLEEP_UNTIL_TIME=$( echo "${CURRENT_TIME} + ${MIN_SLEEP_TIME}" |
            bc -l )
         echo "${WATCH_ID}:${SLEEP_UNTIL_TIME}:${BEGIN_TIME}" > ${TINY}
         ALERT_OR_OK="ALERT"
      fi
   fi
else
   # If we fall under threshold reset the entire record.
   echo "${WATCH_ID}:0:0" > ${TINY}
   if (( ${BEGIN_TIME} > 0 )); then
      [[ "${SEND_OK_ON_RESET}" = "Y" ]] && ALERT_OR_OK="OK"
   fi
fi
 
echo "$(hostname)|${WATCH_ID}|$(date +"%m/%d/%Y %H:%M")|${CURRENT_NUMBER}|${ALERT_OR_OK}" \
   >> ${LOG_FILE}
 
if [[ -n "${ALERT_OR_OK}" ]]; then
   for EMAIL_ADDRESS in ${EMAILS}; do
      echo "${ALERT_OR_OK} ${WATCH_ID}=${CURRENT_NUMBER}, host=$(hostname)" |
         mailx -s "${HEADLINE}" "${EMAIL_ADDRESS}"
   done
fi
 
exit 0
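 
A note on current_minutes in the script above: it counts every year as
525600 minutes, so on January 1 after a leap year the counter repeats
the values of December 31, and it calls date several times, which can
straddle a minute boundary. A possible replacement is sketched below.
It is not the original function: the name minutes_since_2000 and its
four-argument interface are assumptions made for this example, and awk
does the arithmetic so that fields with leading zeros stay decimal.

```shell
#!/bin/sh
# Sketch only (hypothetical replacement for current_minutes).
# minutes_since_2000 YEAR DAY_OF_YEAR HOUR MINUTE
# Prints the number of minutes from 2000-01-01 00:00 to the given time,
# counting leap days under the Gregorian rules.
minutes_since_2000() {
   awk -v y="$1" -v j="$2" -v h="$3" -v m="$4" 'BEGIN {
      y += 0; j += 0; h += 0; m += 0   # force numbers, so "08" is 8
      # Leap days in the years 2000 .. y-1. The constant 484 is the
      # same count taken up to 1999 (499 - 19 + 4).
      leaps = int((y - 1) / 4) - int((y - 1) / 100) + int((y - 1) / 400) - 484
      days  = (y - 2000) * 365 + leaps + (j - 1)
      printf "%d\n", days * 1440 + h * 60 + m
   }'
}

# Possible use in the script, with a single date call so all four
# fields come from the same instant:
# CURRENT_TIME=$(minutes_since_2000 $(date +"%Y %j %H %M"))
```

Values from this function are not comparable with those already stored
by the original one, so existing watchval_*.dat files would need to be
removed when switching.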
