Tivoli has been configured on the database servers to restart the
monitoring scripts, including OSWatcher, whenever they go down.
When Tivoli restarts the OSWatcher script, we have observed that commands
such as ps -elk (from topaix.sh) and ps -ae -o
user,pid,ppid,pri,pcpu,pmem,vsz,rssize,wchan,stat,etime,time,args (from
psmemswsub.sh) hang, and many of these processes get spawned. Eventually
the server becomes unresponsive under the heavy load caused by these
processes.
Normally, the ppid of each of the above commands is the pid of topaix.sh
or psmemswsub.sh respectively, but in this case the ppid of these commands
is 1.
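
To check for this symptom, a small filter along the following lines could be used. It is only a sketch: orphaned_ps is a hypothetical helper name, not part of OSWatcher or Tivoli, and it assumes the input has the pid, the ppid, and the command line as its first three columns.

```shell
#!/bin/sh
# Hypothetical helper (not part of OSWatcher or Tivoli).
# Reads "pid ppid args" lines on stdin and prints only those whose
# parent is PID 1 and whose command is ps (with or without a path).
orphaned_ps() {
    awk '$2 == 1 && $3 ~ /(^|\/)ps$/ { print }'
}

# Possible usage on the affected server:
#   ps -e -o pid,ppid,args | orphaned_ps
```

Running it a few times after a Tivoli-initiated restart would show whether the number of ps processes with ppid 1 keeps growing.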
The following is from the Monitoring team:
The Tivoli agent (k08agent) runs as root and executes a script to check for
processes running. If it finds a process down, it attempts to restart it
using the command provided by the support/app team. In this case:
su - oracle -c "/users/oracle/local/prod/sh/start_osw_generic.ksh"
This command is put into the RESTART_PROCESS variable and executed as follows:
eval "$RESTART_PROCESS &" > /dev/null 2>&1
$ cat start_osw_generic.ksh
nohup /ora45/dbworkspace/OSWATCHER/oswbb/startOSWbb.sh 30 120 &
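
To compare the interactive run with the agent-initiated one, the launch could be approximated by hand with something like the sketch below. It reuses the eval form quoted above. The function name run_like_agent and the stdin redirect from /dev/null are assumptions made for illustration, since the agent's real environment is not known here.

```shell
#!/bin/sh
# Hypothetical reproduction helper; the agent's actual environment
# (terminal, variables, limits) is NOT known and may differ.
run_like_agent() {
    RESTART_PROCESS=$1
    # Same eval form as quoted by the Monitoring team. Reading stdin from
    # /dev/null is an assumption meant to mimic a launch with no terminal.
    eval "$RESTART_PROCESS &" > /dev/null 2>&1 < /dev/null
}

# Possible usage as root:
#   run_like_agent 'su - oracle -c "/users/oracle/local/prod/sh/start_osw_generic.ksh"'
```

If the ps commands also hang when started this way, the difference is more likely in how the command is launched than in the OSWatcher scripts themselves.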
Operating system: AIX 7.2, 64-bit, P9.
We first tried OSW version 7.3 and then version 8, just to rule out any
issue with the OSW version.
Note: When we run the command su - oracle -c
"/users/oracle/local/prod/sh/start_osw_generic.ksh" after logging in as
the root user, we do not face any issue.
Please let me know if you have any suggestions on this.