Over the weekend we had a SungardHE Banner subsystem offline. This particular subsystem uses Oracle Pipes to manage user job submission. The downtime could have been prevented if this subsystem had been represented as a target within Grid Control.
So, to prevent future outages, we're going to make this process a User Defined Metric (UDM) and have Grid Control fire off a notification if the service is offline. At the end of every maintenance window, if Grid Control says all systems are online, then my team can have more confidence that all systems are actually online.
I found a great article by Sagar Patil titled "Oracle Grid Tracking OS Process Using a Custom User Defined Matrix udmshell script", which is the basis for what follows.
I created the following script that determines if ARG1 has ARG2 processes running and returns either Online or Offline:
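#!/usr/local/bin/bash
# Prints em_result=Online if exactly {Expected Count} processes match the first argument,
# otherwise prints em_result=Offline.
if [ $# -ne 2 ]
then
    echo "Usage: $(basename "$0") {ORACLE_SID} {Expected Count}"
    exit 1
fi
toCheck=$1
# Count matching processes, excluding the grep commands and this script itself.
testEm=$(/bin/ps -ef | /usr/local/bin/grep -v grep | /usr/local/bin/grep -v "$0" |
    /usr/local/bin/grep "$toCheck" | /usr/local/bin/wc -l)
if [ "$testEm" -eq "$2" ]
then
    echo "em_result=Online"
else
    echo "em_result=Offline"
fi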
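To sanity-check the counting logic without stopping a real subsystem, here is a minimal sketch that reads the process listing from stdin instead of calling ps. The function name check_status and the canned listings are my own inventions for illustration and are not part of the script above. It also uses plain grep from the PATH rather than /usr/local/bin/grep.

```shell
#!/bin/bash
# Hypothetical helper: same Online/Offline decision as the script above,
# but the process listing arrives on stdin so canned input can be used.
check_status() {
    local pattern=$1
    local expected=$2
    local count
    # Drop lines mentioning grep, then count lines matching the pattern.
    # "--" keeps a pattern that starts with "-" from being read as an option.
    count=$(grep -v grep | grep -c -- "$pattern")
    if [ "$count" -eq "$expected" ]; then
        echo "em_result=Online"
    else
        echo "em_result=Offline"
    fi
}

# One matching line, one expected: prints em_result=Online
printf '%s\n' 'oracle 101 1 gurjobs gurjobDEVL' 'oracle 102 1 pmon' | check_status gurjobDEVL 1

# No matching line, one expected: prints em_result=Offline
printf '%s\n' 'oracle 102 1 pmon' | check_status gurjobDEVL 1
```

Note that the comparison is for an exact count, so two matching processes with an expected count of 1 would also report Offline, just as in the original script.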
Next, I created the UDM.
From the target Host screen, click on "User-Defined Metrics", then click Create.
Here are the parameters I used:
Metric Name: GURJOBS DEVL
Metric Type: String
Command Line: /banner/u01/app/oracle/product/agent11g/sysman/emd/custom/checkSubsystems.sh gurjobDEVL 1
User Name: Oracle
Comparison Operator: Match
Critical: Offline
The rest I left as default. After hitting OK, it took about 15 minutes to show that gurjobDEVL was in fact online. Shutting down gurjobs in DEVL returned an error in the UDM screen but did not send a notification.
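Since the metric takes a while to refresh, it can save time to confirm up front that the command prints exactly one em_result= line with one of the two values the Match operator is compared against. The sketch below is a hypothetical pre-flight check of my own (validate_udm_output is not a Grid Control tool), and it assumes the only valid values are the Online and Offline strings used in this post.

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeeds only if stdin is exactly one line
# reading em_result=Online or em_result=Offline.
validate_udm_output() {
    local output
    # Command substitution strips the trailing newline; any extra line
    # leaves an embedded newline, so the patterns below will not match.
    output=$(cat)
    case "$output" in
        em_result=Online|em_result=Offline) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with canned output; in practice, pipe the real command line in instead.
if echo "em_result=Online" | validate_udm_output; then
    echo "output looks usable"
else
    echo "unexpected output"
fi
```

A usage message or a stray extra line fails the check, which is the kind of output that would otherwise only show up as a problem on the UDM screen some minutes later.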
To enable notifications, I had to do the following:
Grid Control User Preferences -> Rules -> Select "Host Availability and Critical States", then click Edit. In the Metrics tab, click Add, enter "User" in the search and click Go.
To enable the metric, check the box next to User Defined Metric, choose the "Critical" and "Clear" severity states, click Continue, and then click OK.
Whoever is subscribed to that notification rule will be sent a notification per their preferences.
In this simple case, knowing that Banner Jobsub is offline is great. CAPP Pipes would be another good candidate for the same kind of check. The next step is to build in automatic restart scripts, but that will have to wait for another blog post.