So, to prevent future outages, we're going to monitor this process with a User Defined Metric (UDM) and have Grid Control fire off a notification if the service is offline. At the end of every maintenance window, if Grid Control says all systems are online, my team can be more confident that they actually are.
I found a great article by Sagar Patil titled "Oracle Grid Tracking OS Process Using a Custom User Defined Matrix udmshell script", which is the basis for what follows.
I created the following script, which checks whether exactly ARG2 processes matching ARG1 are running and prints either em_result=Online or em_result=Offline:
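#!/usr/local/bin/bash
# Report whether the expected number of matching processes is running.
if [ $# -ne 2 ]
then
    echo "Usage: $(basename "$0") {ORACLE_SID} {Expected Count}"
    exit 1
fi
toCheck=$1
# Count matching processes, excluding the grep commands and this script itself.
testEm=$(/bin/ps -ef | /usr/local/bin/grep -v grep | /usr/local/bin/grep -v "$0" |
    /usr/local/bin/grep "$toCheck" | /usr/local/bin/wc -l)
# Grid Control reads the value after "em_result=" as the metric value.
if [ "$testEm" -eq "$2" ]
then
    echo "em_result=Online"
else
    echo "em_result=Offline"
fi
Next I create the UDM.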
From the target Host screen, click on "User-Defined Metrics", then click create.
Here are the parameters I used:
Metric Name: GURJOBS DEVL
Metric Type: String
Command Line: /banner/u01/app/oracle/product/agent11g/sysman/emd/custom/checkSubsystems.sh gurjobDEVL 1
User Name: Oracle
Comparison Operator: Match
Critical: Offline
The rest I left as default. After hitting OK, it took about 15 minutes to show that gurjobDEVL was in fact online. Shutting down gurjobs in DEVL returned an error in the UDM screen but did not send a notification.
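Since the agent took a while to report, it can help to confirm the Online/Offline logic by hand before waiting on Grid Control. The following is only a sketch of the same counting test, not the script above: the function name em_status and the sample process listing are made up for illustration, and the function reads the listing from standard input instead of calling ps so it can be tried against canned data.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): reads a process
# listing on stdin and prints the em_result line Grid Control expects.
#   $1 = string to look for, $2 = expected number of matching lines
em_status() {
    pattern=$1
    expected=$2
    # grep -c prints the number of matching lines (0 when none match);
    # -F treats the pattern as a fixed string rather than a regex.
    found=$(grep -c -F -e "$pattern")
    if [ "$found" -eq "$expected" ]; then
        echo "em_result=Online"
    else
        echo "em_result=Offline"
    fi
}

# Made-up listing standing in for `ps -ef` output.
sample='oracle  4321     1  0 09:00 ?  00:00:01 gurjobDEVL
oracle  4400     1  0 09:00 ?  00:00:00 tnslsnr LISTENER'

printf '%s\n' "$sample" | em_status gurjobDEVL 1   # one match, one expected
printf '%s\n' "$sample" | em_status gurjobDEVL 2   # one match, two expected
```

The first call prints em_result=Online and the second prints em_result=Offline. The real script can be checked the same way by running the Command Line value from the UDM definition in a shell as the UDM user and looking for one of those two lines.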
To enable notifications, I had to do the following:
Grid Control User Preferences -> Rules -> Select "Host Availability and Critical States", then click Edit. In the Metrics tab, click Add, enter "User" in the search and click Go.
To enable the metric, check the box next to User Defined Metric, choose the "Critical" and "Clear" severity states, click Continue, and then click OK.
Whoever is subscribed to that notification rule will be sent a notification per their preferences.
In this simple case, knowing that Banner Jobsub is offline is great. Another example would be CAPP Pipes. The next step is to build in automatic restart scripts, but that will have to wait for another blog post.