Wednesday, March 30, 2011

Grid control claims App Server 10.1.2 is down when console says it's up

I'm currently moving all of my targets from grid control repositories, and in the process I'm cleaning up anything that doesn't need to be there or that reports as being down. For some reason I like to see the fully green pie chart.

I had added one of my Oracle mid tiers, which runs to support our forms application, and grid control reported that the app server was down, but I knew for a fact it was not. All the subcomponents were reporting as up, opmnctl reported everything up, and the iasconsole on the server itself reported everything was up. In summary, it was up.

After much digging through perl code and metalink docs I have resolved the issue. It comes down to Solaris 10 security: the /usr/ucb/ps command can't report extended process info for a process unless you are the owner of that process. There was a pretty good metalink document that put me on this trail, but it didn't solve the problem itself. Here's a metalink ID roundup of what helped (or didn't, but was related).

395013.1 - Application Server shown as Down in Grid Control even though all of its Components are shown as up
276350.1 - How to Enable the Metric Browser/Agent Browser for the Oracle Management Agent [Video]

Enabling the metric browser was essential for this, in addition to running the perl scripts manually from the host.

So the problem ended up being this: the agent checks for forms processes on the host and reports back the sid.
Run this as your agent owner, and then also as root to see the difference.
/usr/ucb/ps -axww | grep oc4j.jar | grep OC4J_BI_Forms | grep <$ORACLE_HOME>
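
To make the two runs easier to compare, the same pipeline can be wrapped so it prints a match count instead of the raw lines. This is only a sketch: the function name count_forms_procs and the PS_CMD override are my own illustrative names, not anything the agent ships with.

```shell
#!/bin/sh
# count_forms_procs (hypothetical helper): count the ps lines that match
# the Forms OC4J instance for a given ORACLE_HOME.
# PS_CMD is an assumed override so the ps command can be swapped out;
# it defaults to the command used in the post.
count_forms_procs() {
    oracle_home="$1"
    ps_cmd="${PS_CMD:-/usr/ucb/ps -axww}"
    # $ps_cmd is left unquoted on purpose so "cmd args" splits into words.
    $ps_cmd 2>/dev/null \
        | grep oc4j.jar \
        | grep OC4J_BI_Forms \
        | grep -c "$oracle_home"
}
```

Run count_forms_procs "$ORACLE_HOME" once as the agent owner and once as root. If the agent owner gets 0 where root gets 1 or more, you are most likely looking at the same problem.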

Since the process is owned by a different user than the agent, the agent can't get the extended process info like root can. So to solve that, two things had to be done.

First, set up sudo privileges for your agent user to run /usr/ucb/ps as root:
gridagent ALL=(root) NOPASSWD: /usr/ucb/ps
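
Before touching the perl code it is worth confirming that the rule actually works without a password prompt, since the agent has no terminal to answer one. A minimal sketch, assuming your sudo build supports the -n (non-interactive) option, which older builds may not; the function name check_ps_sudo and the SUDO and PS_CMD overrides are hypothetical:

```shell
#!/bin/sh
# check_ps_sudo (hypothetical helper): succeed only if the current user
# can run the ps command through sudo without being asked for a password.
# SUDO and PS_CMD are assumed overrides, not part of any Oracle tooling.
check_ps_sudo() {
    sudo_bin="${SUDO:-sudo}"
    ps_cmd="${PS_CMD:-/usr/ucb/ps -axww}"
    # -n makes sudo fail instead of prompting when a password is needed.
    if "$sudo_bin" -n $ps_cmd >/dev/null 2>&1; then
        echo "ok: sudo runs '$ps_cmd' without a password"
    else
        echo "failed: check the sudoers entry for this user"
        return 1
    fi
}
```

Run check_ps_sudo as the agent owner. A "failed" result means the modified perl script would fail the same way.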

Second, modify the perl code in $AGENT_HOME/sysman/admin/scripts/ to use sudo.

Before:
elsif ( $os eq "SunOS" )
$PS = "/usr/ucb/ps -axww";

After:
elsif ( $os eq "SunOS" )
$PS = "sudo /usr/ucb/ps -axww";

Restart the agent, force an upload, and check all your metrics again. You can do that via the metric browser or the command line, or wait for grid control to register the target again.
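
For the restart and upload, something like the following should do. It is a sketch that assumes the agent's emctl (normally under $AGENT_HOME/bin) is on your PATH. The function name refresh_agent and the EMCTL override are hypothetical, and depending on your agent version the upload subcommand may need to be spelled "upload agent".

```shell
#!/bin/sh
# refresh_agent (hypothetical helper): restart the agent, then force an upload.
# EMCTL is an assumed override; by default emctl is taken from the PATH.
refresh_agent() {
    emctl_bin="${EMCTL:-emctl}"
    # Stop at the first step that fails so a broken restart is not masked.
    "$emctl_bin" stop agent &&
    "$emctl_bin" start agent &&
    "$emctl_bin" upload
}
```

Run refresh_agent as the agent owner, then give grid control a few minutes to pick up the new status.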

If anyone finds a metalink document that covers this exactly please add something in the comments.
