Wednesday, March 30, 2011

Grid control claims App Server 10.1.2 is down when console says it's up

I'm currently moving all of my targets from grid control repositories, and in the process I'm cleaning up anything that doesn't need to be there or that reports as being down. For some reason I like to see the fully green pie chart.

I had added one of my Oracle mid tiers that had a version of running to support our forms application, and grid control reported that the app server was down, but I knew for a fact it was not. All the subcomponents were reporting as up, opmnctl reported everything up, and the iasconsole on the server itself reported everything was up. In summary, it was up.

After much digging through Perl code and Metalink docs, I resolved the issue. It comes down to Solaris 10 security: the /usr/ucb/ps command cannot report extended process info unless you own the process (or are root). There was a pretty good Metalink document that put me on this trail, but it didn't solve the problem itself. Here's a Metalink ID roundup of what helped (or didn't, but was related).

395013.1 - Application Server shown as Down in Grid Control even though all of its Components are shown as up
276350.1 - How to Enable the Metric Browser/Agent Browser for the Oracle Management Agent [Video]

Enabling the metric browser was essential for this, in addition to running the perl scripts manually from the host.

So the problem ended up being this: the agent checks for forms processes on the host and reports back the sid.
Run this as your agent owner, and then also as root, to see the difference:
/usr/ucb/ps -axww | grep oc4j.jar | grep OC4J_BI_Forms | grep <$ORACLE_HOME>
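
To make the difference between the two runs easier to compare, the same filter can be wrapped in a small helper that just counts matching lines. This is only a sketch: the function name count_forms_procs is made up for illustration, and the instance name and Oracle home are passed in as arguments rather than hard-coded.

```shell
#!/bin/sh
# Count lines of `ps` output (read from stdin) that pass the same three
# grep filters as the command above.
#   $1 = OC4J instance name, e.g. OC4J_BI_Forms
#   $2 = Oracle home path of the mid tier
# count_forms_procs is a hypothetical helper, not part of the agent.
count_forms_procs() {
    grep oc4j.jar | grep "$1" | grep "$2" | wc -l | tr -d ' '
}

# Usage, once as the agent owner and once as root:
#   /usr/ucb/ps -axww | count_forms_procs OC4J_BI_Forms "$ORACLE_HOME"
# A lower count for the agent owner than for root points at the
# process-visibility problem described in this post.
```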

Since the process is owned by a different user than the agent, the agent can't get the extended process info like root can. To solve that, two things had to be done.

First, set up sudo privileges for your agent user to run /usr/ucb/ps as root:
gridagent ALL=(root) NOPASSWD: /usr/ucb/ps

Second, modify the Perl code in $AGENT_HOME/sysman/admin/scripts/ to use sudo. Change this:
elsif ( $os eq "SunOS" )
$PS = "/usr/ucb/ps -axww";

to this:
elsif ( $os eq "SunOS" )
$PS = "sudo /usr/ucb/ps -axww";
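
If you would rather not hand-edit the script, the change can be scripted with a backup copy kept alongside. This is a rough sketch only: add_sudo_to_ps is a made-up helper name, and the path of the script to edit is left as an argument because it depends on which script under $AGENT_HOME/sysman/admin/scripts/ holds the $PS assignment in your install.

```shell
#!/bin/sh
# Put "sudo" in front of /usr/ucb/ps in one Perl script, keeping the
# original as <script>.orig.
#   $1 = path to the script to edit (supplied by the caller)
# add_sudo_to_ps is a hypothetical helper, not something Oracle ships.
add_sudo_to_ps() {
    script=$1
    # Refuse to run twice so the pristine backup is never overwritten.
    if [ -f "$script.orig" ]; then
        echo "backup $script.orig already exists, not touching $script" >&2
        return 1
    fi
    cp -p "$script" "$script.orig" || return 1
    # Only the exact quoted string is rewritten; an already-patched line
    # ("sudo /usr/ucb/ps -axww") does not match the pattern.
    sed 's|"/usr/ucb/ps -axww"|"sudo /usr/ucb/ps -axww"|' "$script.orig" > "$script"
}
```

Afterwards, diff the script against its .orig copy to confirm only the SunOS line changed.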

Restart the agent, force an upload, and check all your metrics again. You can do that via the metric browser or the command line, or wait for grid control to register the target again.
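
From the command line, the restart and upload might look like the sketch below. It assumes AGENT_HOME is set to the agent's Oracle home and that the agent's emctl accepts "upload" with no further arguments; some agent versions want "emctl upload agent" instead, so check yours. The refresh_agent wrapper name is made up for illustration.

```shell
#!/bin/sh
# Restart the management agent and push pending data to the OMS.
# Assumes AGENT_HOME points at the agent's Oracle home.
# refresh_agent is a hypothetical wrapper name.
refresh_agent() {
    emctl_bin="$AGENT_HOME/bin/emctl"
    # Stop may fail if the agent is already down; carry on either way.
    "$emctl_bin" stop agent
    "$emctl_bin" start agent || return 1
    # Force an upload instead of waiting for the next scheduled one.
    "$emctl_bin" upload
}
```

Run it as the agent owner, then use "emctl status agent" to confirm the agent is up and the last upload succeeded.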

If anyone finds a Metalink document that covers this exactly, please add something in the comments.

Tuesday, March 8, 2011

OCM patch required even though you're not using OCM

While installing OAS 10gR3 patch 8626084 to upgrade OAS to, I received an error that step "Run One-off OPatches" had failed.

Looking at the log file, I found the following:

This is a OCM patch.
Home has OCM installed but not configured.

To run in silent mode, OPatch requires a response file for Oracle Configuration Manager (OCM).
Run /u01/app/oracle/product/oas10gr3/OPatch/ocm/bin/emocmrsp to generate an OCM response file. The generated response file can be reused on different platforms and in multiple OPatch silent installs.

To regenerate an OCM response file, rerun /u01/app/oracle/product/oas10gr3/OPatch/ocm/bin/emocmrsp.

ERROR: OPatch failed because of cmd. args. problem.

Doing a bit of digging, I found this recommendation to go ahead and run setupCCR in disconnected mode.

What's frustrating is that I requested that OCM not be installed or configured.

The following did the trick and I was able to complete the patch installation.

/u01/app/oracle/product/oas10gr3/ccr/bin/setupCCR -s -d