Monday, March 2, 2009

HangRMAN D_bug

Resolved an issue this afternoon, and in the process learned some things about debugging RMAN.

Issue description: RMAN would hang on exit. RMAN was slow in general on this server, but on exit it seemed to hang outright. I say "seemed to" because I left one exit hanging while I went to lunch, and an hour or two after I got back I noticed that the exit had completed successfully.

Interesting, no?

Here's what things looked like:

$ rman target /

Recovery Manager: Release - Production on Mon Mar 2 18:35:32 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: X (DBID=1)

RMAN> exit

Recovery Manager complete.

At that point RMAN would hang for 3 hours. My initial troubleshooting involved truss, lsof, and whatnot, in an attempt to pin the problem on our Avamar backup system. Nope.

Using the most excellent DocID 412950.1, titled "How to determine if an RMAN failure or hang lies in the Media Manager Layer or not", I found the following very easy way to trace runs of RMAN:

rman target / trace tracefilename.trc debug

You can also add catalog= to that command line. Anyhow, at the end of this trace file I found the following query, which appeared to hang:

select round(sum(MBYTES_PROCESSED)) ,round(sum(INPUT_BYTES)) ,round(sum(OUTPUT_BYTES))  from V$RMAN_STATUS  start with (RECID=:b1 and STAMP=:b1)  connect by prior RECID=parent_recid

The trace was nice enough to also provide the bind variables. Now the fun part: figuring out what to do.

Luckily, it was all spelled out in DocID 375386.1, "Rman Backup is Very Slow selecting from V$RMAN_STATUS".

The issue was resolved by refreshing fixed object stats.  Huzzah.    
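
For anyone wondering what "refreshing fixed object stats" looks like in practice, here is a minimal sketch. It assumes a SQL*Plus session connected as SYSDBA, and it is not copied from the note, so check the note for the exact steps for your version. The first query is just my own sanity check to see whether the fixed tables have ever been analyzed.

```sql
-- Assumes SQL*Plus, connected as SYSDBA (e.g. sqlplus / as sysdba).

-- Sanity check (my assumption, not from the note):
-- how many fixed (X$) tables currently have statistics?
select count(*) as analyzed_fixed_tables
from   dba_tab_statistics
where  object_type = 'FIXED TABLE'
and    last_analyzed is not null;

-- Gather statistics on the fixed objects that back views such as V$RMAN_STATUS.
exec dbms_stats.gather_fixed_objects_stats;
```

Gathering these stats can take a while, so it is probably worth running when the database is under a representative but not critical load.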

One more thing: section 22 of the 11g Database Backup and Recovery User's Guide, titled "Troubleshooting RMAN Operations", was valuable as well.
