Saturday, February 19, 2011

Work around ORA-600: internal error arguments: [3020], [3 ]– Oracle 11gr2 RAC – Active data-guard

Mitigate/Work around ORA-00600: internal error code, arguments: [3020], [3], [346523], [12929435], [], [], [], [], [], [], [], [] – Oracle 11gr2 RAC – Active data-guard

Let me brief the background, we have banking database with 2 node RAC in production and we have configured Active data-guard and it is fully utilized for reports database for every minute and we cannot effort down time for even for a minute.

The steps I followed were Inspired by “The Arup Nanda Blog: Resolving Gaps in Data Guard Apply Using Incremental RMAN Backup “ :  http://arup.blogspot.com/2009/12/resolving-gaps-in-data-guard-apply.html

A big thanks to Arup (I never interacted with him)

 

I will narrate the chain of events as they happened:

We use the following query to check for the sync:
Query-1
SQL> SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", (ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference" FROM (SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#))ARCH, (SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL WHERE ARCH.THREAD# = APPL.THREAD# ORDER BY 1;
    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                  44878                 44878          0
         2                  15739                 15739          0
My team mate noticed the difference has risen to 7 & 8 in the respective threads

I started investigating and look at the alert log and we had the ora-600 error and also looked at the following query:
Query-2
SQL> SELECT PROCESS, STATUS, THREAD#, SEQUENCE#,BLOCK#, BLOCKS FROM V$MANAGED_STANDBY;
(Sample query output not actual result)
PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK#     BLOCKS
--------- ------------ ---------- ---------- ---------- ----------
ARCH      CONNECTED             0          0          0          0
ARCH      CONNECTED             0          0          0          0
ARCH      CONNECTED             0          0          0          0
ARCH      CONNECTED             0          0          0          0
RFS       IDLE                                 0          0          0          0
RFS       IDLE                                1      44883      41973          1
RFS       IDLE                                0          0          0          0
RFS       IDLE                                0          0          0          0
RFS       IDLE                                0          0          0          0
RFS       IDLE                                0          0          0          0
RFS       IDLE                                2      15743     271919          1
RFS       IDLE                                0          0          0          0
MRP0      WAIT_FOR_LOG     2      15743          0          0
13 rows selected.
Noticed MRP0 process was missing in the actual result
Also verified in alert log:
Errors in file /u01/app/oracle/diag/rdbms/flexprdb/FLEXDRDB1/trace/FLEXDRDB1_mrp0_24466.trc:
ORA-00600: internal error code, arguments: [3020], [3], [346523], [12929435], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 3, block# 346523, file offset is 2838716416 bytes)
ORA-10564: tablespace UNDOTBS1
ORA-01110: data file 3: '+FLEXDATA/flexprdb/datafile/undotbs1.304.742240195'
ORA-10560: block type 'KTU UNDO BLOCK'
MRP0: Background Media Recovery process shutdown (FLEXDRDB1)

Here I had to do something which can be faster than oracle support response; then I recalled Arup’s blog and thought why not take chance...

As alert log messaged saying “ORA-10567: Redo is inconsistent with data block (file# 3, block# 346523, file offset is 2838716416 bytes).
The inference is somehow tablespace UNDOTBS1 having file 3: '+FLEXDATA/flexprdb/datafile/undotbs1.304.742240195' has an issue in recovery.”

Following were the steps I have taken:

1.   Taken RMAN backup of datafile 3 as mentioned in the alert log file from the production having a size of 50 MB backup peace it took around 5 minutes to take the backup.

2.   FTP to DR site it took around 5 minutes

3.    Cataloged the file directory for RMAN restore at Standby database side

4.   Shutdown the standby database by command  “shutdown immediate” and it was not shutting down (its almost 10 minutes) for some reason , I saw the following message:
Fri Feb 18 16:54:14 2011
SHUTDOWN: Active processes prevent shutdown operation
Probably I did not use the command “ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL” Still puzzling as MRP0 is down and its taking so much time to shutdown!!!

5.   Took decision to give command “SHUTDOWN ABORT”

6.   RMAN> STARTUP MOUNT;

7.   RMAN> RESTORE DATAFILE 3;
Alert log message:
Fri Feb 18 16:56:26 2011
ERROR: failed to establish dependency between database FLEXPRDB and diskgroup resource ora.FLEXDATA.dg
Fri Feb 18 17:01:14 2011
Full restore complete of datafile 3 +FLEXDATA/flexprdb/datafile/undotbs1.304.742240195.  Elapsed time: 0:04:05
checkpoint is 6808181049
last deallocation scn is 6808094618
Undo Optimization current scn is 6807046599
Fri Feb 18 17:01:24 2011
RFS[1]: Assigned to RFS process 8532
RFS[1]: Identified database type as 'physical standby': Client is ARCH pid 16697

8.   ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

9.   Executed the query-1 at SQL prompt I could see the archives were getting applied

10. Also validated with query-2 to find MRP0 is back again – I was very relieved for the success

11. As users were really depended on the active data-guard, I forced to use the following commands to give back the database to users for reports usage
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN
This process took more than 20-30 minutes as it had around 30 archives to be applied to make it consistent and then followed by:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGGILE DISCONNECT;
SQL> select open_mode from v$database;
OPEN_MODE
--------------------
READ ONLY WITH APPLY

Oracle Version:  Oracle RAC Active data-guard standby database
SQL>  select banner from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production

NLSRTL Version 11.2.0.1.0 - Production

I hope this will help somebody and I appreciate your feedback....

6 comments:

  1. Good Bhavani, I can Visualize how u did it... Thanks for sharing...

    ReplyDelete
  2. As per above note you told :- "I forced to use the following commands to give back the database to users for reports usage"

    as you are using 11G you would have used active dataguard:-

    recover managed standby database cancel;
    alter database open read only;
    recovery managed standby database disconnect using current logfile;

    ReplyDelete
  3. Thanks for the feedback and I agree with you.
    -Bhavani

    ReplyDelete
  4. Wow! Thanks, Bhavani for the reference to my blog post. I have to admit when I wrote it, I thought of only one use case - to refresh standby after a long desync. I didn't think about the other use cases such as redo entry being corrupt. That's a brilliant use of this incremental RMAN apply.

    ReplyDelete
  5. Good Bhavani, Thanks for sharing.

    ReplyDelete
  6. Hi Bhavani-I am trying to understand how the standby catches up with the primary after having restored the datafile in question. I mean what mechanism does it use to sync up after the datafile restore to get in sync with the Primary. Appreciate your feedback!

    ReplyDelete