The Eric S. Emrick Blog: February 2006

Tuesday, February 28, 2006

LogMiner and Dynamic Performance Views in Harmony

I want to revisit a topic covered in my paper The Oracle Redo Generation. More specifically, the notion of determining a transaction’s redo size from LogMiner output. A few folks have emailed me with questions so I think a clarification post is warranted, as I anticipate others might have similar questions.

Like I state in the paper, you can use v$mystat/v$sesstat dynamic performance views or LogMiner to discern the volume of redo a transaction generates. These two measures should always agree based on my testing. Let’s examine a very simple transaction that attempts to perform an insert into a test table t but fails on a unique constraint. I chose a statement that fails a constraint to exhibit the undo replay in the redo stream. We will list the redo size data from v$mystat before and after the transaction.

eric@10g:SQL> desc t
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 COL1                                      NOT NULL NUMBER
 COL2                                               NUMBER

eric@10g:SQL> select * from t;

      COL1       COL2
---------- ----------
1 1

eric@10g:SQL> select constraint_name, constraint_type, status, deferrable,
rely from user_constraints where table_name = 'T';

CONSTRAINT_NAME                C STATUS   DEFERRABLE     RELY
------------------------------ - -------- -------------- ----
SYS_C0010643                   P ENABLED  NOT DEFERRABLE


eric@10g:SQL> select 'REDO_START',
(select value from v$mystat where a.statistic#=statistic#) value
from v$statname a  where name = ('redo size');

'REDO_STAR      VALUE
---------- ----------
REDO_START          0

eric@10g:SQL> select 'ENTRY_START',
(select value from v$mystat where a.statistic#=statistic#) value
from v$statname a  where name = ('redo entries');

'ENTRY_STAR      VALUE
----------- ----------
ENTRY_START          0

eric@10g:SQL> insert into t values (1,1);
insert into t values (1,1)
*
ERROR at line 1:
ORA-00001: unique constraint (ERIC.SYS_C0010643) violated

eric@10g:SQL> select 'REDO_STOP',
(select value from v$mystat where a.statistic#=statistic#) value
from v$statname a  where name = ('redo size');

'REDO_STO      VALUE
--------- ----------
REDO_STOP        808

eric@10g:SQL> select 'ENTRY_STOP',
(select value from v$mystat where a.statistic#=statistic#) value
from v$statname a  where name = ('redo entries');

'ENTRY_STO      VALUE
---------- ----------
ENTRY_STOP          3

eric@10g:SQL> select sequence#, status from v$log;

 SEQUENCE# STATUS
---------- ----------------
       272 ACTIVE
       273 CURRENT
       271 INACTIVE

eric@10g:SQL> alter system switch logfile;

System altered.

eric@10g:SQL> select sequence#, archived from v$log where sequence#=273;

 SEQUENCE# ARC
---------- ---
       273 YES

eric@10g:SQL> connect / as sysdba
Connected.

sys@10g:SQL> EXECUTE DBMS_LOGMNR.ADD_LOGFILE(LogFileName =>
'/tmp/eric_273.arc', Options => dbms_logmnr.ADDFILE);

PL/SQL procedure successfully completed.

sys@10g:SQL> EXECUTE DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);

PL/SQL procedure successfully completed.

sys@10g:SQL> select scn,data_obj#,substr(operation,1,20) OP,
rbablk,rbabyte from v$logmnr_contents;

            SCN  DATA_OBJ# OP                       RBABLK    RBABYTE
--------------- ---------- -------------------- ---------- ----------
       10237995          0 START                         2         16
       10237995      64105 INSERT                        2         16
       10237995      64105 DELETE                        3         80
       10237996          0 ROLLBACK                      3        248

sys@10g:SQL> alter system dump logfile '/tmp/eric_273.arc';

System altered.

sys@10g:SQL> exit

Okay, let’s review what we see. The v$mystat dynamic performance view claims that 808 bytes of redo has been generated from this failed transaction. Moreover, v$mystat claims our user session has created 3 redo entries. Let’s reconcile this information with the output from LogMiner. The only change vectors in the mined file are those associated with my transaction as I was the only user on the database. I switched into the mined log before starting my transaction and switched out immediately after.

The test platform OS block size is 512 bytes. Therefore, each redo block (RBABLK) is comprised of 512 bytes. The RBABYTE (OFFSET) value represents the offset into the RBABLK where a particular redo record begins. The first 16 bytes of redo in each RBABLK are reserved for header information. This limits the actual space for transaction redo entries to 496 bytes per RBABLK.

If we walk through each entry we should be able to calculate the 808 bytes of redo v$mystat claims my session created. The START and INSERT redo records are the very same redo record. LogMiner simply gives us the START line for our benefit. It comes in handy when reviewing a log file with many transactions. This means we should not double count (indeed even hard to do given the method of calculation) the redo, but only consider the INSERT redo record as our “redo start” position. LogMiner tells us that the INSERT statement consumes redo from RBABLK=2 OFFSET=16 and spans through RBABLK=3 OFFSET=79. To calculate the redo owed to the INSERT redo record we add the redo owed to the INSERT in RBABLK=2 to that in RBABLK=3. That is, 496 bytes for RBABLK=2 and (80-16=64) bytes for RBABLK=3. Therefore, the total redo attributed to the INSERT redo record is 496 + 64 = 560 bytes.

Why do we see a DELETE operation? Well, remember the INSERT failed a uniqueness constraint and needed to be rolled back. The DELETE operation is the undo portion of this transaction. It is easy to see that the DELETE operation accounts for only (248-80=168) bytes. This means our INSERT and DELETE operations comprise 560 + 168 = 728 bytes. We can then deduce that the ROLLBACK entry comprises the remaining 80 bytes. However, let’s take a slightly different approach so that we can be 100% certain our ROLLBACK marker is indeed 80 bytes.

I used a Solaris egrep command to scrape the log file dump and get at the redo record sizes.

/opt/oracle/admin/10g/udump # egrep 'REDO RECORDCHANGE' 10g_ora_10436.trc
REDO RECORD - Thread:1 RBA: 0x000111.00000002.0010 LEN: 0x0230 VLD: 0x05
CHANGE #1 TYP:0 CLS:17 AFN:2 DBA:0x00800009 OBJ:4294967295 SCN:0x0000.009c3824 SEQ:  1 OP:5.2
CHANGE #2 TYP:0 CLS:18 AFN:2 DBA:0x00800a7f OBJ:4294967295 SCN:0x0000.009c3823 SEQ:  1 OP:5.1
CHANGE #3 TYP:0 CLS: 1 AFN:4 DBA:0x01000056 OBJ:64105 SCN:0x0000.009c3805 SEQ:  2 OP:11.2
CHANGE #4 MEDIA RECOVERY MARKER SCN:0x0000.00000000 SEQ:  0 OP:5.19
REDO RECORD - Thread:1 RBA: 0x000111.00000003.0050 LEN: 0x00a8 VLD: 0x01
CHANGE #1 TYP:0 CLS: 1 AFN:4 DBA:0x01000056 OBJ:64105 SCN:0x0000.009c382b SEQ:  1 OP:11.3
CHANGE #2 TYP:0 CLS:17 AFN:2 DBA:0x00800009 OBJ:4294967295 SCN:0x0000.009c382b SEQ:  1 OP:5.11
REDO RECORD - Thread:1 RBA: 0x000111.00000003.00f8 LEN: 0x0050 VLD: 0x01
CHANGE #1 TYP:0 CLS:17 AFN:2 DBA:0x00800009 OBJ:4294967295 SCN:0x0000.009c382b SEQ:  2 OP:5.4

Each redo record has a LEN attribute denoting the length of the redo record in hexadecimal format. The redo byte address or RBA is also given. This can be used to cross reference the RBABLK and RBABYTE information from the LogMiner output to ensure we are getting at the correct data. The redo byte address contains the log sequence#, redo block number and the redo byte offset. Note the first redo record is positioned at redo block 0x2=2 and redo byte 0x0010=16. This corresponds with the first redo record from the LogMiner output RBABLK=2, RBABYTE=16. The LEN attribute for this record is 0x0230=560. This also matches the 560 bytes calculated via the LogMiner output for the INSERT redo record. The next redo record is DELETE operation with 0x00a8=168 bytes. The last redo record at redo block 0x3=3 and redo byte 0x00f8=248 is the ROLLBACK entry. The LEN attribute for this redo record is 0x0050=80. The ROLLBACK redo record is now shown to be 80 bytes that we deduced from LogMiner.

One final comment on this topic, notice there are 3 redo records that constitute this transaction’s redo footprint. I displayed the redo entries statistic before and after the transaction to show that a redo record is synonymous with a redo entry catalogued in the v$mystat/v$sesstat/v$sysstat dynamic performance views. We hear a lot about redo records and change vectors when Oracle redo is mentioned in documentation. Redo records are comprised of one or more change vectors. An SCN is associated with a redo record and not the individual change vectors directly. Additionally, many redo records can have the same SCN.

I apologize for this being such a cryptic read. I promise I perpetually strive to eschew obfuscation ;-)

Friday, February 24, 2006

The Pleasure of Finding Oracle Things Out

Yesterday I was speaking with a colleague about various technical topics (as we frequently do) and he brought up a good question. He remembered reading that upon the issuance of the ALTER TABLESPACE END BACKUP command Oracle creates a redo record with the begin backup SCN and he wanted to know why Oracle needed to put this SCN in the redo record? More specifically, why doesn’t Oracle create the end backup marker in the redo stream with just the file number? Then during recovery, Oracle could simply read this redo record and consider all “hot-backup-necessary” redo for that file applied?

That seemed like a fair question. I told him that I had seen this begin backup SCN in a log file dump associated with an END BACKUP statement. So, I knew the information he read is correct. However, I could not immediately give him an answer for this mechanic. Of course, the fact that I could not initially get my mind around the mechanic and formulate a response other than, “I don’t really know” bothered me. I assume it is fairly common knowledge that all redo between the BEGIN BACKUP and END BACKUP commands is required for recovery; every good backup and recovery guide tells us this. But, I really didn’t know why the begin backup SCN was required in the redo record created from the END BACKUP statement.

I took pen to paper and reviewed what I knew of the END BACKUP statement.

When issued it tells Oracle to unfreeze the frozen section of the header. Notice I said section. The entire header of each affected file is not immutable during the course of a user-managed hot backup. For instance, the hot backup SCN and checkpoint counter in the header is moving in accord with the database changes to the file(s).
Oracle updates the master checkpoint SCN in the file header and control file with the hot backup SCN (the current SCN for the file). Remember a file in backup mode is still subject to normal modifications, checkpoints, etc.
The status in the file header goes from 0x1 (hot backup mode) to 0x4 (normal online mode).
A redo record for each file in the tablespace is created with the file number and begin backup SCN.
During recovery after the end backup redo record is encountered the file in question has its status updated from 0x1 (hot backup mode) to 0x0 (consistent mode).

Here is an example of a redo record created for a file affected by the END BACKUP statement. I have bolded the begin backup SCN.

REDO RECORD - Thread:1 RBA: 0x000030.00000002.0010 LEN: 0x0038 VLD: 0x01
SCN: 0x0000.004b0b57 SUBSCN: 1 02/23/2006 15:59:59
CHANGE #1 MEDIA RECOVERY MARKER SCN:0x0000.00000000 SEQ: 0 OP:17.1
End backup marker - file:3 scn: 0x0000.004b0b4c

Okay, let me return to my colleague’s question. Why does Oracle place the begin backup SCN in the redo record and not just the file number? Consider the following:

At time y we take a complete hot backup of our production database. Of course we cannot instantaneously take a backup at a single point in time as it takes some time to get all tablespaces in hot backup mode. But, for the sake of this example let y be the “time” we associate with our most recent backup of the database. We also have complete hot backups of our database at times y-n, and y-2n where n is a day, week or some other arbitrary time interval.

What happens if we restore all but one of our files from the backup taken at y. Maybe, we had a corrupt version restored for one of the files and we had to restore a copy from the y-2n backup. That is, we have one file restored from y-2n and the rest from y. Our recovery would start using log files from y-2n as that is the oldest file in our restored database. In our redo stream we have redo records for the END BACKUP commands for the files backed up during the y-2n backup. Remember, each of these redo records contains this mysterious begin backup SCN. While rolling forward Oracle would hit these redo records from the y-2n complete hot backup. What would happen if Oracle simply put the file number in the redo record for the END BACKUP command? It would immediately update the file headers to 0x0 (file is consistent and needs no further “hot-backup-necessary” redo) for the files backed up at y. Well, this would cause major issues. The recovery would stop after all redo records containing the end backup marker from the y-2n backup had been encountered. However, Oracle provided another level of protection from this scenario, the begin backup SCN.

From the previous scenario, with the begin backup SCN embedded in the redo record for each file, Oracle would continue to recover through the necessary redo. It would consider a file consistent only when it found the redo record for the end backup statement that contained the begin backup SCN that corresponds with the begin backup SCN stored in the file header. The begin backup SCN is a portion of the frozen section of the file header when the BEGIN BACKUP command was issued.

All is well, now that it feels I have gotten my mind around this one.

Sunday, February 19, 2006

Oracle, It Leads By Example (a small rant)

Reliable

1 : suitable or fit to be relied on : dependable
2 : giving the same result on successive trials

“To err is human...”

What is really meant by this? Is it a conscious awareness and blatent admission by humanity that we are less than perfect? Is it the standard appeal to the listener’s pathos? From a reliability perspective maybe we are saying humans are only N nines reliable. Wouldn’t that be an interesting metric to use when interviewing a candidate for a DBA position? I think so.

Interviewer: “Great. Great. You have a nice understanding of latching and appear to have a good grasp on diagnosing systemic and localized database performance issues. Very impressive.”

Interviewee: “Thank you very much. I read a lot of technical material. It keeps me sharp and fresh.

Interviewer: “However, I must admit there is one issue.”

Interviewee: “Oh. Okay. What is it?”

Interviewer: “Yes, it says here on your reliability report that you are only 2 nines reliable.”

Interviewee: “Dangit! You get those reports?”

Well, as it turns out, today I had an experience in my non-occupational life where a person was less than reliable. Admittedly, reliability is a pet peeve of mine. I couldn’t help but draw a quick analogy for my wife when expressing my dissatisfaction. I told her that sometimes I wish people were as dependable as the Oracle databases I support. She laughed of course. But, I meant it! Hmmm, maybe that is why she laughed.

The Oracle relational database management system is indeed a very reliable and dependable thing. Sure Oracle has imperfections we like to call bugs and it is certainly not perfect, but if left to its own devices it is an amazingly reliable and resilient system. I have, on several occasions in my career, seen the Oracle RDBMS “outlast” the OS on which it operates. For example, I can recall a simple reboot that eliminated systemic latch contention. However, this measure was only taken AFTER the database had been bounced several times; each time the database would escalate to the same condition. This is probably a good topic for a separate post.

Consider for a moment the complexity of the Oracle RDBMS. In Oracle10g it can manage your files and storage automatically (ASM), essentially a file system and a volume manager. Wait, it gets better. It can also automatically distribute the files in an attempt to balance I/O for optimal performance. Moreover, while tending to our storage needs it simultaneously performs the vastly more complicated tasks of memory management, inter-process communication, queue management and resource concurrency through an extremely intricate binary dance of code and control structures. All this is required for me to simply ask the database a question. Just crack open the Oracle Concepts Guide and you’ll soon be amazed at the high-level management performed by the system. There is no need swimming through the thousands of pages comprising the complete documentation set to get the gist. This mountain of complexity notwithstanding, the Oracle database is incredibly reliable and can weather the punishment of the most poorly written applications for weeks, months or even years.

In all fairness to humanity we are systems that are orders of magnitude more complex than the Oracle RDBMS. The ability to predict the behavior of a system, I suspect, is increasingly inhibited by the complexity therein, e.g. quantum mechanics. I am sure this sentiment is not lost on your local weather authority. So maybe I shouldn’t hold people to a reliability standard analogous to that of the Oracle RDBMS. But, in which release will Oracle fix my furnace?

By the way, “…to forgive is divine.”

Thursday, February 16, 2006

Reliable Hot Backups Without Oracle Intervention?

In most cases when we need to take a hot backup of an Oracle database we employ either the user-managed hot backup approach using the ALTER TABLESPACE BEGIN BACKUP facility or we simply take an RMAN backup. Of course, there are measures that can be taken to hot backup your database using hardware mirrors by enlisting a combination of the ALTER SYSTEM SUSPEND, ALTER TABLESPACE BEGIN BACKUP commands and vendor specific mirror control interfaces. However, each of these approaches has a downside.

ALTER TABLESPACE BEGIN BACKUP

1. Redo generation overhead and potential redo infrastructure contention.
2. Backup processing consumes production server resources.
3. To minimize performance impacts to production iterating through your tablespaces can take quite some time.

RMAN

1. Backup processing consumes production server resources.
2. Careful planning to ensure the RMAN metadata is maintained, i.e. frequent control file and/or RMAN catalog backups.
3. Additional layer of backup and recovery abstraction.

ALTER SYSTEM SUSPEND and Hardware Mirrors

1. Redo overhead associated with the recommended ALTER TABLESPACE BEGIN BACKUP.
2. All tablespaces must be in backup mode prior to splitting; once again redo overhead.
3. Suspension of all I/O causing immediate outages to application processing until your mirror is split. While this suspend/split might only take minutes, some 24x7x365.25 applications cannot afford zero application activity on a nightly or semi-nightly basis.
4. The cost of a third mirror. Ideally we don’t want to split our primary mirror as we immediately become susceptible to media failure during the time the split mirror is out of sync with your production database.

It would be nice to avoid as many of these side effects as possible in your backup strategy. That is, could we take a "hot backup" that 1) Does not require our tablespaces to be placed in hot backup mode 2) Does not required ALL application data to be suspended during a portion of the backup phase and 3) does not require another layer of backup and recovery abstraction via additional Oracle metadata maintenance?

Yes. EMC has a really good suite of products in their TimeFinder [tm] solution set that can accommodate. Within the TimeFinder solution set is the notion of a Composite Group [tm]. The Composite Group can be defined for the set of primary devices comprising your production database paired with Business Continuance Volumes (BCV), or simply, software controlled mirrors. When you synchronize your primary devices with their respective paired BCV devices and initiate a consistent split operation the EMC subsystem suspends WRITE operations to your primary devices only for the duration of the split operation. However, the split is for a consistent point in time and typically takes less than a few seconds. In recent versions of the TimeFinder product read activity is permitted to flow. This differs from the ALTER SYSTEM SUSPEND approach in that Oracle cannot guarantee the immediate termination of I/O and the read activity is necessarily suspended. The end result of this consistent split is an “aborted” database on the mirror.

How do we convert this “aborted” image of the database on the mirror to a database that is meaningful for backup and recovery? Startup the database in mount mode and issue the RECOVER DATABASE command. Remember, the online redo logs are preserved in the synchronization process to the same point in time as the data files and control files. The RECOVER DATABASE command makes the database consistent with respect to a single point in time. As a matter of fact, the file headers have consistent stop SCNs and a file status flag of(0x0). Consequently, your mirror taken while the database was up and functional (1-3 seconds of write suspension) has been converted to a consistent backup. Please note, if you intend to backup your database(s) from the mirror devices, you cannot open the database on said devices as that would rollback transactions that, in production, might have been committed. Naturally, there are license fees associated with TimeFinder and the third mirror needs to be purchased, among other considerations. But these additional expenditures can be evaluated against the benefits of the solution to determine if the product is worth the investment.

This stuff really works and is very reliable!

Thursday, February 09, 2006

ORA-00001.5: unique constraint NOT violated

Recently, I came across a very interesting “feature” in Oracle – prescience. Yes indeed, this newly discovered and bizarre behavior of Oracle borders on precognition. Just when I thought Oracle only knew how to operate in the now, and how to deal with the past, it can tell the future! Okay already, enough with the melodrama and on to the experience.

I was testing a particular multi-row update scenario in an attempt to discern Oracle’s approach to managing the redo. I used LogMiner and a log file dump to help understand this mechanic. The test involved a heap table with two columns and a primary key enforced using a unique index. The table was loaded with ten rows.


eric@erice:SQL> desc parent_table_1
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
COL1                                      NOT NULL NUMBER
COL2                                               NUMBER

eric@erice:SQL> select * from parent_table_1;

      COL1       COL2
---------- ----------
         0          1
         1          1
         2          1
         3          1
         4          1
         5          1
         6          1
         7          1
         8          1
         9          1

10 rows selected.

eric@erice:SQL> select index_name from user_indexes where table_name = 'PARENT_TABLE_1';

INDEX_NAME
------------------------------
PK_PARENT_TABLE_1

eric@erice:SQL> select constraint_name, constraint_type, deferrable, validated, rely, index_name
from user_constraints where table_name = 'PARENT_TABLE_1';


CONSTRAINT_NAME                C DEFERRABLE     VALIDATED     RELY INDEX_NAME
------------------------------ - -------------- ------------- ---- ----------------------
PK_PARENT_TABLE_1              P NOT DEFERRABLE VALIDATED          PK_PARENT_TABLE_1

I wanted to understand the redo mechanics as it related to constraint validation for a multi-row update against PARENT_TABLE_1 using the following simple update statement:


eric@erice:SQL> update parent_table_1 set col1=col1+1;

10 rows updated.

eric@erice:SQL> commit;

Commit complete.

eric@erice:SQL> select * from parent_table_1;

      COL1       COL2
---------- ----------
         1          1
         2          1
         3          1
         4          1
         5          1
         6          1
         7          1
         8          1
         9          1
        10          1

10 rows selected.

eric@erice:SQL> select * from table(dbms_xplan.display());

------------------------------------------------------------------------
 Id   Operation              Name            Rows   Bytes  Cost
------------------------------------------------------------------------
   0  UPDATE STATEMENT
   1   UPDATE               PARENT_TABLE_1
   2    TABLE ACCESS FULL   PARENT_TABLE_1
------------------------------------------------------------------------

I must admit, when I saw all rows updated sans unique constraint violation I was taken aback. This statement should have failed after the first row was processed, right? I ran the test again with the same results. I then interrogated the data dictionary looking for some database condition that might rationally explain this behavior. I was left staring at the screen having yielded my intuition to what must be a quirk, some kind of bug or my ignorance. How did Oracle know that my statement would eventually yield a set of unique keys? Not all was lost. Oracle couldn’t hide the redo!

Log file dumps are not pretty and take a couple “reads” before you get a handle on what is happening. To display all the applicable dump here would consume more space than desired. Thankfully, LogMiner can format much of the information for us.


eric@erice:SQL> select scn,rel_file#,data_blk#,data_obj#,substr(operation,1,20) OP from v$logmnr_contents;

            SCN  REL_FILE#  DATA_BLK#  DATA_OBJ# OP
--------------- ---------- ---------- ---------- --------------------
        4571216          0          0          0 START
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          3        554       6480 UPDATE
        4571216          0          0       7149 INTERNAL
        4571216          0          0       7149 INTERNAL

eric@erice:SQL> select object_name, object_id from user_objects where object_id in ('6480','7149');

OBJECT_NAME                     OBJECT_ID
------------------------------ ----------
PK_PARENT_TABLE_1                    7149
PARENT_TABLE_1                       6480

From the LogMiner data we can see that Oracle performs alternating modifications on PARENT_TABLE_1 and PK_PARENT_TABLE_1 with two consecutive index modifications at the end. I took the liberty to “scrub” the log file dump using a Solaris egrep command. Annotations were added for the first few entries and last few entries to depict the pattern.


> egrep '^colindex redokeydatakey :' log_file.dmp
col  0: [ 1]  80    (table undo) 80 == 0 in the log file dump.
col  0: [ 2]  c1 02    (table redo) change 0 -> 1
key :(2):  01 80        (index undo) Index key value of 0.
keydata/bitmap: (6):  00 c0 02 2a 00 00   (index undo) old rowid data for key 0.
index redo (kdxlde):  delete leaf row    (index redo)
col  0: [ 2]  c1 02    (table undo) old value is 1
col  0: [ 2]  c1 03    (table redo) new value is 2
update keydata in row    (index undo)
key :(3):  02 c1 02    (index undo) Index key value of 1.
keydata/bitmap: (6):  00 c0 02 2a 00 01  (index undo) rowed data for key 1.
index redo (kdxlpu): update keydata, count=3 (index redo) update keydata operation
keydata : (6):  00 c0 02 2a 00 00  (index redo) new rowid data for key 1.
col  0: [ 2]  c1 03    (table undo) old value is 2.
col  0: [ 2]  c1 04    (table redo) new value is 3.
update keydata in row    (index undo)
key :(3):  02 c1 03    (index undo) Index key value of 2.
keydata/bitmap: (6):  00 c0 02 2a 00 02  (index undo) rowid data for key 2.
index redo (kdxlpu): update keydata, count=3 (index redo) update keydata operation
keydata : (6):  00 c0 02 2a 00 01  (index redo) new rowid data for key 2.
col  0: [ 2]  c1 04    (table undo) old value is 3.
col  0: [ 2]  c1 05    (table redo) new value is 4.
update keydata in row
key :(3):  02 c1 04
keydata/bitmap: (6):  00 c0 02 2a 00 03
index redo (kdxlpu): update keydata, count=3
keydata : (6):  00 c0 02 2a 00 02
col  0: [ 2]  c1 05
col  0: [ 2]  c1 06
update keydata in row
key :(3):  02 c1 05
keydata/bitmap: (6):  00 c0 02 2a 00 04
index redo (kdxlpu): update keydata, count=3
keydata : (6):  00 c0 02 2a 00 03
col  0: [ 2]  c1 06
col  0: [ 2]  c1 07
update keydata in row
key :(3):  02 c1 06
keydata/bitmap: (6):  00 c0 02 2a 00 05
index redo (kdxlpu): update keydata, count=3
keydata : (6):  00 c0 02 2a 00 04
col  0: [ 2]  c1 07
col  0: [ 2]  c1 08
update keydata in row
key :(3):  02 c1 07
keydata/bitmap: (6):  00 c0 02 2a 00 06
index redo (kdxlpu): update keydata, count=3
keydata : (6):  00 c0 02 2a 00 05
col  0: [ 2]  c1 08
col  0: [ 2]  c1 09
update keydata in row
key :(3):  02 c1 08
keydata/bitmap: (6):  00 c0 02 2a 00 07
index redo (kdxlpu): update keydata, count=3
keydata : (6):  00 c0 02 2a 00 06
col  0: [ 2]  c1 09
col  0: [ 2]  c1 0a
update keydata in row
key :(3):  02 c1 09
keydata/bitmap: (6):  00 c0 02 2a 00 08
index redo (kdxlpu): update keydata, count=3
keydata : (6):  00 c0 02 2a 00 07
col  0: [ 2]  c1 0a    (table undo) old value is 9.
col  0: [ 2]  c1 0b    (table redo) new value is 10.
update keydata in row    (index undo)
key :(3):  02 c1 0a    (index undo) Index key value of 9.
keydata/bitmap: (6):  00 c0 02 2a 00 13  (index undo) rowid data for key 9.
index redo (kdxlpu): update keydata, count=3 (index redo) update keydata operation.
keydata : (6):  00 c0 02 2a 00 08  (index redo) new rowid data for key 9.
key :(3):  02 c1 0b    (index undo) Index key value of 10.
index redo (kdxlin):  insert leaf row  (index redo)
keydata: (6):  00 c0 02 2a 00 13  (index redo) new rowid data for key 10.

As you can see, Oracle has taken the following approach in managing the index while performing this update.

1. Update table data (col0) for row 1 from 0 to 1.
2. Delete leaf row for index entry with key value of 0.
3. Update table data (col0) for row 2 from 1 to 2.
4. Update the key data for the index entry that has a key value of 1 with the rowid of the row updated by step 1.
5. Update table data (col0) for row 3 from 2 to 3.
6. Update the key data for the index entry that has a key value of 2 with the rowid of the row updated by step 3.
....
Continue associating rowid values for the index entries that correspond to the new table values until the final row is to be updated.
....n. After the last row has been updated to a key value that is not already represented by a current index entry, insert a leaf row with the address of the table row updated by step n-1.

Viola! Oracle prevented index key value collisions by taking special measures. This explains what Oracle has done and how Oracle has accomplished it. But, how did Oracle know to do this? Certainly the execution plan for this statement does not give us any insight into Oracle’s index maintenance stategy. It is doing a full table scan on the test table as expected and index maintenance is under the covers. However, this does not mean that the code does not behave differently when parsed under different “conditions.” For example, what if there was not a unique index on COL1, but instead a non-unique index and no primary key constraint. Would that change Oracle’s approach to index maintenance? It turns out that it does indeed. If the conditions are arranged as just mentioned, Oracle performs its "typical" index update maintenance (delete leaf row followed by an insert leaf row) taking no special measures to account for potential index key collisions. Why should it? There isn’t any data condition (i.e. unique index) to tell Oracle to take any precautions.

Wait, have we answered the question “how did Oracle know to do this”? It certainly appears so. Oracle was aware that I was attempting to update the key value associated with a unique index and took an optimistic approach to the index maintenance; giving me the benefit of the doubt that the resultant index entries would still be unique. Oracle did not evaluate my SQL statement and devise an index maintenance plan that was “friendly” based on the excellent prospect that COL1=COL1+1 would ultimately generate a set of unique entries. To verify this I tested the same behavior with the update condition COL1=mod(COL1,2) and Oracle took the same "friendly" approach. That is, Oracle updated 5 table rows before failing. It wanted to accommodate but apparently a tolerance threshold was met. The index and table maintenance in this test was the following:

1. No action was required on index entries for the first two table row modifications. Oracle was okay with modifying 0->0 and 1-> 1 without affecting the index. That is pretty cool!
2. On the third row COL1 was updated in the table to a value of 0. Well, that was a collision so Oracle only deleted the leaf row associated with the key value of 2 (third table row).
3. On the fourth row COL1 was updated in the table to a value of 1. Once again this was a collision so Oracle only deleted the leaf row associated with the key value of 3 (fourth row).
4. On the fifth row COL1 was updated in the table to a value of 0. On this collision Oracle has seemingly "had enough" and rolled back the transaction after failing with an ORA-00001.

In summary, Oracle has simply taken the code path associated with maintaining a unique index given the current execution plan. No fortune telling or precognition, just good schema awareness.

Tuesday, February 07, 2006

A Brief History of Oracle Time

Recently, while in my vehicle on the way to work I was lulled into a deeper-than-usual thought stream thanks to a clog of traffic. It was then that I entered into an Oracle cosmological musing. I was pondering the notion of time inside Oracle. Outside of the Oracle database the idea of concurrency, and thusly simultaneity, within the database appear abundantly unambiguous. That is, from the outside different transactions within the same database can affect change on the database simultaneously. In the world outside of Oracle the idea of simultaneity outside of Oracle is relative. My chronicle on the time an event occurs, such as typing this article, might differ substantially from another’s due to my position, motion, etc. We can thank Albert Einstein for this discovery. However, Oracle’s notion of its time and simultaneity is a bit different.

Consider the fact that Oracle only chronicles changes to the database. Yes, even changes created by queries, e.g. delayed block cleanout. Therefore, we are only concerned with Oracle time as it relates to modifications. It we were to rewind Oracle time, via a database restoration, and then roll Oracle time forward we would only see the affects of change being replayed.

How is change affected in the database? We know that all transactions contend for the lone redo allocation latch, per redo thread, to affect their respective changes. Until the redo allocation latch has been acquired a transaction cannot produce change and therefore cannot make any contribution to Oracle time. Once the redo allocation latch has been acquired a transaction can then contribute to Oracle time, its position in Oracle time henceforth solidified. After getting some external time on a redo copy latch and copying its change into the redo buffer the transaction has made an indelible contribution (barring instance failure) to time in Oracle. Although the change is forever imprinted in Oracle time it does not mean the change will be perpetually visible, e.g. an uncommitted change during an instance crash.

How is Oracle time measured? Every change made to the database is chronicled using a system change number (SCN). The SCN is Oracle’s time quantum. We cannot recover to a position in Oracle time to a more granular point than a specific SCN. Is an SCN unique to a particular change? Not necessarily. Many changes from distinct transactions can comprise a single SCN. So, simultaneity in Oracle can be defined as the set of all redo records that are recorded for a single SCN. Therefore, simultaneity in Oracle time is not relative as your change either occurred at an absolute point in Oracle time T or it did not.

How does Oracle map external time to its internal time? What are we instructing Oracle to do when we tell it to recover to a particular point in external time? You can view external time as an index into Oracle time. Each redo record has an external time associated with it. If I want Oracle to recover to an external point in time E, it resolves E to an SCN such that this SCN has the highest external timestamp that precedes E.

Notice, Oracle time is not a continuum. It cannot be arbitrarily divided into equal units. Relativistic time can be arbitrarily divided into seconds, minutes, hours, days, etc. Oracle time is discontinuous because it is necessarily divided along the bounds of each indivisible SCN. Oracle time is analogous to a movie being played at your local theatre. Each frame on the tape can be the analog to an Oracle SCN. For two successive frames many changes on the movie screen are possible. Some frames can contain more change (altered "pixels") than the previous frame. For example, a man pauses to think versus children riding bikes and people crossing streets. So the next time you sit down with a tub of popcorn at the movies whisper to your friend that what he/she is observing is analogous to Oracle time. Be prepared for a strange look in return :)

Thursday, February 02, 2006

Backup Control Files - Are they special?

I read an article today that talked about backup control files. Typically, DBAs don’t deal with backup control files on a daily basis so I thought I would take a look. This article asserts that Oracle requires a RESETLOGS operation after performing recovery using a backup control file because –

Oracle needs to update certain structures in the backup control file before opening the database

I certainly agree that Oracle will update the control file when opening the database and resetting the logs. But, let's try to understand what is updated and why. Before we talk about the criticality of the RESETLOGS post-recovery, using a backup control file, let’s try to define what a backup control file really is. Better yet, let’s first list what a backup control file is NOT:

· An OS image of the control file taken while the database is closed.
· An OS image of the control file taken while the database is open (I just shuddered).
· An OS image of the control file taken while the database is mounted.
· A trace backup of the database control file (not unless a can of tomato sauce, some dough and grated cheese is a pizza).

A backup control file is an image of the database control file. But this image possesses very distinctive qualities.

· It contains a file type flag (value=4) that tells Oracle it is a backup control file.
· It is consistent with respect to a single point in time – thus the file type flag for Oracle’s benefit.
· The Stop SCN markers for each data file record are set to 0xffff.ffffffff. To you and I this is hexadecimal for “not available.”

These qualities assure Oracle that the integrity of the backup control file has been preserved. If a database is open and an OS image is taken of the control file it will most likely be corrupt. In contrast, when a backup control file is created using the ALTER DATABASE BACKUP CONTROLFILE TO ‘’ command (or via RMAN) the acquisition of the control file enqueue ensures the file is static during the operation. The file type flag in the backup control file tells Oracle that it cannot (and should not) rely on its redo thread, checkpoint progress and log file records. By instructing Oracle to recover using a backup control file you are telling Oracle to avoid the redo thread records and log file records in the control file – and for good reason.

Well, what about those control file backups to trace – can you create a backup control file using the SQL dumped to trace? Yes. Simply instruct Oracle to RESETLOGS in the CREATE CONTROLFILE statement. The resultant control file will have the same file type flag setting as the backup control file created using RMAN or ALTER DATABASE BACKUP CONTROLFILE TO ‘’ command. You can get at this file type flag data using the oradebug command.

SQL> oradebug setmypid;
SQL> oradebug dump controlf 10;

Be careful during your recovery because you can very easily convert your unsuspecting current control file into a backup control file – destroying the transactional ancestry of the database via the requisite RESETLOGS operation. Consider the following:

sys@erice:SQL> shutdown
Database closed.
Database dismounted.
ORACLE instance shut down.
sys@erice:SQL> startup mount;
ORACLE instance started.

Total System Global Area 110070328 bytes
Fixed Size 731704 bytes
Variable Size 88080384 bytes
Database Buffers 20971520 bytes
Redo Buffers 286720 bytes
Database mounted.
sys@erice:SQL> recover database using backup controlfile;
ORA-00279: change 4379115 generated at 02/01/2006 22:41:36 needed for thread 1
ORA-00289: suggestion : /opt/oracle/dba/erice/erice_arch_1.dbf
ORA-00280: change 4379115 for thread 1 is in sequence #1


Specify log: {=suggested filename AUTO CANCEL}
cancel
Media recovery cancelled.
sys@erice:SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open


sys@erice:SQL> recover database;
ORA-00283: recovery session canceled due to errors
ORA-01610: recovery using the BACKUP CONTROLFILE option must be done

By issuing recovery using a backup control file I have actually converted the current control file to a backup control file. This does not mean that I could not completely recover this database. All I had to do was issue RECOVER DATABASE USING BACKUP CONTROLFILE and supply the appropriate (preserved) online redo log(s).

Back to the original question – why does Oracle require a RESETLOGS operation post-recovery using a backup control file? Within each control file there are checkpoint progress records that provide Oracle with the on-disk RBA (redo byte address). This tells Oracle the offset into the current redo log file(s) that the LGWR has flushed the redo thread.

Let’s suppose Oracle permitted you to open the database without a RESETLOGS after recovery using a backup control file. Then Oracle must use what it knows to be true about its online redo thread: the log file records, redo stream records and checkpoint progress records. If the redo log files remained intact during your restore and recovery then they probably have previously generated redo. If you have 3 online redo log groups and your backup control file states that the on-disk RBA is half way into group 2 and the existing (non-reset) redo log for group 2 is full of previously generated redo, Oracle would start writing redo to its last know on-disk RBA. Hopefully you see where this is going. Potentially, the last half of redo written to group 2 would correspond to changes post-recovery, i.e. you would have a redo log that is corrupt. There could be a major gap in SCNs between the last entry pre-recovery and the first entry post-recovery within the same redo log file. For this reason, Oracle needs to update the redo thread, log file and checkpoint progress records in accord with the new incarnation of the database.

I am sure we could come up with dozens of scenarios where not resetting the logs would cause corruption. But, I have already given too many words to a post that was intended to be short.

The Eric S. Emrick Blog