Tuesday, March 28, 2006

Oracle Riddles: Now that is interesting.

A process can prevent me from staking the claim for another. Users sometimes need me to know if they can secure a spot where another process may have been before. Needless to say, I am a pretty big deal, and users are very interested in me. However, when some stop by for a visit they often feel I haven't cleaned up very well. Do you know what I am?

Recovery of Offline Data Files

You have probably noticed at one time or another that taking a data file offline in a READ/WRITE tablespace necessitates a subsequent recovery operation on that data file to bring it online again. The same cannot be said for tablespaces under most circumstances. That is, we can offline a tablespace normally (to make a backup, for instance) and recovery is not required to bring the data files comprising the tablespace online. The reason Oracle requires that you recover your data file is that Oracle does not checkpoint a data file when we take it offline. For a tablespace taken offline "normally," by contrast, each data file in that tablespace is checkpointed. This checkpoint operation synchronizes the STOP SCN in the control file with the Checkpoint SCN in the file header. For a data file taken offline on its own, Oracle only updates the STOP SCN in the control file entry for that data file. The result is a control file with a STOP SCN that is in the future of the Checkpoint SCN in the file’s header, and thus recovery is required.

I don’t think this is terribly enlightening information for most DBAs. However, I have read on more than one occasion that we need to beware of taking data files offline because of the "potential" of having to roll through a vast amount of archive logs to bring it back online. Most items I have read on this topic, in essence, state that you need all of the redo between the offline operation and the current log to fully recover the data file. Assuming the control file mounted by your database is the current control file, this notion is not accurate. The only redo that is required to recover your data file subsequent to an offline operation is that which is generated between the Checkpoint SCN in the file header and the STOP SCN recorded in the control file.

Let’s create a simple mental exercise to understand why this is the case. Then we can go through a quick example using an Oracle database.

If a database is mounted using the current control file, then the information Oracle maintains about the file headers is assumed to be 100% reliable. When we take a data file offline we are telling Oracle to immediately render the data file inaccessible to the application. Oracle cannot do this without recording in the control file WHEN this offline activity was performed (we are not dealing with OFFLINE DROP here). Updating the STOP SCN for the data file in the control file satisfies this requirement. Remember, the STOP SCN defines the WHEN in Oracle time. If Oracle chose to checkpoint the data file before taking it offline, then recovery would not be an issue; taking a tablespace offline "normally" proves this.
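
If you want to see this bookkeeping for yourself, a quick query along the following lines should do it. This is a minimal sketch assuming the standard V$DATAFILE and V$DATAFILE_HEADER columns, where (to the best of my knowledge) V$DATAFILE.LAST_CHANGE# exposes the STOP SCN recorded in the control file; run it before and after taking a file offline and compare the values.

-- Compare the control file's view of a data file with its file header.
-- LAST_CHANGE# (the STOP SCN) is null while the file is online and read/write;
-- after the file is taken offline it is populated, while the header's
-- Checkpoint SCN stays where the last checkpoint left it.
select df.file#,
       df.checkpoint_change# ctl_ckpt_scn,   -- Checkpoint SCN per the control file
       df.last_change#       stop_scn,       -- STOP SCN per the control file
       dh.checkpoint_change# hdr_ckpt_scn    -- Checkpoint SCN per the file header
from   v$datafile df,
       v$datafile_header dh
where  df.file# = dh.file#
and    df.file# = 4;                         -- file 4 is just an example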

Okay, let’s assume in our single-instance database we have 3 redo log groups with the current log at sequence number 100. Furthermore, let’s take an online data file offline. Assume that this offline activity occurs while log sequence 100 is still the current log sequence number. If after we take the data file offline we initiate 5 log switches, then our current thread will have the current log sequence number at 105. Since we have 3 log groups we know that log sequence numbers (LSN) 100-102 should be archived (and probably even 103 and 104). So, to the question: why would Oracle require us to roll up through log sequence 105 when the data file was taken offline in log sequence number 100? The control file is current and has recorded the STOP SCN. Given the current status of the control file it is not possible that any change vectors with SCNs greater than the STOP SCN are relevant to the data file we took offline in log sequence number 100. Therefore, it seems abundantly reasonable that the only log sequences required for recovery are those between the Checkpoint SCN (LSN<=100) and the STOP SCN (LSN=100). After log sequence number 100 is applied, recovery should be complete. Now, let’s take this mental exercise and apply it to a real database scenario.
sys@10g:SQL> select status from dba_tablespaces
where tablespace_name = 'USERS';

STATUS
---------
ONLINE

sys@10g:SQL> select file_id from dba_data_files
where tablespace_name = 'USERS';

FILE_ID
----------
4

sys@10g:SQL> select sequence# from v$log where status = 'CURRENT';

SEQUENCE#
----------
100

sys@10g:SQL> select distinct group# from v$log;

GROUP#
----------
1
2
3

sys@10g:SQL> alter database datafile 4 offline;

Database altered.

sys@10g:SQL> select sequence#, archived from v$log;

SEQUENCE# ARC
---------- ---
100 NO
98 YES
99 YES

sys@10g:SQL> alter system switch logfile;

System altered.

sys@10g:SQL> /

System altered.

sys@10g:SQL> /

System altered.

sys@10g:SQL> /

System altered.

sys@10g:SQL> /

System altered.

sys@10g:SQL> select sequence#, status, archived from v$log;

SEQUENCE# STATUS ARC
---------- ---------------- ---
103 INACTIVE YES
104 INACTIVE YES
105 CURRENT NO

sys@10g:SQL> alter database datafile 4 online;
alter database datafile 4 online
*
ERROR at line 1:
ORA-01113: file 4 needs media recovery
ORA-01110: data file 4:
'/ora/data001/10g/10G/datafile/o1_mf_users_1pdkdtz5_.dbf'


sys@10g:SQL> recover datafile 4;
ORA-00279: change 12438697 generated at 03/28/2006 13:48:42
needed for thread 1
ORA-00289: suggestion : /ora/arch/10g/1_98_584918376.dbf
ORA-00280: change 12438697 for thread 1 is in sequence #98


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

ORA-00279: change 12438704 generated at 03/28/2006 13:48:44
needed for thread 1
ORA-00289: suggestion : /ora/arch/10g/1_99_584918376.dbf
ORA-00280: change 12438704 for thread 1 is in sequence #99
ORA-00278: log file '/ora/arch/10g/1_98_584918376.dbf'
no longer needed for
this recovery


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

ORA-00279: change 12438707 generated at 03/28/2006 13:48:50
needed for thread 1
ORA-00289: suggestion : /ora/arch/10g/1_100_584918376.dbf
ORA-00280: change 12438707 for thread 1 is in sequence #100
ORA-00278: log file '/ora/arch/10g/1_99_584918376.dbf'
no longer needed for this recovery


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

Log applied.
Media recovery complete.


Here we see that Oracle requested archive logs 98 through 100 for recovery. If we had dumped the file header for data file 4 we would have seen that the last time the file was checkpointed the log sequence number was 98. This is evident from the redo byte address associated with the Checkpoint SCN in the header dump. Nonetheless, we have seen from the example above that Oracle does not require all redo generated between the offline operation and the current log for complete data file recovery. Oracle only required the range of log sequence numbers bounded by the Checkpoint SCN (LSN=98) and the STOP SCN (LSN=100). Notice that we cannot assume Oracle has applied all redo between the offline operation and the current log simply because it stopped asking for more archive logs. If Oracle had really applied all redo between log sequence number 98 and the current redo log, then it would have required the application of log sequence numbers 101 and 102 as well, since they are archived and not present in the online thread. But Oracle did not require these logs. We can rest assured that Oracle has only done what it needed to do based on the information in the current control file and the data file header.
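
For the curious, the file header dump I alluded to can be produced with something like the following; this is just a sketch using the commonly documented FILE_HDRS event, and the output lands in the session's trace file.

-- Dump all data file headers to the current session's trace file. The
-- Checkpoint SCN and its redo byte address (which embeds the log sequence
-- number, 98 in our example) appear in the dump for each file.
alter session set events 'immediate trace name file_hdrs level 10';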

Monday, March 27, 2006

Oracle Riddles: Where oh where?

I create products that are most often undesirable. Maybe a clause in my master plan is where the answer lies. I always cross many obstacles to complete my work. For my namesake, Rene, I behave as I am instructed. What am I?

Friday, March 24, 2006

Media Recovery Locks

What is the deal with all of those strange media recovery locks (v$lock.type=MR) held by the Database Writer process in our Oracle databases? It seems slightly strange that Oracle would have media recovery locks on a database that is open and fully functional. Or does it?

These media recovery locks do serve a meaningful purpose. When a data file is online and read/write, the control file and data file states alone would not prevent recovery from being run against that data file. Consider a database that has been shut down with the ABORT option. The state of the file headers and the corresponding control file information for the database files do not change as a result of an ABORT. No checkpoint is taken to update the file headers and no control file updates are made. The instance is simply terminated; the file headers and their corresponding control file information appear just as they did when the database was still open. The replay of committed transactions and the rolling back of uncommitted data is later handled by crash recovery and post-open transaction recovery.

However, it is a perfectly legitimate proposition to take an aborted instance, mount it and issue the RECOVER DATABASE command to make the database consistent. In fact, after you issue the RECOVER DATABASE command you can take a backup of the database. So, if I can issue a RECOVER DATABASE command against data files that are in “aborted” status, what would prevent me from invoking the recovery engine while the database was open? You guessed it, the media recovery (MR) locks held by the Database Writer. Some very strange things, indeed, might ensue if we could start recovery against an online file.
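
A minimal sketch of that sequence, assuming a suitably privileged SQL*Plus session and access to any required archived redo, might look like this:

shutdown abort
startup mount
-- Apply the redo needed to make the aborted data files consistent.
recover database;
-- At this point a backup of the closed, consistent database could be taken,
-- or the database can simply be opened.
alter database open;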

It has been my experience, for an open database, that the Database Writer does not relinquish the MR lock for a data file until the file is taken offline with or without the tablespace. With the data file offline, file media recovery is now a legal operation. When a recovery session is created to recover a data file, that process acquires the MR lock for the applicable data file. This makes complete sense, as that is the session responsible for the recovery. Even if the database is only mounted the recovery session holds the MR lock until that recovery session is cancelled or complete. This prevents multiple recovery sessions from doing the same recovery. If RECOVERY_PARALLELISM is set to a non-zero value the MR lock is still held by the SQL*Plus or RMAN session that started the recovery.

Let’s look at a quick example. This really requires two sessions to show the MR lock activity: one session drives the backup, restore, and recovery, while a second session queries v$lock while the recovery prompt is pending.

sys@10g:SQL> @test.sql
sys@10g:SQL> select a.sid,
2 substr(a.program,1,25) program,
3 b.file#,
4 c.type
5 from v$session a,
6 v$datafile b,
7 v$lock c
8 where a.sid = c.sid and
9 b.file# = c.id1 and
10 c.type = 'MR';

SID PROGRAM FILE# TY
---------- ------------------------- ---------- --
167 oracle@machine (DBW0) 1 MR
167 oracle@machine (DBW0) 2 MR
167 oracle@machine (DBW0) 3 MR
167 oracle@machine (DBW0) 4 MR
167 oracle@machine (DBW0) 5 MR
167 oracle@machine (DBW0) 6 MR
167 oracle@machine (DBW0) 7 MR
167 oracle@machine (DBW0) 8 MR



8 rows selected.

sys@10g:SQL>
sys@10g:SQL> select file_name
2 from dba_data_files
3 where tablespace_name = 'USERS';

FILE_NAME
--------------------------------------------------------------------------------
/ora/data001/10g/datafile/o1_mf_users_1pdkdtz5_.dbf

sys@10g:SQL>
sys@10g:SQL> alter tablespace users begin backup;

Tablespace altered.

sys@10g:SQL>
sys@10g:SQL> !cp /ora/data001/10g/datafile/o1_mf_users_1pdkdtz5_.dbf /ora/data001/10g/datafile/o1_mf_users_1pdkdtz5_.dbf.bkp

sys@10g:SQL>
sys@10g:SQL> alter tablespace users end backup;

Tablespace altered.

sys@10g:SQL>
sys@10g:SQL> alter system switch logfile;

System altered.

sys@10g:SQL> alter system switch logfile;

System altered.

sys@10g:SQL> alter system switch logfile;

System altered.

sys@10g:SQL> alter tablespace users offline;

Tablespace altered.

sys@10g:SQL> !cp /ora/data001/10g/datafile/o1_mf_users_1pdkdtz5_.dbf.bkp /ora/data001/10g/datafile/o1_mf_users_1pdkdtz5_.dbf

sys@10g:SQL> recover tablespace users;
ORA-00279: change 12173563 generated at 03/24/2006 14:49:59 needed for thread 1
ORA-00289: suggestion : /ora/arch/10g/1_87_584918376.dbf
ORA-00280: change 12173563 for thread 1 is in sequence #87


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}


sys@10g:SQL> select a.sid,
2 substr(a.program,1,25) program,
3 b.file#,
4 c.type
5 from v$session a,
6 v$datafile b,
7 v$lock c
8 where a.sid = c.sid and
9 b.file# = c.id1 and
10 c.type = 'MR';

SID PROGRAM FILE# TY
---------- ------------------------- ---------- --
167 oracle@machine (DBW0) 8 MR
167 oracle@machine (DBW0) 2 MR
167 oracle@machine (DBW0) 3 MR
167 oracle@machine (DBW0) 1 MR
167 oracle@machine (DBW0) 5 MR
167 oracle@machine (DBW0) 6 MR
167 oracle@machine (DBW0) 7 MR
153 sqlplus@machine (TNS V1- 4 MR

8 rows selected.


Log applied.
Media recovery complete.


sys@10g:SQL> select a.sid,
2 substr(a.program,1,25) program,
3 b.file#,
4 c.type
5 from v$session a,
6 v$datafile b,
7 v$lock c
8 where a.sid = c.sid and
9 b.file# = c.id1 and
10 c.type = 'MR';

SID PROGRAM FILE# TY
---------- ------------------------- ---------- --
167 oracle@machine (DBW0) 1 MR
167 oracle@machine (DBW0) 2 MR
167 oracle@machine (DBW0) 3 MR
167 oracle@machine (DBW0) 5 MR
167 oracle@machine (DBW0) 6 MR
167 oracle@machine (DBW0) 7 MR
167 oracle@machine (DBW0) 8 MR

7 rows selected.


sys@10g:SQL> alter tablespace users online;

Tablespace altered.


sys@10g:SQL> select a.sid,
2 substr(a.program,1,25) program,
3 b.file#,
4 c.type
5 from v$session a,
6 v$datafile b,
7 v$lock c
8 where a.sid = c.sid and
9 b.file# = c.id1 and
10 c.type = 'MR';

SID PROGRAM FILE# TY
---------- ------------------------- ---------- --
167 oracle@machine (DBW0) 1 MR
167 oracle@machine (DBW0) 2 MR
167 oracle@machine (DBW0) 3 MR
167 oracle@machine (DBW0) 4 MR
167 oracle@machine (DBW0) 5 MR
167 oracle@machine (DBW0) 6 MR
167 oracle@machine (DBW0) 7 MR
167 oracle@machine (DBW0) 8 MR

8 rows selected.


As you can see, the MR lock was acquired by the recovery session and was not relinquished until the tablespace was brought online. At that point the Database Writer assumed ownership of the MR lock for the file in question (file_id=4). In this manner Oracle can ensure that file media recovery is performed by only a single recovery session, and that recovery cannot be invoked against an online file. These MR locks are indeed annoying as they pollute our otherwise clean v$lock dynamic performance view. But they serve a very vital function in preserving the integrity of our data.

Tuesday, March 21, 2006

Unused Indexes and Scalability

I was asked an interesting question yesterday. The gist of the question is: Can eliminating unused indexes reduce latching in the database? The answer is likely “yes” in most production environments. Even though an index may not be used in an access path for any SQL in the database, the index is still maintained (in most cases) when the indexed columns of a table are modified. Even if we reduce latching owed to unused indexes, how does this improve the scalability of an application?

Let’s assume you have an unused B-tree index on a table. If an insert, delete or update affects the indexed columns for this index, then more often than not this index will need to be modified to accommodate the change. If this index is modified, Oracle must find the relevant index block(s) to change. To do this it must read the root block of the index segment, probably a branch block and lastly the leaf block(s). All of this activity requires Oracle to hash the DBA (data block address) of each visited index block to determine if the block is already in the buffer cache.

Which latches are required for this investigation? Well, the DBA is hashed to determine the cache buffers chain on which the block "should" reside if it is in the buffer cache. Once the applicable cache buffers chains latch is acquired, Oracle searches this linked list (the cache buffers chain), protected by that latch, hoping to find the current version of the leaf block (let’s not rule out a branch block change as well). If it is not found on the relevant cache buffers chain, then the block is not in the buffer cache. Therefore, the block must be read from "disk" and placed into the buffer cache. Oracle needs a free buffer to write the block into. This is accomplished by searching the applicable cache buffers LRU chain for a free buffer. The cache buffers LRU chain structure is protected by a cache buffers LRU chain latch. Once this latch is acquired and the block is written into the buffer cache, the block can be modified.

Of course, modifying the block requires Oracle to first allocate space in the log buffer to accommodate our change. Oracle will need to acquire the redo allocation latch to allocate this space in the log buffer. Then the redo change vectors, constructed in the PGA, are written onto the log buffer using a redo copy latch.

To maintain this unused index Oracle has performed unnecessary logical I/O (root, branch and leaf index blocks as well as undo segment header and undo segment blocks) and unnecessary latching. Moreover, unnecessary redo has been generated along the way.

Latching plays a major role in the ability of an application to scale, as concurrency is impacted via latch contention. Latches serialize operations on structures to ensure the integrity of those structures. So they are our friends, but they can easily become areas of frequent contention. The less latching Oracle does, the greater the probability a process will succeed in getting the latch it needs on its first attempt to acquire it. Concurrency, and thus scalability, benefit as a result. Latch contention can result in a process spinning (CPU intensive) on latches and sleeping (a response time inhibitor) until the latch is available. Processes are not managed by a queue for latch access. That is, a process is not guaranteed a position in a queue based on the point in time it attempted to acquire the latch. It either acquires the latch immediately or it does not.
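
A rough way to gauge how contended these latches are on your own system is to sample V$LATCH; the sketch below uses the common latch names, and the numbers are cumulative since instance startup, so compare two samples taken some time apart.

-- A miss means a process failed to get the latch on its first attempt;
-- a sleep means it gave up spinning and went to sleep waiting for the latch.
select name, gets, misses, sleeps
from   v$latch
where  name in ('cache buffers chains',
                'cache buffers lru chain',
                'redo allocation',
                'redo copy')
order by name;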

So, before we go adding indexes we think might be useful, we should weigh the inherent maintenance overhead against the scope of any performance gain. If scalability is deemed paramount, each database or SQL change should be made with a high-level understanding of the resultant latching fresh in our minds.
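
As a starting point for finding candidate unused indexes, Oracle's index usage monitoring can help; the sketch below is illustrative only (the schema and index names are made up), and remember that V$OBJECT_USAGE reports only on indexes owned by the connected user and only on optimizer usage, not on the maintenance overhead discussed above.

-- Flag the index for usage monitoring, let a representative workload run,
-- then check whether the optimizer ever chose it.
alter index scott.emp_name_idx monitoring usage;

select index_name, table_name, monitoring, used
from   v$object_usage
where  index_name = 'EMP_NAME_IDX';

-- Turn monitoring back off when finished.
alter index scott.emp_name_idx nomonitoring usage;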

Wednesday, March 15, 2006

Terrific Presentations on Oracle Redo

Recently, I came across a couple of really terrific presentations on redo internals and redo reduction by Julian Dyke. If you have any interest in some of the more subtle details of Oracle redo, I highly recommend you take some time to go through these presentations. In typical Julian Dyke form, they are lucid, concise and very informative.

They are available on the web free of charge. Check them out and let me know what you think!

Tuesday, March 14, 2006

Hotsos 2006: Day 3 – Revisited

This is a short follow-up post to my Hotsos 2006: Day 3 post, thus the extremely creative name for this post :) In Tom Kyte’s presentation “Performance Impact of the LOG ERRORS Clause in 10gR2”, Tom’s testing revealed that a single-statement multi-row update using the LOG ERRORS clause generates nearly 2x the redo when compared to other methods of updating the same data: row-by-row and array processing via anonymous PL/SQL blocks. As I stated in my previous post, I wanted to take a look at this to understand why 2x the redo is generated to accomplish the same task.

I was able to reproduce Tom’s results using the examples taken directly from the presentation. The additional redo owed to the multi-row update statement with the LOG ERRORS clause results from the row locking “approach” taken by Oracle during the execution of the statement. In the row-by-row processing the rows were locked ("to lock" effectively means to update the lock byte in the row header) and changed in the same redo record. However, in the aforementioned multi-row update using the LOG ERRORS clause each affected row is given two redo records to accommodate the change. The first redo record locks the row and the second redo record effects the column data and column length changes. A dump of the log file showed the following pattern for the multi-row update using the LOG ERRORS clause:

1. Change the lock byte of row header for row 0 to the index value for the ITL entry associated with the current transaction.

2. Change the column data and column length attributes to reflect the updated column data and width for row 0.

3. Change the lock byte of row header for row 1 to the index value for the ITL entry associated with the current transaction.

4. Change the column data and column length attributes to reflect the updated column data and width for row 1.
.
.

2N+1. Change the lock byte of row header for row N to the index value for the ITL entry associated with the current transaction.

2N+2. Change the column data and column length attributes to reflect the updated column data and width for row N.

A dump of the log file for the row-by-row update yielded the following redo pattern, "lock the row AND change the row":

1. Change the lock byte information, column data and column length attributes in a single redo record for row 0.

2. Change the lock byte information, column data and column length attributes in a single redo record for row 1.
.
.

N. Change the lock byte information, column data and column length attributes in a single redo record for row N.

What is rather interesting is that this "lock THEN change" method of “acquiring” locks is not necessarily specific to multi-row updates with the LOG ERRORS clause. I tested multi-row updates without the LOG ERRORS clause and the same redo pattern emerged: lock the row, then change the row.

Moreover, suffixing the update statement in the row-by-row PL/SQL block with the LOG ERRORS clause also produced the “lock the row THEN change the row” redo record sequence. It appears that Oracle’s code path is necessarily adjusted by the LOG ERRORS clause to effect this pattern for update statements, irrespective of the number of rows the update statement modifies. However, this code path is apparently available to other multi-row updates not using the LOG ERRORS clause.

So, how does this account for 2x redo? It has been my experience that on average redo records are around 250-300 bytes for small changes. Much of this space is owed to the administrative data such as SCNs, operation codes, flags etc. Even a small single byte change can generate a redo record of 200+ bytes. In each and every test case I ran the redo record size for the single lock byte update was 208 bytes. If each row updated via an update with the LOG ERRORS clause created two redo records, one for the lock byte update and another for the small data change, this explains why nearly 2x redo is generated.
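
For anyone wanting to reproduce this, a rough harness is to bracket each update variant with the session's 'redo size' statistic and then dump the current log for a closer look; the table, error table and log file path below are hypothetical placeholders.

-- Snapshot the session's redo generation before and after each test and diff.
select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  ms.statistic# = sn.statistic#
and    sn.name = 'redo size';

-- One of the variants under test: a multi-row update with DML error logging.
update t set val = val + 1
log errors into err$_t reject limit unlimited;
commit;

-- Re-run the redo size query, then dump the redo to see the pattern shown above:
-- alter system dump logfile '/path/to/current_redo.log';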

One point of clarification: the notion that Oracle locks a row before it changes a row is very fundamental. Indeed, our applications rely on it! However, it does appear that Oracle catalogues the changes owed to acquiring the row locks differently based on the task at hand.

Thursday, March 09, 2006

Hotsos 2006: Day 3

I started off the day with a pastry, a glass of orange juice and an hour of Tom Kyte. Not a bad way to start a day, now is it? Tom gave a presentation on the performance impact of the LOG ERRORS clause in 10gR2. This feature of Oracle is really nice. As Tom said, "It will change the way we extract, transfer and load data in the future." I totally agree. The gist of this new feature is the ability to log error messages (and other relevant row data) for the failed rows in a multi-row DML statement. Much like the .bad file in the SQL*Loader tool gives us the ability to record failed rows outside the database, the LOG ERRORS clause permits us to store essentially the same data in an "error" table. As of 10gR2, our multi-row DML operations no longer need to be an "all or nothing" proposition. It is perfectly permissible for a pre-configured number of rows to fail. Moreover, the logging of errors is an autonomous transaction. That is, if you roll back the DML the error data can persist in the "error" table.
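
For reference, the basic shape of the feature looks something like this (a sketch using the standard DBMS_ERRLOG package; the table names are hypothetical):

-- Create the error table for SALES_STG once, up front (defaults to ERR$_SALES_STG).
exec dbms_errlog.create_error_log(dml_table_name => 'SALES_STG')

-- Load, letting bad rows fall into the error table instead of failing the statement.
insert into sales_stg
select * from sales_ext
log errors into err$_sales_stg ('nightly load')
reject limit unlimited;

-- The rejected rows, the Oracle error and our tag are preserved here,
-- even if the insert above is rolled back.
select ora_err_number$, ora_err_mesg$, ora_err_tag$
from   err$_sales_stg;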

As for the performance impact owed to the LOG ERRORS clause, it really varies and can be a function of the number of errors encountered and/or the path (conventional or direct) taken by the statement. In general, DML error logging should use direct path processing. Obviously, it is abundantly safe if a high percentage of your data is clean. If you are running 10GR2 this is definitely an option to seriously consider for your data loads.

For me there was one unresolved issue with using this option. Based on Tom's testing, updates that use the LOG ERRORS clause generate nearly 2 times the amount of redo and nearly 2 times the undo when compared to a variety of other more clunky "coding" options that render the same result. Naturally, I will need to determine why this is the case :) When I get the answer I will create a follow-on post. Like I always say, Oracle cannot hide the nature of the redo!

In the afternoon Doug Burns gave another very good presentation titled "How Many Slaves? (Parallel Execution and the Magic of Two)." While I haven't consumed the entire paper that accompanies the presentation, it is very obvious Doug has done his homework. What I truly appreciate with this topic, in particular Doug's paper and presentation material, is the rigor he has imposed on his investigations. While Doug will be the first to tell you that there are many more testing scenarios that can be envisioned (countless) and that results will be platform-specific, the key is to know your system and understand that choosing the degree of parallelism is not an exact science. I encourage you to read his paper to get the details as they are extremely interesting. And remember, just because your system affords you the ability to run higher degrees of parallelism to achieve better response times, that doesn't give us a license to abuse our systems. Everything has a cost. We should carefully evaluate our "need" for fast throughput for jobs deemed worthy of parallelizing, as we could easily starve other well-deserving processes of CPU.

Tomorrow, Jonathan Lewis is giving a training day on "Understanding and Assisting the Cost Based Optimizer." I am sure this will be a terrific ending to a fantastic week of learning from some of the world's best Oracle minds.

Tuesday, March 07, 2006

Hotsos 2006: Day 2

Okay. Day 2 of the 2006 Hotsos Symposium is in the books. It proved to be yet another really good day of learning and sharing of technology. One of my favorite aspects about the symposium format is the ability to get so many varied opinions on such a wide variety of technical topics by simply strolling the floor. You might have a fleeting conversation with someone you have never met and learn something new. That is cool!

Anyway, as for the proceedings I started off in Bob Sneed's "I/O Micro-benchmarking with Oracle in Mind" presentation. He made a few really good points pertaining to common micro-benchmarking errors:


  1. We often run irrelevant tests.
  2. Failure to pay attention to initial pre-test state influences can sabotage your benchmark (Beware of the warm cache).
  3. Failure to repeat tests to ensure repeatability of results.
  4. Don't leap to conclusions.

He also mentioned a couple of open source products that can be used for micro-benchmarking: filebench and iozone.

One last comment - which requires further investigation. He said that anyone owning the book Oracle Wait Interface: A Practical Guide to Performance Diagnostics and Tuning should rip out page 134. I will wait until I get back to the office before I read my colleague Michael Eubanks' copy and rip said page out :-)

Then it was on to Doug Burns' "Tuning and Tracing Parallel Execution" presentation. First off, this guy can flat out present (and with terrific humor). I thought the presentation was very good. Some of the notable points were the following.

Why do so few sites implement the Parallel Query option? Well, I didn't know that so few sites were using it, but I will certainly take his word. He mentioned a few reasons:

  1. Lack of understanding of the nuances of PQ.
  2. Deterred by horrible early experiences (e.g., CPU starvation, unpredictable results).
  3. Community's resistance to change.
  4. The very fact that it is not useful in all environments.

He advocated, with a slight nudge from Tom Kyte, the use of the parallel_adaptive_multi_user parameter to adapt the PQ slave use to the current workload on the system. In this manner, we can theoretically take full advantage of Parallel Query while reducing the risk of CPU starvation. If you know me, you know that I love a diagram that can say in one slide what might take 20 bullet items to "spell out". Doug had one particular slide that showed the interaction between parallel query slave processes and the parallel query coordinator processes during a sort operation. It revealed how parallel query slaves can assume the role of both a consumer and a producer at different points within the execution of a single statement. That added a new wrinkle to my gray matter.

He also described some of the challenges of tracing PQ processes in pre-10g Oracle installations, i.e. one trace file per slave and coordinator. However, in 10g this is remedied via the use of trcsess as it generates a consolidated trace file for the query coordinator and the slave processes. I also thought his review of the parallel query wait events and related dynamic performance views topped off a very good presentation. Great job Doug! I look forward to your presentation on Day 3.

I thought Tanel Poder did a really good job in presenting his "End-to-end Performance Diagnosis in Oracle" presentation with relevance and clarity. To summarize his presentation: the probability of quickly diagnosing a performance issue usually correlates with the instrumentation available at each layer. The less instrumentation that exists at each "layer" of the application, the more time it will take to complete the diagnostic tasks at hand. The number of cycles of "why" -> "where" -> "why" -> "where" -> .... -> "root cause" will typically grow as the instrumentation at each layer diminishes. The converse is likewise true. Of course, the benefit of instrumentation must be weighed against any overhead said instrumentation incurs.

I rounded off the day with back-to-back presentations from Jonathan Lewis and Mogens Norgaard. Jonathan spoke on the V$ and X$ performance views. With erudition that ostensibly only Jonathan can provide, the V$ and X$ views were exposed for what they truly are: views which can be immensely useful and which can be recreated in a much more database-friendly manner. Unfortunately, this talk only lasted an hour. However, I am privileged to be able to attend the training day with Jonathan on Thursday. So, I am sure any gaps not fully appreciated in this one-hour presentation will be filled in during that training session. Jonathan's understanding of Oracle internals is truly amazing. It is obvious he has invested much energy decoding some of the more mysterious aspects of Oracle, such as the v$ and x$ views, in an attempt to give the average DBA quick access paths to otherwise potentially prohibitive queries against these prized structures. Just when you think you know a thing or two about Oracle internals, a Jonathan Lewis presentation quickly reminds you of how little you really know. As Socrates said, "True knowledge exists in knowing that you know nothing." When Jonathan Lewis or Tom Kyte present I feel my seat getting larger and larger beneath me.

Mogens delivered a fantastically witty and delightful presentation to round off the day. I set down my pen and closed my notebook to simply enjoy the offering. I have found Mogens to be one of the best, if not the best, at delivering a high-level technical discussion seasoned with relevant and primarily non-gratuitous humor. If the majority of the world understood databases he could easily give a stand-up routine each and every night at any venue in the world. He is that good.

Monday, March 06, 2006

Hotsos 2006: Day 1

Well, the education portion for Day 1 at Hotsos 2006 has ended (at least formally). There was some very good information given today. However, I think the keynote speech given by Ray Lane, former President and COO of Oracle Corporation, was potentially the most interesting. It was not a techy, nuts and bolts delivery but he shared some very fresh ideas borne from his vast experience in the industry.

Mr. Lane spoke of the software paradigm shift we are currently in the midst of. He cited his seven laws for successful software in the future:


  1. Serves Individual Need
  2. Viral/Organic Adoption
  3. Contextual Personalized Information
  4. No Data Entry/No Training Required
  5. Utilizes Community, Social Relationships
  6. Delivers Instantaneous Value
  7. Minimum IT Footprint

In a nutshell, Mr. Lane contends that the innovations to come, and indeed already present in many open source products, will dictate the way corporate America invests in its enterprise software. He stated, "All the things we desire in our enterprise software are already understood by the individual user community." Essentially, if software is to survive in the future it needs to meet the seven conditions listed above. For example, many people really value their wireless phones as they can serve as a planner, camera, web browser, text messenger, etc. And to quote Mr. Lane, "When was the last time you bragged on your new enterprise software?"

James Morle delivered a very good presentation titled "Sane SAN." I think a couple of really good ideas to take from his presentation are:

  1. Disks are getting bigger and faster. But, the bigness is growing faster than the fastness. That is definitely something to keep in mind when capacity planning is on the table.
  2. We tend to price our DASD by the gigabyte/megabyte. Why aren't we costing I/O? That is, what does an I/O cost in our production environments?

I do have a couple of thoughts on Julian Dyke's presentation. He delivers technical topics with great humor. I really appreciate that in a technical presentation. Moreover, his animations for buffer cache management in RAC and non-RAC environments were awesome! It is really a talent when you can present technical ideas in such a basic manner, using animations, that it renders the notion nearly trivial. Animations are much easier to remember than an endless array of bulleted items. While RAC buffer cache management is not nearly as profound as, say, astrophysics, I can't help but think about Richard Feynman's famous quote given to one of the few freshman classes he taught. He was asked by a professor attending one of his lectures to comment on a particular phenomenon. Feynman obliged and stated that he would present a freshman lecture on it for the next class. When the day arrived to give the lecture Feynman opened with, "I couldn't do it. I couldn't reduce it to a freshman level. That means we really don't understand it." I love that quote.

Lastly, I attended Riyaj Shamsudeen's "Redo Internals and Tuning by Redo Reduction" presentation. Of course, this topic is near and dear to me so I was very excited about this presentation. He delivered a very clear and concise exposé of redo internals that was both elucidating and informative. I learned several new redo internals tidbits and look forward to reading the accompanying white paper.

All in all, it was a very good first day at Hotsos 2006.

Hotsos 2006: Day 0

Well, after a few unintentional "detours" my colleagues and I arrived at the Omni Mandalay too late for registration. Tomorrow morning will have to suffice. However, we did get an opportunity to meet for a while with a former DBA colleague Dan Furnier. It is always nice to catch up with former co-workers.

There is a full day of learning on tap for tomorrow. I am really looking forward to tomorrow's presentations - so much to learn and so little time.

Time to turn in. I need to maximize my fresh brain cells for tomorrow. Hotsos 2006 is officially here!

Friday, March 03, 2006

RMAN is Absolutely Fuzzy

This is a bit of a follow-on post to my The Pleasure of Finding Oracle Things Out post, in which I spoke of some internal mechanics relating to user-managed hot backups. In this post I want to focus on RMAN and how, subsequent to a restore of a data file from an RMAN-formatted backup set or an image copy to disk, Oracle knows what needs to be recovered. Remember, with user-managed backups the checkpoint SCN is frozen after the checkpoint issued by the BEGIN BACKUP completes. This tells Oracle where to start recovery when the file is restored. We also know that the END BACKUP command creates a redo record that corresponds to this backup, tied to it via the checkpoint SCN. During recovery Oracle is privy to the operations that define the backup-necessary redo. That is, the BEGIN/END BACKUP commands act as recovery bookends to preserve consistency.
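
You can watch the first bookend being set from V$BACKUP, where (as I understand it) CHANGE# records the checkpoint SCN frozen by BEGIN BACKUP; a quick sketch, using the USERS tablespace purely as an example:

alter tablespace users begin backup;

-- STATUS flips to ACTIVE for the files in backup mode, and CHANGE# holds the
-- frozen checkpoint SCN, which is where recovery of a restored copy must start.
select file#, status, change#, time
from   v$backup;

alter tablespace users end backup;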

However, RMAN is a totally different beast. We know that RMAN permits us to take hot backups and restore either image copies or backup set formatted files for recovery. This is well documented and I am sure you have reaped the benefits of this RMAN feature in your DBA adventures. But how does it work? (I love that question)

Yes, we read of RMAN handling the fractured block dilemma inherent in user-managed backups by re-reading changed blocks to ensure block consistency. And you have probably read that it reads the file header block first and only backs up blocks that have previously been used (let’s stick to level 0 backups here). Oracle also issues a checkpoint on the applicable file(s) before backing up said file(s) for either a backup set or image copy. Okay, but what can we say of the bookends? When we restore a file from a backup set or an image copy, what defines the bookends for the backup-necessary redo? Backup-necessary redo can be defined as the redo that is required to make a fuzzy/inconsistent backup file a non-fuzzy/consistent file. Of course, we often continue past the backup-necessary redo to get to a further point in Oracle time.

If you stop for a moment and really think about it, defining the bookends for RMAN is not a very easy proposition. We know it works. We use it. We like it. How often do we question it?

Let’s create a very rudimentary example to demonstrate the nature of the RMAN bookends. In this example we will back up a file (level 0) comprising six blocks (all blocks previously used and therefore subject to our level 0 backup) to a backup set (to disk or tape, it does not matter). The figure below depicts the states of the blocks in our sample data file at different points in time. Block 0 is the header block and blocks 1-5 are for data. Listed under each block number is the checkpoint SCN for the header block and the block SCN for the data blocks. For example, at time=0 the state of the data file is a checkpoint completed at SCN 120 and blocks 2, 4 and 5 have been modified since the last checkpoint. Blocks 1 and 3 were modified prior to the checkpoint at SCN 120. What would happen if we started an RMAN backup at time=0? Oracle would issue a checkpoint on the file and start reading the blocks in order from 0-5. For simplicity assume we have only a single backup process reading the data file during the backup. Also assume the backup finishes reading the data file at time=4. What would the backup piece “contain” such that a subsequent restore would be primed for recovery?

First we need to know which blocks are backed up, and when, before we can determine what will be written to the backup set. Let’s use the following diagram to depict the state of the backup set if the backup is started at time=0.


At time=0 RMAN issues a checkpoint against the data file and reads the header block into memory. Assume the header block is buffered in the Large Pool (Oracle could just as easily buffer the block in the memory of the shadow process backing up the data file). I found a presentation on the Web authored by a member of the Oak Table that asserts the header block is read first and written last. Indeed, we will see this needs to be the case to satisfy both bookends of the backup-necessary redo.

As we can see, at time=1 we have only backed up block 1. At time=2 RMAN backs up block=2. At time=3 it backs up blocks 3-4. Lastly, at time=5 it backs up block=5. Remember, before RMAN started it issued a checkpoint against the file and preserved the SCN from the completion of the checkpoint in the buffered image of the header block; the header block checkpoint SCN in this case was 125, as the checkpoint completed at time=1. Notice the state of the blocks written to the backup set. They vary in Oracle time as they contain different SCN values. The fact that block=4 has an SCN that is greater than the checkpoint SCN recorded in the buffered image of the header makes the backup of this file inconsistent, or fuzzy (file header status 0x40). How does Oracle resolve this? Well, Oracle reads the SCN of each block during the course of the backup. If a single block backed up has an SCN greater than the checkpoint SCN, all intermediate SCNs need to be accounted for during recovery. For instance, in our example block=1 was changed to SCN=126 while the backup was being performed. Moreover, that is less than the highest block SCN encountered (block=4 has SCN=127). However, the backup set has the state of this block at SCN=123. If Oracle knows the highest SCN encountered during the course of the backup, it can account for any recovery needed between the checkpoint SCN and this highest SCN.

As it turns out, Oracle reserves a section of the file header block of each file for just such an occurrence. This is called the Absolute Fuzzy SCN and represents the SCN required for recovery to make this a consistent file. Our bookends are then defined as the checkpoint SCN and the Absolute Fuzzy SCN. At a minimum, Oracle must recover from the checkpoint SCN through the Absolute Fuzzy SCN for consistency. If Oracle did not detect any SCNs higher than the checkpoint SCN during the backup then the backup would be considered consistent (file header status 0x0) and the Absolute Fuzzy SCN would remain at 0x0 - obviating the need for any backup-necessary redo to be applied. As you can see, this is the reason Oracle waits until all data blocks in the file have been read and written before it writes the header to the backup set. This permits the proper settings for the bookends. You can find the Absolute Fuzzy SCN at the bottom of a file header dump.

.
.
Absolute fuzzy scn: 0x0000.004da811
Recovery fuzzy scn: 0x0000.00000000 01/01/1988 00:00:00
Terminal Recovery Stamp scn: 0x0000.00000000 01/01/1988 00:00:00
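
A less intrusive way to check a restored file for fuzziness is V$DATAFILE_HEADER; the sketch below assumes the standard FUZZY column, which I take to reflect the same fuzzy state (the 0x40 status) discussed above.

-- After restoring a file from an RMAN backup, FUZZY indicates whether
-- backup-necessary redo still has to be applied to make the file consistent.
select file#, status, fuzzy, checkpoint_change#
from   v$datafile_header
where  file# = 4;                 -- file 4 is just an example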

What about image copies?

RMAN does not really “behave” any differently for an image copy. It still performs a checkpoint and still notes the Absolute Fuzzy SCN if applicable. What is different, however, is that an image copy is, well, just that: an image. Therefore, the header block sits in its natural position at the logical front of the backed-up data, not at the end as in the backup set case. This is yet another reason, if not the primary reason, why Oracle does not permit image copies to be backed up directly to tape: it could not set the Absolute Fuzzy SCN in the header block, because the header block would have already been written to a sequential access medium.