Thursday, April 27, 2006

The Oracle Certification Process

In late 1997, on what was a very shiny day (for all of DBA-kind I am sure), I proudly exited my local facility that proctored Oracle certification exams. On this glorious day I had passed the last of four exams required to obtain the coveted Oracle Certified Professional (OCP) title. I was certified on Oracle 7.3 and could not have been more proud. After waiting a few weeks to receive my certificate I brandished it in my home study. Make no mistake about it. I felt this had legitimized my 3 long years of Oracle work to date - I had reached an Oracle summit. At that time the OCP title was not nearly as pervasive as it is today. In hindsight, I suppose my enthusiasm was not entirely unjustified.

Let’s roll time forward nine years to 2006. I have not renewed my certification. For all practical purposes I am not an OCP. I certainly wouldn’t claim such on my resume having only achieved version 7.3 certification. Why haven’t I renewed my certification? After all, Oracle has bent over backwards to assist this erstwhile OCP by offering an upgrade exam. I can take a single exam and immediately upgrade my certification status to an Oracle 9i OCP. If I labored a bit more, I could take another upgrade exam and attain the highest OCP level available. Does this mean that I could, nearly overnight, claim expertise in all of the concepts and elegant nuances Oracle has built into its database since version 7.3? Professionally, on my resume, I suppose the answer is yes. Realistically, the answer is, no way!

I feel the only real way to stay current with our Oracle knowledge and exhibit the technical acumen associated with a proficient Oracle practitioner is to read (and reread) documentation and test features. There is absolutely no substitute for good old-fashioned studying in conjunction with trial and error exercises. I have interviewed dozens of Oracle Certified Professionals over the years, many of which struggled with the basics. I do believe that today, more than ever, the ubiquitous OCP title provides little insight into the qualifications of an Oracle DBA. However, I do believe that the certification process can lay an excellent framework for a strong understanding of the Oracle database. Just, not by necessity. It varies from person to person. One person with the same temporal experience with Oracle and an OCP title might appear lacking when compared to another with equal “qualifications” and accomplishments. Why? We all have different approaches to storing information for retrieval. I remember cramming for exams in college for the courses I loathed. I always seemed to make out okay. But, did I really learn the material or just buffer it long enough so that my mind could hurl it back out in the nick of time? I know, for those “undesirable” classes it was the latter. For me to learn I must:

1. Want to learn.
2. Be passionate about the topic.
3. and study, study, study.

Of course, there are exceptions to the rules, those supremely intelligent humans that roam the earth with a glut of gray matter that have little need for 3), leaving it for the rest of us to toil.

Am I a better DBA than I was nine years ago? I certainly hope so. Could I augment the breadth and depth of my Oracle knowledge by revisiting the certification process? Absolutely. But, couldn’t I really do the same by studying the material covered by the exams? After all, I am passionate about the topic and want to learn. I know. I know. It sounds like a really cheap excuse. Read the material, but, uh hum, skip the exams right? How convenient.

For those of you with your OCP please don’t think I am minimizing your achievements. I am certainly not doing so. I believe that the Oracle certification process can yield a very productive learning experience, insofar as we really take the time to authentically learn the material we are studying. It has been my experience, that if I have ostensibly forgotten what I have learned, as long as I truly understood the material while in the learning process, re-learning can be a very quick enterprise.

By the way, I think I will take the upgrade exams this year. But, this time I refuse to cram. I will revisit the exam topics with a cheerful willingness, as the science of Oracle database administration is a very exciting and challenging branch of knowledge.

Thursday, April 20, 2006

Solaris and High Wait I/O CPU

A few days back a familiar little situation surfaced. Someone monitoring the OS was making claims that a particular machine was running at 100% cpu utilization during a period when a portion of the application was running slower than normal. The assertion being made was our system had a cpu shortage.

Given the fact that the application was running on a Solaris platform I looked at the vmstat history logs kept for just such an investigation. Per vmstat, for the time in question, there was plenty of idle cpu. Immediately, I thought this person must have been looking at the sar data on the machine in question. Sure enough, the sar data indicated a very low percentage of idle cpu. As you might have guessed, the percentage of time the system was waiting for I/O was rather large according to sar and, consequently, low idle time was being reported. I explained that it was typical for this system to run a high wait I/O percentage as reported by sar; after all, it is a database server with many processors. I also explained that low idle time as reported by sar does not necessarily mean a cpu bottleneck exists.

I remembered reading in Adrian Cockroft’s book, Sun Performance Tuning, that vmstat lumps wait I/O into idle time. So, naturally I was confident in my counter-assertion that our cpu utilization was just fine. I assuredly reached for my copy of the Sun Performance Tuning book to show where I had read this information years ago. I searched the index of the book and gave the book a cursory once-over to no avail. I started doubting whether I had reached for the wrong text! A bit frustrated I decide to perform a full book scan. Low and behold, I only got past two pages before my memory was vindicated. On page 3 it reads. “Whenever there are any blocked processes, all cpu idle time is treated as wait for I/O time! The vmstat command correctly includes wait for I/O in its idle value…” Viola!

The clock interrupt handler in the Solaris operating system runs every 10ms (or at least used to) to get cpu utilization information. It will search the state structure for each cpu and find that each cpu is in one of five states: user, system, idle, waiting for I/O or quiesced. Based on my understanding, the quiesced state is not really indicated by a value stored in a structure or variable associated with a cpu. It is simply the state when a cpu is not running user, system or idle threads and not waiting for I/O.

The point is, a high value for wait I/O generated from sar on a Solaris platform does not indicate a cpu bottleneck. Moreover, high wait I/O values do not necessarily indicate an I/O bottleneck. However, an I/O bottleneck could very easy manifest in high wait I/O percentages. You really need to look at your I/O service times to determine if the I/O subsystem is performing poorly.

For those wanting to know more on the algorithm used by Solaris to calculate idle and wait I/O cpu percentages read here. It is a bit dated, but describes how wait I/O is tallied in the Solaris operating system (at least in earlier versions). Interestingly enough this article cites Sun Performance Tuning, my trusty reference.

Tuesday, April 18, 2006

Getting a Handle on Logical I/O

The other day a colleague brought to my attention an interesting situation related to one of the databases he supports. The database was, rather consistently, experiencing heavy cache buffers chains (CBC) latch wait events while processing against a set of “related” tables. The solution devised to mitigate the CBC latch contention involved range partitioning said tables. I believe proper partitioning can be a very reasonable approach to minimize the probability of CBC latch collisions. Of course, you must know the manner in which your data is accessed and partition accordingly, as you don’t want to sacrifice existing solid execution plans among other considerations.

As it turned out, the partitioning approach did indeed reduce the CBC collisions; albeit another form of contention surfaced as a corollary, cache buffer handles latch collisions. I must admit I had a very limited knowledge of buffer handles prior to being made aware of this situation. My colleague pointed me to a very interesting article on Jonathan Lewis' site. This article gives a pithy description of buffer handles. I highly recommend you carve out a few minutes to read it. Not only might you learn something about buffer handles, you might be surprised that the more traditional notions of logical I/O do not really suffice. I was first suitably introduced to the buffer is pinned count statistic during a Hotsos training course. Essentially, this statistic indicates the presence of latch-reduced logical I/O.

While, generally speaking, Oracle recommends that hidden parameters not be changed, sometimes they need to be modified to accommodate very specific issues your database is encountering. In this particular case, increasing the value of the _db_handles_cached parameter got rid of the newly surfaced collisions on the cache buffer handles latch. I love learning from others’ experiences. It is amazing how many interesting little tales such as this exist. Also, this type of unforeseen contention shifting reinforces the need to properly test production changes - or maybe better said, the ability to properly test production changes.