## Friday, December 21, 2007

### Data Pump Rocks!

For the past week, we dove into Data Pump with a vengeance. I have never played with it before; I think I may have tried a table export or two, but nothing really serious. So last week I had no clue what a huge improvement over the "traditional" exp/imp Data Pump is. Of course, it goes without saying that there will be bugs and gotchas, but that pretty much comes with the territory. Sad but true.

The purpose we engaged in this activity in the first place is that our ERP ostentatiously decided that they want to support international characters, which means that all of us state-side have to upgrade regardless of any character needs. They didn't ask me! *grin* But the silver lining is that we have learned so much about Data Pump.

Just for the sake of numbers, we were able to pump the data from a 478gb database to a set of dumpfiles totaling 175gb in about 3 hours, and then turn around and pump that back into an empty database in about 4 hours or so. Yes, there are some hidden truths in there. For instance, we completely ignored indexes; they will explode your timeframe significantly.

EXPDP
We hit some issues early on with the export. At first, we thought we were hitting the LOB issue. After filing an SR, we learned of Metalink note 286496.1 which covers tracing and an "undocumented" METRICS parameter. METRICS seems like a vastly handy little piece of information, so I am quite flabbergasted that it is "undocumented". According to the note, we set TRACE=480300, which has some trace information for the Master and Worker processes. In addition, I set event 10046 to get the waits. That was eye opening. The database was spending an enormous amount of time waiting on "library cache pin" while creating Global Temporary Tables. Very odd. After playing ping-pong with the Support Analyst and thinking about it for a while, I realized that all of the objects experiencing a wait had FGAC enabled for VPD. Ok, one strike against me for not choosing a VPD-free login (ie, one that has been granted EXEMPT ACCESS POLICY), but one strike for Data Pump for doing something rather poorly. I am hoping to hear more about this particular behavior and how it will be resolved.

We also set our TEMP space to extend without limit and set the pools (shared pool and buffer cache) a bit higher than normal.

The parameter file we used:
directory=DPUMP
dumpfile=${ORACLE_SID}_full%U.dmp logfile=${ORACLE_SID}_full_debug.log
full=y
parallel=16
metrics=y
userid="xxx/yyy"
TRACE=480300

IMPDP
The very first thing I tried was NETWORK_LINK; for "traditional" exp/imp, we use a pipe as an intermediate file instead of exp to a dump file and imp from the dump file. Since Data Pump writes asynchronously, this is not possible, but the alternative is to communicate via a Database Link. Unfortunately, LONG objects are not yet supported via this method, excluding this option as a viable method.

The next problem we encountered was that the metadata contained a tablespace specification for a non-existent tablespace on two of our partitioned tables. This turned out to be extremely counterintuitive. If you precreate the table (on the proper, existing tablespaces), IMPDP will fail trying to create the table (on the wrong, non-existing tablespace). Even if you specify TABLE_EXISTS_ACTION=TRUNCATE!! Our Support Analyst is telling me that this is the expected behavior. Was not my expectation at all. To fix it, we create the tablespace and viola, we have a working import process.

Lastly, we struggled for a long time with the arduous process of creating indexes. Data Pump
says it is creating indexes in parallel. In reality, it using the parallel degree clause of the CREATE INDEX statement, utilizing the RDBMS parallel server processes. This seems rather antithetical to the rest of Data Pump, especially if you consider that no matter how much parallelize, you bottleneck with the query coordinator. I much prefer that Data Pump use a Parallel Degree of 1 for the index creations, but launch multiple creations at the same time. In fact, I downloaded a script called paresh. I had to modify it a bit to use the DDL generated by IMPDP Metadata for indexes, but it seems to work quite well. I need to modify it more for error checking, but at least it creates indexes truly in parallel.

So, with that out of the way, we are now working on a "clean" import given these exceptions. For the init.ora, we use:
nls_length_semantics = CHAR

## Import Debugging
max_dump_file_size = unlimited
# event="10046 trace name context forever, level 12"

## Import Speed-up parameters
shared_pool_size = 1000M
sga_max_size = 2000M
parallel_max_servers = 24
_disable_logging = TRUE
DB_BLOCK_CHECKSUM=FALSE ## DEFAULT = TRUE
## DISK_ASYNCH_IO=TRUE ## DEFAULT
## DB_BLOCK_CHECKING=FALSE ## DEFAULT

A note of WARNING about _disable_logging. It is an underscore parameter, so all the usual warnings accompany that. I found out the hard way what happens if you shutdown abort while attempting to rollback sql statements:
SMON: following errors trapped and ignored:
ORA-01595: error freeing extent (3) of rollback segment (1))
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [4193], [114], [232], [], [], [], [], []

Our IMPDP parameter file is similar to EXPDP, but excluding some objects:
directory=DPUMP
dumpfile=${ORACLE_SID}_full%U.dmp logfile=${ORACLE_SID}_full_no_indexes.log
full=y
parallel=16
metrics=y
userid="xxx/yyy"
TRACE=480300
EXCLUDE=index,constraint

We are still playing with the fastest way to migrate constraints. One thought is to do two passes:
• disable constraints
• import data
• enable constraints in parallel
• build indexes in parallel

Another thought is to disable the constraints on the source, but that may not be practical for our situation. We will see.

In the End
I am optimistic that our latest changes are going to produce some fast numbers. Of course, there are other areas that could be tweaked (place dumpfiles on independent mount points, for example), but we went with the low-hanging fruit, and I think we scored. Another outcome is that we have made contact with the Data Pump Product Manager and her supervisor, which is priceless! They are excellent people and very patient and willing to listen. Amazing folks over there!

## Friday, December 07, 2007

### All I want for Christmas

We have a particularly naughty database this week. However, it still has high hopes for the "giving season":

SQL > select dbms_random.string('U', 4) from dual;
DBMS_RANDOM.STRING('U',4)
------------------------------------------------------------------------------------------------------------------------
RIBS

That was for real. We had quite a laugh about that little coincidence this morning.

## Monday, November 26, 2007

### Managing CRS, part 1

This CRS beasty is a bit much to chew on. Maybe it is just me.

Anyway, I started asking around how to check our mount and start options for databases registered with CRS. Strangely, nobody had an answer for me. Probably lack of me asking the right question, rather than lack of knowledge. But, I did find an easy answer:

srvctl config database -d DB_NAME -a

This helped me understand why our standby RAC databases were opening in read-only mode as opposed to mounting into a recovery mode. I had been following the MAA documentation for setting up a standby, which ostensibly does not cover the mount and start options (nor role) for standby databases. Very curious. I modified our standby databases with this kind of command:

srvctl modify database -d DB_NAME -r PHYSICAL_STANDBY -s mount -y AUTOMATIC -p +DATA/db_name/spfiledb_name.ora

While these options are well documented, you have to find it first. For those of us who are new to the scene, we do not always have the faintest idea of where to look. Is it a CRS command, or an OCR command? While you can usually depend on folks in the RAC SIG and oracle-l to help out, sometimes they are just too busy (hmm... it is Thanksgiving....). Or, in a twist of Jonathan Lewis' quotes, "Sometimes, you just get unlucky."

So now that I figured out that one small piece to the puzzle, I have stumbled upon other questions. How do you get a report for all the start/mount options for all databases? I do not really want to run srvctl config for each database. What about other services, like the listener? I tried playing around with srvctl config listener, but I am not getting anything useful out of it. Especially since there is no -a flag. I am currently trying srvctl config service, but all my half-baked guesses are not getting me anywhere. I tried variations on the listner name and what I thought the service was, and I also tried the name reported by crs_stat.

The lack of comprehensive tools (like a good 'du') still bug me with the ASM as well.

## Friday, November 16, 2007

### Reverse mapping ASM disks

As we have been working with our sysadmin and storage folks, I often have to do some digging to find out which ASM diskgroups belong to which volume, and what devices those volumes are on. Fortunately, we only have 4 at the moment, so it is a quick dig. However, I am always disappointed that Oracle did not provide an easy way to do this. Or if they did, they did not make obvious mentions in any of their documentation.

Google showed me a great, concise script that Alejandro Vargas wrote. I enhanced it a little to go against the ASM instance to grab diskgroup information as well.

--- start ---
export ORACLE_SID=+ASM
. oraenv

$ORACLE_HOME/bin/sqlplus -S "/ as sysdba" << EOS 2>&1 |grep [A-Z] > asmdisks.txt set head off feed off select a.group_number||' '||b.disk_number||' '||a.name||' '||b.label from v\$asm_diskgroup a, v\$asm_disk b where a.group_number = b.group_number / exit EOS printf "%-9s %-30s %-3s %-10s %-3s\n" "ASM Disk" "Device Path [MAJ,MIN]" "GRP" "Disk Group" "DSK" /etc/init.d/oracleasm querydisk /etc/init.d/oracleasm listdisks | cut -f2,10,11 -d" " | perl -pe 's/"(.*)".*$(.*), *(.*)$/$1 $2$3/g;' | while read v_asmdisk v_minor v_major
do
v_device=ls -la /dev | grep " $v_minor, *$v_major " | awk '{print $10}' grp=grep$v_asmdisk asmdisks.txt|cut -f1 -d" "
dsk=grep $v_asmdisk asmdisks.txt|cut -f2 -d" " diskgroup=grep$v_asmdisk asmdisks.txt|cut -f3 -d" "

printf "%-9s /dev/%-25s %-3s %-10s %-3s\n" $v_asmdisk "$v_device [$v_minor,$v_major]" $grp$diskgroup $dsk done \rm asmdisks.txt ---- end ---- ## Monday, November 12, 2007 ### Day 3: Understanding and Assisting the CBO Session 1: Basic Cost Arithmetic Anyone that has read any of Jonathan's previous works will recognize the information surrounding the costing formulas; he also gives generous credit to Wolfgang Breitling and Alberto Dell'era. What is really great is that all three of these guys are very generous in sharing of their knowledge and experience, wishing to enhance the community by helping others. Just amazing! One of the first things Jonathan went over was the environment; there are a lot of factors that play a part in calculating costs of various different pieces of Oracle. The system stats are very important in that they determine cpu speed and io latency, which in turn determine how fast basic Oracle operations occur. Another important number is multiblock read count. I found it very interesting that the hidden underscore parameter _db_file_exec_read_count defaults to db_cache_size/processes (if db_file_multiblock_read_count is not set). Processes; so let's say you set the number of processes really high "just because". You can see that your exec read count will be quite small, and for no good reason. Jonathan also talked about the importance of sizing the memory parameters appropriately. I think the general impression is that you do not want to gimp your system by being too frugal with memory. Obviously, this will affect large operations (sorts, hashes, etc) more than anything else, but those large operations can get real nasty real fast (think multipass for hashes). Two underscore parameters that Jonathan highlighted were _pga_max_size and _smm_max_size (unfortunately, there were not many details on what these two did, or I missed them). He made a very interesting statement in that truncating a table does not reset the stats. That sounded very strange to me. Is that true in 10g?? If Oracle goes to the trouble to reset the highwater mark, why would the stats not be updated? They merely have to be zeroed out. We spent a good chunk of time on clustering. Not just in this session, but in others as well. There is quite a serious flaw in how clustering is calculated for data that is scattered. Consider the case where row(n) is in block(mod(n,2)), or in other words, all the odd rows are in block 1, even rows in block 0. To determine clustering, Oracle will walk the data and count each time the block id changes. Since the id changes for each row, Oracle will calculate the clustering factor really really small, when in fact, the clustering is actually pretty good (total of two blocks, half your data is in one block or the other). A low cluster factor translate into a high IO cost. An articially high IO cost may lead to a sub-optimal plan (where a good plan that has a false high IO cost is ignored in favor of a lower-cost other plan). This also prompted me to learn more about clustering factor. Related to that, I was convicted several times of a need to identify what our "significant" or "important" data is. That is probably the number one question I returned to again and again. What is the Banner "business important" data? There are several dictionary views and object statistics (including predicate statistics) that I want to investigate further, but I actually do have an overly optimistic hope that our vendor has some idea. There are a couple other flaws in IO costing that Jonathan pointed out. Flaw #1: Oracle assumes that index block reads are from disk, not cache. Always. Flaw #2: Oracle assumes index single-block read is the same cost as table multi-block read. Always. TO address Flaw #1, we have optimizer_index_cache, which tells us, on average, how many of our index blocks are in cache (expressed as a percentage). Jonathan stressed that this is only relevant for index root and branch blocks, as leaf blocks are not cached. I have a hard time believing leaf blocks are not cached, and that is something else I would want to look into at some point. Perhaps I merely misunderstood him. For Flaw #2, we have optimizer_index_cost_adj which tells us what percentage of a multiblock read is an index read (probably somewhere in the vicinity of 33%). However, for both issues, Jonathan suggests that neither be set if (a BIG IF) system stats have been collected and are correct. Jonathan pointed out a curious rounding error introduced with optimizer_index_cost_adj; the calculated cost will be rounded down, which can potentially lead to the wrong index being chosen. Event 10183 can be used to turn off cost rounding. On the topic of multiblock reads, Jonathan showed us a chart demonstrating how Oracle scales the parameter away from excessively high numbers. For a value of 4, the adjusted value may be 4.175. But for higher values, say 32, the adjusted value might be 16.407. 128, 40.85. Due to system stats, Jonathan is recommending that most people might want to turn off db_file_multiblock_read_count. In light of the importance of system stats, it would be good for us to review sys.aux_stats$, and or dbms_stats.get_system_stats().

Since clustering_factor has no meaning with bitmap indexes (think about it), it is overloaded to count the number of entries in the bitmap; "some key values will have multiple rows".

Even though there was only one slide and a short blurb about this, I was struck by the last point of the session. Using first_rows_n optimizer modes is really going to play havoc with explain plans. Costs and execution paths are essentially squeezed to fit into the first row target, much like doing a "where rownum <= n". This fact alone makes it very difficult to interpret exactly what the explain plan is attempting to convey.

Session 2: Join Mechanisms
Understanding Joins is important because Oracle attempts to convert everything into a two table join, including subqueries. This is not necessarily bad, it is just the way Oracle does it.

Jonathan's slides included a few examples of correlated and non-correlated subqueries. Jonathan's observation is that most "bad" join decisions are made because of cardinality being off, be it off by one or off by millions. Being off by one can go a long way. Which, in my mind, makes Wolfgang's Cardinality Feedback all the more important. Also, "off by one" can be really critical when you consider that perhaps the stats are just a tad old and new rows have come into the table, or have been deleted. That could be the difference between a "good" join and a "bad" join. Scary, eh?

There are a number of slides that talk about Nested Loop joins, but I am going to skip over to Merge joins. Actually, skipping all the way to One Pass sorts. There is an Urban Legend that all in-memory sorts are faster than disk sorts. Jonathan prooved this is not always the case. Up until 10.2.0.2, Oracle used binary insertion trees to maintain a map for merges; the bigger the memory, the larger this tree could potentially become (smaller memory would force a flush to disk). Because of such large sizes of binary trees, the CPU requirements shot up exponentially to maintain the index. Hence, with those versions, more memory actually made those merge operations far worse.

Jonathan had a caution about 9i 10053 trace events; the "Max Area Size" listing is completely wrong and useless, for it ignores the Workarea_size_policy parameter.

And more caveats. When Oracle executes asynchronous writes, the wait time is not recorded (because it is asynchronous). You have to keep an eye on direct_path_temp to see if the numbers are stacking up.

He introduced two trace events, 10032 and 10033. The former dumps statistics about sorts, the second traces IO for sorts to disk. Since file sizes are small for event 10033, it is not impractical to set it at the system level when diagnosing a sort issue.

Multipass sorts are extremely bad. Avoid them!! Setting sort_area_size large enough to avoid multipass sorts is highly recommended. He quotes Steve Adam's as saying you should be able to get by with a sort_area_size of 45mb to sort 12gb.

Hash joins are a bit interesting. Among other things, Jonathan talked about how the workarea_size_policy comes into play. If set to manual, each hash queue will get hash_area_size/2 memory. What happens if you have more than 2 queues? That's right, you have queues whose sum of memory exceeds hash_area_size. The worst part is that each queue gets this memory regardless if it is needed or not. Say you want to hash 2k of data. You still get hash_area_size/2 for each queue. Wastage!

Setting workarea_size_policy to automatic allows the queues to only grab what they need.

Jonathan also spend some time on Trace Event 10104, "One of the most useful traces in the Oracle suite". It gives a comprehensive break down of hashes, and can be used with Trace 10046 for even greater detail. For multipasses, look for terms like how much memory is available (Memory for slots, in bytes) vs Estimated Build size. Also "Total number of partitions" vs "Number of partitions that fit in memory". That might be your first clue. There is also a stat for number of rows iterated; the fewer iterations the better.

In the end, Hashes may scale more linearly, if AND ONLY IF you do not hit an exception, and there are a number of those. Nested Loops grow faster, but tend to be smoother especially if the index/data is buffered.

Session 3: Selectivity and Hints
The rules governing selectivity start off relatively basic (although still a little bit of math involved), but quickly get complicated in my opinion. The boundary conditions really throw you. For join selectivity, one thing to be aware of is that Oracle assumes predicate independence. Jonathan has bantered this in his blog (where I first read about it). For example, consider two columns, 'month' and 'number_of_days' which tells the number of days in the month. If you use "where month between 10 and 12 and number_of_days = 30", Oracle assumes that any of those months can have 30 days. We know that is not true.

Jonathan had a great example calculating cardinality in a sample query; I'll not repeat it here for the time being.

One gotcha to remember (among many) is that when the optimizer generates numbers for you, it is not necessarily against the sql query you provided. In fact, you might as well assume it is NOT the same query; the numbers are for the interally optimized plan, of which only snippets are available to you via the 10053 trace.

We also covered transitive closure. Jonathan has a bit to say in his book and blog, and the bottmline is that sometimes Oracle will remove a predicate that it thinks is redundant (ie, a = b, b = c, therefore a = c). Lucky you. This can often lead to Cartesian joins, or alternative access paths being used (remember, generally speaking, access paths should have higher priority than filter predicates).

There are also a number of sanity checks to be aware of. One that Jonathan demonstrated is a case where Oracle will not choose ndv values from opposite tables when doing a multi-column join. I am still reading those slides, as it is taking me a little while to comprehend the ramifications. However, one of the issues seems to be that merely making a cosmetic change to your code opens it up to the possibility of hitting one of these sanity checks, and has the potential to throw the optimizer a huge wrench. Just remember that there are no certainties.

Jonathan also had some very strong words about hints. In fact, he even goes so far to say that "A 'hint' is a directive that the optimizer MUST accept and act upon." Now, as Jonathan pointed out, sometimes in an act of obedience, Oracle will ignore the hint. *grin* Yes, sounds contradictory. If you give a full(t) and index(t) hint, by "obedience" Oracle will consider the plans that both hints dictate (yes, BOTH hints), and will give you the cheapest plan. That is the key point. Oracle uses a Cost-Based Optimizer, and will always generate numbers to determine the cheapest plan. If you use hints, you want to use enough hints to trim out all other plans that the optimizer might possibly choose. The 10053 trace is handy in figuring that out.

He gave two examples to showcase this. One with a noparallel hint on the table; Oracle found a cheap plan using parallel on the index (think of a naughty two-year old). In another case, say you specify parallel on a table, but the plan is serial. That is because the serial plan is cheaper.

In another example, Jonathan shows the use_nl hint, and specifies two tables /*+ use_nl(t1 t2) */. This does *NOT* mean nest loop t1 and t2 with t1 as the first table. No no no. It means /*+ use_nl(t1) use_nl(t2) */. Yes indeed. So it tells Oracle to use a nested loop on t1 and a nested loop on t2, but it does not specify an order. In the example it seems to work because Oracle only came up with two plans, and the one we expected just happened to be the cheaper one. "Sometimes you get lucky". Sometimes, even though you did something wrong, you got the results you were expecting to see. Be careful about that.

As always, be wary of "exceptions". There is a hint (and underscore parameter) that specifically says "ignore hints." Go scratch your head over that one.

The ORDERED hint is funny in that Oracle seems to apply it in the end game of optimization. Jonathan specifically gave us a Hash example in which the Ordered hint was obeyed, but the plan was in the oppositive order. Apparently. Watch out for subqueries because remember that Oracle likes to unnest them, and might screw up your seemingly simple order. In 10g, we can use the LEADING hint instead. Still have to watch out for query block names that are dynamically generated due to optimizations; it is possible to specify non-pre-existing query blocks if you know how Oracle is going to name them. If you do not, your order may be a little different.

10g also makes it easier to logical name the hint operands. You can say "index(t1(id1))" or "index(@sub2 t4@sub2(t4.id)" Obviously, those are not physical names, but labels. Jonathan has observed that it is easier to provide a negative hint than a positive one.

Bottom line, in order to use hints well:
• Set the join order
• Join method for n-1 tables
• Access path for every table
• Average two hints per table to do it well

Session 4: Telling the Truth
Last day, last session. =) Need I say more?

Some interesting things about this session. Jonathan gets into some strange anomalies and how they are avoided if you give more, seemingly irrelevant, information. For instance, defining columns (esp. mandatory columns) as "not null" helps the optimizer tremendously when using not null predicates.

The stats play a big role in "telling the truth". Old stats are just as bad as bad stats. However, Oracle does provide a way to synthesize stats if needed; I found this whole concept quite intriguing when reading his book and papers earlier. And it comes back to what is your "interesting" data? What data does the application/business like? Oracle has several rules for dealing with "normal" data, but it is entirely possible that you are interested in anything but. Which makes it important to identify the "important" data, and take advantage of the tools that Oracle has given us to help convery that to the optimizer. Stats, and possibly histograms.

Jonathan went back to the example for clustering factor. It is entirely possible to have data that Oracle thinks is scattered, when in fact it is merely grouped weird. That is a case of "lying" to the optimizer.

For the most part, the default stats work. Oracle CBO assumes a normal, boring distribution of data, so if that is the case, you are all set to go. When the data is grouped into "weird, wacky and stretchy bits", the stats and histograms may have holes. Sometimes is still does a fairly good job to compensate, and sometimes you just get unlucky. =)

He had an interesting blurb for histograms in OLTP. He is suggested that instead of letting the normal stats collect histograms, the front-end should be aware of "important" data, and code for it appropriately. Can you imagine Banner doing that? Neither can I.

I will have to wrap it up with that. I will try to over these notes from the Days with Jonathan Lewis, but I need a break.

## Thursday, November 08, 2007

### Day 2 with Jonathan Lewis: Writing Optimal SQL

"Writing optimal SQL" seemed a bit ambitious for this day's class. I get the feeling that Jonathan has a wealth of knowledge and it is hard for him to condense it into a mere 6-hour time slice (4 x 1.5 hour sessions). Here are my highlights and thoughts from the material covered today.

Session 1: Background
Jonathan started out by reiterating (for emphasis) the obvious difference between "local" optimization (or 'optimisation' as he writes) and "global" optimization; sure, we can make an individual query go faster, but at what cost? Which brings us to the point that one always has to make a choice on where one spends one's resources.

Today's session concentrated on global optimization strategies (any of which could be used locally if you think about it), which comes down to always keeping your eye open for wasted resources, whether it be expensive CPU operations, unnecessary buffer gets or multiple visits to disk. As with all things in life, there are exceptions, today labeled as Contradictions. For instance, sometimes to increase logical I/O, one must increase physical I/O, and vice-versa. Jonathan gave the example of a typical index read. Say it takes 1 header block, 1 branch block and 4 leaf block reads to get all data. It is possible that the same data could be gotten with a FTS of 3 block read requests. Which one is better? What's in cache?

For the most part, the most efficient (and hence "best") way to get data is to got it precisely. This goes back to the little note yesterday about using access predicates in favor of filter predicates; if you can get only the data you want on the first pass, there is not need to filter data later on (and thus discard data gotten). And even though indexes usually help us target precise blocks (via rowids), they are balanced out by the fact that they generate overhead for DML.

Jonathan had a very interesting section on column ordering in indexes. There is a widely-held thought that you put the most selective columns at the front of the index. Overall, the focus needs to be on how dense the data retrieved from the index is, which may require that you put the most repeated column first in the index. Jonathan states that, "all other things being equal", you generally want to put equality predicates at the front of the index, range predicates at the rear. Be as precise as possible. That word is going to come up again and again.

Which got me to thinking. How does one determine what "business critical" data is? What is the "interesting" data? This question is made more convoluted in a large ERP system. Surely, it is not practical to go table by table to figure out what data is most often used, most often needed. And I have this sinking feeling that the vendor is not too up-to-par on an answer for this particular question either.

There was mention of index-(hash-)join and and-equal joins; I am not very familiar with the latter and am making a note to learn more about it.

We got to talking about bitmap indexes. Bitmap indexes are great for static or "mostly read-only" tables; bad for OLTP systems. To bridge this gap, Oracle has made Bitmap Conversions more easily accessible for B-tree indexes.

Again, Jonathan made mention of using dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'). I want to do more research on that; for it seems like an awfully wonderful tool to use. Must enable row source statistics (again, there is a very handy query-level hint, /*+ gather_plan_statistics */).

Jonathan also showed us some very interesting queries for metadata analysis.

select blocks, count(*)
from (
select /*+ index(t1 t1_pk) */
my_id,
count(distinct substr(rowid,1,15)) blocks
from t1
group by my_id
)
group by blocks
order by blocks
/

This query counts the number of blocks for each value (knocks off the row number from rowid to get file id + block id).

select /*+ index(t,"T1_I1") */
count(*) nrw, -- number of rows
count(distinct sys_op_lbid(49721, 'L', t.rowid)) nlb, -- number of leaf blocks
count(distinct hextoraw(sys_op_descend("DATE_ORD") || sys_op_descend("SEQ_ORD"))) ndk, -- number of distinct keys
sys_op_countchg(substrb(t.rowid,1,15),1) clf -- clustering factor
from "TEST_USER"."T1" t
where "DATE_ORD" is not null
or "SEQ_ORD" is not null
/

Talk about a little bit of black magic! I get the impression Jonathan ripped this out of a 10046 trace for calculating index stats, based on some of his non-standard uses. The last argument of sys_op_countchg should be the number of freelists; 2 through 16 if using ASSM.

There was a great word about documentation, especially since we have embarked upon a broad documentation project at work. Among other anecdotes, there was a good observation that "The harder it was to fix, the more documentation you need".

Session 2: Basic Coding
I did not like this section that much. Perhaps it was just me. But it seemed like a lot of the "standards" are quite subjective; is Jonathan teaching us his own peculiar style of coding?? =)

I did come away from this session with a few things, however. Due to Oracle's mostly-successful (dare I say, Mostly Harmless?) CBO, one can usually write a pretty intuitive sql and have it optimized properly. Sometimes, when we try to get clever and optimizer it ourselves, we botch it, or duplicate the work that the CBO would have done anyway, at the cost of making the code even more obscure and hard to read. As Occam's Razor infers, Keep it simple.

As popular as analytics are becoming, they are most definitely not the end-all be-all. They are meant for number crunching on small data sets.

Jonathan pointed out a rather bizarre observation of subqueries; Oracle assumes subqueries will reduce the rows to 5% (not BY 5%, but TO 5%). "This is insanely stupid. Sometimes." Yes, Jonathan qualifies everything he says. Usually.

Session 3: Sundry Methods
To kick things off, we dove into examine the costs of Nested Loop (NL) joins vs Hash Joins. Essentially:

Cost of NL = C1 + N * C2
Cost of HJ = C1 + C2

For a nested loop, C1 is the cost getting rows from table 1, C2 is the cost of getting rows from table 2, per row from table 1. Hence, if we can keep either N (the number of rows gotten from table 1) or C2 down, the total NL cost is relatively low. We can keep C2 low by using indexes and/or getting at the data precisely.

For a hash join, we are doing full table scans. We hash the join columns (and store the hash results in memory), the use the second table to probe the first, hashing the join columns on the second table to see if they match the hashed versions of the first. Note that the growth of cost is linear, whereas for nested loops is not (multiply by N).

Another problem with nested loops is that Oracle assumes that the tables are statistically independent. Jonathan (and others including Alberto Dell'era) have made a number of observations about this online.

One thing to keep in mind about hash joins is that it greedily gobbles memory. Do not try to make a large table the first table of a hash join.

Another important part about writing sql is that it helps to have a clear description of what exactly the query is to do, and to have a picture or diagram to further enhance the description. Sometimes a query is not optimal merely because of the way the "english" description implied how the query get data. There is more than one way to skin a cat.

Jonathan used an example of a "typical" FIFO index supporting a workflow process, in which a status is pushed through several states. The index will eventually have a long tail of mostly empty blocks, as all the activity is on recent timestamps. To address this issue, it helps to 1) only index relevant data instead of empty of null data using a function-based index, 2) have multiple smaller indexes. The benefit of #2 is that it is much easier to coalesce it, thereby keeping it well maintained. While talking about FBI, Jonathan quipped:

Encapsulation is wonderful. Hiding things is naughty.

Repeating a horrid modern-day developer practice of pushing procedural logic into sql, Jonathan touched on filter subqueries (where you have one subquery that handles one case and another subquery for another case). Jonathan made mention of the lnnvl function, which stands for "logical not null value"; a good way to test the value of a potentially null value.

He also talked about several "what if..." scenarios. In these, he states things like "if it is cheaper to get into table 2 through table 1...". I am not quite sure what he means by this. Something I need to follow up with him about.

We also covered the "untunables", or particular queries that did not lend themselves to being optimized by Oracle very well. Jonathan went to details to say that Oracle prefers and forces queries into a "left-deep" tree. Some queries actually perform much better when the access paths and join order is not "left-deep", but instead "bushy tree", or more balanced. This was showcased by a query with a 5-table join, with entries at either end. A "bushy tree" works towards the table in the middle, thus reducing the rows needed to propagate predicates and join conditions.

Session 4: More Methods
Jonathan showed us a fairly simple example of query rewrite using dbms_advanced_rewrite.declare_rewrite_equilvalence. This really does raise some questions; like, how different can source_stmt be from destination_stmt? Does it strictly follow the same rules as cursor sharing? Or can you completely rewrite the query as long as the columns are of the same datatype?

We also looked at subquery caching. I had a great talk with Ric Van Dyke (awesome dude!) afterwards about this. I was curious if there was any way to see or monitor subquery caching in affect, other than attempting to guess at its potential use via results in an explain plan. He did not know of a way, but guessed that it might possibly be exposed somehow in one of the x$tables. I find subquery caching to be quite interesting, if for nothing else that it can really play havoc with your explain plan and make them unexplainable unless you really know what is going on. Surely, there is a way to tell when your plan used subquery caching, and to what degree. Jonathan showed us "update join views", which I thought were quite clever. I have used dynamic materialized views (using the "with" clause, or "out of line views") to generate a view from a table join for an update, but having an update join view seems much more elegant. It does have some strict requirements; the driving table must be key-preserved, meaning that the column we want to update is deterministically joined. We looked at pagination, aka judicious use of rownum, to efficiently get top-n rows. Another related trick (for more complex situations) is to manipulate index rowids directly. This happens when the logic requires that you get and sort all data prior to using rownum. One can do all sorts of sorting and aggregation, returning index-only columns and the rowid, then take the rownum from that and access the table using the relevant rowids. Quite clever that. We took a brief glance at OEM, and moved quickly to profiles and outlines. Cool stuff. When talking about profiles, Jonathan made it a point to implore that we ask our Sales/Tech Reps when (NOT if) opt_estimate would be documented and supported. Profiles are essentially a series of opt_estimate hints that alter existing statitstics. How cool is that?!? =) Jonathan ended on a note about hints. In general, hints are bad. Don't use them. But sometimes you have to, and when you do, make sure you do not hold back. A lone hint on a 7-table join is next to useless; it is better to provide at least 6 hints to determine the join order, plus anything else to determine the query plan you want. Or to put it a different way, and average of 2 hints per table, 1 to determine join method, one to determine access path. Jonathan had some interesting things to say about hints, his seemingly prolific use of them notwithstanding: • They are badly documented • Nobody knows what they do • They change usage occasionally Jonathan really likes to say that "sometimes you stop getting lucky". Meaning that sometimes you use something (a hint) incorrectly but, lucky you, it happens to work the way you think it should. When it "stops working", it is actually working correctly and you just stopped getting lucky about it. That is about it for today. My ass is really hurting after sitting in those chairs. ## Wednesday, November 07, 2007 ### Day 1 with Jonathan Lewis: Execution Plans Here is my rough draft from notes I took during the first of Jonathan Lewis's 3 1-day CBO classes. These are merely highlites, for I could not possibly be comprehensive. Session 1: Finding execution plans If a database has been upgraded from 9i, the plan table may be of an old version. Additionally, it will not be global temporary table. You can run @?/rdbms/admin/catplan.sql to set it up properly. This also creates a public synonym and grants acces to public. When tracing, helfpful to use query block names (qb_name) in sql to identify later on in trace file. The 10132 trace is new, and provides a great alternative to running the 10053 trace (much more condense and practical). Great for baselining all query plans when set at a system level. tkprof is a little tricky - traceonly does NOT mean that the query is not run; it only eliminates the resulting output of a query. Hence, inserts, updates and deletes still execute, but a straight select will not do much (no output). Can format autotrace formatting with _plus_exp columns. The pipelined dbms_xplan.display function is very nice. There are several other functions in that package I was not aware of, like display_cursor, display_awr and display_sqlset. display_cursor is great for getting the plan of a cursor currently in memory. Since the function is pipelined, it can be used in a table join operation. Jonathan showed a greate example finding the query plan for all cursors that contained a FULL TABLE SCAN. Jonathan's tests with reporting sql also showed easy ways to implement query-level optimizer paramters via hints (ie, /*+ opt_param('parallel_execution_enabled','false') */). Rowsource execution statistics can be generated by setting statistics_level = all, or more surgically with _rowsource_execution_statistics = true. Can also be set with a hint (/*+ gather_plan_statistics */). What I really like about rowsource execution stats is that you can easily generate Estimated and Actual stats, and can use something like Wolfgang's Cardinality Feedback to get a fairly good idea where performance problems may be popping up. The "ALLSTATS LAST" parameter of dbms_xplan.display_cursor displays this quite nicely. I need to do some more research on this function so we can use it more readily. Great tool for developers and DBAs alike. Finding cursors in v$sql is made efficient with hash value and sql address. Unfortunately, 10g statspack has an "old" hash value that is inefficient. Do not use it.

When looking at parsed cursors, especially via sprepsql, the version count can be quite useful. Since the stats are aggregated, you have to be careful about summarizing and translating what the stats actually are trying to say about the query.

Any "explain" may be untrue.

Like House says, "Everyone lies."

Section 2: Interpreting Execution Plans

The good news about reading explain plans is that it is (should be and can be) much easier than it might first look. For the most part, it comes down to the good old standby, "Divide and Conquer". To that extent, Jonathan has two basic rules when breaking down plans, which essentially follow the concept of finding all your child cursors until you run out of descendants (think left tree-walk). The bad news is that there are, as always, exceptions. But the rules cover a vast majority of plans. And they are simple.

There are no complicated execution plans, only long execution plans.

First off, it really helps to look at the query plan with some aids. The PID (Parent ID) column is essential. Having the children indented incrementally also helps.

RULE 1 - Single-Child Parent
The child follows the parent and feeds the parent.

RULE 2 - Multi-Child Parent
The parent is an operation that determines how the children's results are combined.

All explain plans are a combination of Rule 1 and Rule 2. So what you do is break it down using those rules. That is what this section does; I'll leave it to you to go through Jonathan's slides yourself. =)

Rule 1 is relatively free of tricks. Rule 2, however, is a lot more convoluted when you start looking at the various different operations, and the various ways those operations are implemented. So while it may be "easy" to group them together, interpreting them takes a little time to understand the myriad subrules. However, the context usually helps that a bit. For instance, if you see a BITMAP AND operation, you are likely to see BITMAP CONVERSION operations; knowing what an operation is expecting (ie, ROWIDs, bitmap numbers, etc) helps you figure out what the child operation is passing up. I did not take many notes on this section because the slides are quite informative and easy to grasp once you see them.

When talking about filter predicates, I learned that access predicates are much more preferred if they can replace filter predicates. The reason being that filter predicates usually happen after you get the data, while access predicates determine how you get the data. This can be showcased by an inappropriate range scan; if you have an index with 3 columns and the query only specifies column 1 and column 3, the optimizer has to do a range scan for column 2. If the query can be adjusted to included column 2, the optimizer no longer has to do a range scan, and can move the filter predicate to an access predicate. Basically, without column 2, the range of "interested rows" in the index is bounded by the smallest value of (column 1, column 3) and the largest value; a range which may (quite possibly) include values that you are not really interested in (and have to be FILTERED out). If you define all columns, you only grab the rows you want in the first place.

select * from t1 where id = nvl(:b1, id)

"The ideal is to go back and smack someone."

The execution plan has to include a path for both :b1 is null and :b1 is not null. Instead, the application should have two different queries for these conditions. Additionally, the above code does not catch rows when id is null.

Jonathan stressed the idea of watching your query plans to see if the optimizer uses any unusual enhancements, so that one may learn from them. For instance, Oracle has a MIN/MAX optimization for indexes. This is especially useful for correlated subqueries that can be optimized by (redundantly) copying a predicate from the parent query that just happens to contain extra columns of an existing index. Sometimes work will be completely removed, like in the case with ORDER BY elimination.

Section 3: Problems with Plans

So now that Jonathan has razzle-dazzled us with all this awesome stuff, here comes some bad news.

• Explain plan may be telling lies
• Trace files don't always have stat lines
• tkprof might not dump the stat lines even if they are in the trace file
• tkprof may dump the "wrong" stat lines
• v$sql_plan might be out of date • v$sql_plan can have holes

Explain plans are not aware of the datatypes for bind variables. Hence, if you are trying to get an acurate explain plan, it is a good idea to explicitly declare the datatypes of bind variables with a cast or to_ function.

Jonathan made some interesting comments about "good" applications. In particular, he stated that they do not close their cursors. We have been telling our developers to close cursors. Why is that? I think the statement needs to be taken in the context, or at least the assumption, that the cursors will be held open for the intent of reusing them, opposed to forgetting to close them when not needed. When cursors are not closed, their stat lines are not dumped to a trace file. Hence, if you want plans and statistics for a unclosed cursor, have to get them out of v$views. In cases where a cursor has multiple versions and/or executions, tkprof will only report on the first one in the trace file, even though it summarizes the stats for all of them. It is entirely possible that one bad execution caused a majority of the "bad" stats, and that the explain plan in tkprof is for a completely different execution. While showing us some sql to demonstrate subquery factoring, he showed us his materialize hint (/*+ materialize */). There is also a inline hint that does the opposite. Scalar subqueries are slightly problematic in that they are not costed properly, due to the fact that Oracle does not know how many times it will run. Jonathan talked about "hash table cache" and "scalar subquery cache", both of which I am quite unfamiliar with, and want to research them a little more. Oracle will automatically cache hashed results of a scalar subquery so that it does not have to execute the subquery over and over. However, if two values of the subquery happen to hash to the same address, the first one in wins, and the second value has to be calculated again and again and again... Additionally, Oracle has introduced "order by" elimination with 10g. Unfortunately, sometimes "order by" is quite useful when dealing with scalar subqueries and thus one may need to use the no_eliminate_oby hint. There is another hint that can be used to swap the join inputs of a hash join (/*+ use_hash(t3) swap_join_inputs(t3) */). 10g also allows use of the no_swap_join_hint to further fine-tune exactly which table you want to be thrown into memory as the "first" table. Section 4: Advanced Features The MIN/MAX optimization that we talked about earlier has another nice little feature. It is not a filter predicate, but rather an access predicate, meaning it gets exactly the one little piece of data that it wants and nothing else. Have to watch query plans. If you expect an optimal subquery but see a filter operation for it, you may want to re-evaluate it. Remote operations are quite an interesting phenonemon when it comes to query plans. Although remote statistics are fetched, they are not fully used when generating a query plan. For instance, a 3-table join may be ordered such that a remote table is joined to a local table, which is then joined to a remote table. How much sense does that make? In most cases, the query will perform better if remote operations are done in as large a chunk as possible (and reasonable), instead of making many small trips back and forth. Parallel operations are quite unique. Contrary to popular belief, they are not always a good thing. Parallel query is designed to be brutal and inefficient. I believe Jonathan said this because the idea of parallel query is to use up more resources (ie, more CPU) than a serial query would. In fact, it attempts to max out the CPU in an attempt to get the query done as fast as possible. I think he even said that parallel query should never be used in an OLTP database, but I did not write that down, so I could be wrong about that. Anyway, parallel query has to be carefully tuned, for it is far too easy to thrash your system. Looking at an explain plan and trace files, we have to be careful about what we find. The infamous "idle" waits associated with parallel operations may indicate problems, or the may not, it depends on where they are happening. If an explain plan shows many P->P (parallel to parallel) operations, that could easily be a source of contention. Much better to have PCWP (forgot to write the definition) operations. In the case of a HASH join, you do not want to broadcase a large table to all the parallel query slaves; instaed, broadcast the small one (for memory) and read from the disk directly for the larger table (and joined locally). For nested loops, it is better to join and aggregate smaller chunks at a time, thus reducing "an entire layer of slaves". Near the end, I had trouble keeping up with my notes (both because my butt was sore from the hard chair, getting tired, and being overwhelmed with so much information). We moved on to partitioned queries, and talked about the benefits of being careful when designed partitions and how to take advantage of parallel query. To fully realize the potential of partition-wise joins, it is best to partition early in the plan so that correllated, similar work can be parallelized. On a slightly off-topic of partition statistics, Jonathan mentioned: You ought to be faking your statistics. Sometimes. Especially true since subpartition statistics are not dealt with properly at all. And finally, if you see a "PX SEND BROADCAST" in your query plan, that automatically means you are not doing any partition-wise joins. They are mutually exclusive. One last quote from Jonathan Lewis: You stop getting lucky In reference to performance problems that happen all the sudden even though "nothing changes" ## Friday, October 26, 2007 ### I hate comment spam I finally made it to the state where I qualify for comment spam. GRRRR. Where is the "do not call" list? My apologies for having to turn on capcha. ### Redo log gaps with RAW disk: There has got to be an easier way... Of course, I admit right off the bat that we are still new with RAC, ASM and Raw disks. I have been asking around trying to find out what the "common practice" is when resolving the redo log gaps between a primary and standby database, specifically when using RAW disk. I think RAC just makes this scenario a little more complicated by the fact that you have two redo streams you have to keep an eye on. So, here is the beef of my complaint. Let's say your archive log destinations are explicitly set for the standby SERVICE, letting the other destination "default" to the db_create_file_dest, which just happens to be the +DATA diskgroup under ASM (a RAW disk). As an aside, I am curious why it does not default to the db_recovery_file_dest, but that is a post for another day. Let's say your standby is down for whatever reason, and eventually you want to bring it back up. Oh yeah, there is no Data Guard, so forget about "automatic" gap resolution. How are you going to copy the archive logs from the primary to the standby? There are no copy commands in ASM. At least, not for copying from a diskgroup to a device managed by a different file system, say, a cooked filesystem. One could use dd. But I would have to do a lot of reading to figure out what the offsets are, not to mention that the archive logs will most likely not be nicely laid out in contiguous blocks that you can gobble up with one dd command. The only option is to use RMAN. RMAN stands for Recovery Manager. You are going to use the Recovery Manager to copy files. Oh joy. I am still searching for the best way to do this, but this is what I have now. Mostly for my own reference. =) First, find the last log sequence number that was applied to the standby. Add 1. Call this minseq. If you have multiple instances, call this minseqN, where N = the instance number. On the primary site: rman target / RMAN> backup archivelog from logseq :minseq; Again, if using multiple instances: rman target / RMAN> backup archivelog from logseq :minseq1 thread 1; ... RMAN> backup archivelog from logseq :minseqN thread N; [ for now on, I will assume one instance because it is easier to type ] Next, restore the archive logs to a cooked filesystem. In this case, I am going to use a standard ufs3 mount called /u02/backup. RMAN> run { 2> set archivelog destination to '/u02/backup/'; 3> restore archivelog from logseq :minseq; 4> } Your archive logs are now "normal" files that can be accessed with normal OS commands. Copy them over to the standby site (we use scp). Wherever they end up on the standby site, I find it helpful to make a listing (including full path), one file per line. You will see why in a moment. On the standby site, recover the standby database: SQL > recover automatic standby database; When you are prompted for a archive log, select the appropriate member from the list you made; you can copy the whole line and paste it into the prompt. Repeat until you cannot find any more entries on your list. And Viola! You have resolved the archive log gap. Finally. Some notes If you know of a better way to do this, please let me know. =) I would love to do the whole thing from RMAN, but I have not figured out how to recover from specific archive logs; I keep getting the unhelpful "RMAN-06054: media recovery requesting unknown log:", even if I specify a archivelog destination. Obviously, I need to learn more about this aspect of RMAN. I left out a small fact; we are not using a recovery catalog for this operation. Why not? In retrospect, it probably would have made things easier. But we are migrating away from a recovery catalog, so that is what we started with. If it turns out that using a recovery catalog is going to help us significantly, I would make a case to implement it. I believe it would have resolved the RMAN-06054 above, but I am not yet positive about that. ## Tuesday, September 25, 2007 ### RAC: "How not to design an application" We have a new Luminis database that was being tested for an upgrade (4.01 or something like that). We found that we were constantly maxing out db_recovery_file_dest_size (database hangs, look in alert.log). The following is what I sent to my team members. In RAC class, Andy warned us about the redo logs and how they can be a source of contention for some very basic things. Let us use LUMDEV as an object lesson. Luminis has this little heartbeat function (I do not know the official name or the official purpose, but this is my observation) that updates 1 row of 1 table; a table with only 1 row. It does this continually. Better yet, this is execute not from one session, but from 7 sessions! So you have 7 sessions banging against this 1 row of this 1 table all the time. Obviously, the data in that row is not meant to stick around a long time. Usually, in a normal database this is not such a big deal. Might cause some waits because of an exclusive transaction, but nothing that causes the rest of the database to wait or contend. In RAC, the world changes drastically. Andy taught us that each instance masters its own list of global cache locks. These locks exist for every single block in a database (you can imagine what a nightmare it is for large databases). The list of master block locks are split up evenly between all participating nodes (and must be remastered when a node is added or deleted). These locks are often upgraded and downgraded (exclusive and shared), and often sent across the private interconnect (fancy word for NIC) to other instances. When an exclusive lock is requested on a different node than what is already holding the exclusive lock, the lock must first be downgraded, then re-upgraded for the requesting instance. Every time a lock is downgraded like this, the redo log buffer must be flushed. See where this is going? You have 1 table with 1 row, meaning one block. Because the application is repeatedly requesting an exclusive lock from both instances, the global cache lock is constantly downgraded and upgraded. This produces many many many many many many archived logs for all those log buffer flushes. LUMDEV can easily consume over 18gb of redo space on this one operation alone. Crazy, eh? On a similar thread, I am not all that crazy about the dual management of both the ASM FRA diskgroup and db_recovery_file_dest_size. Here is what I posted to the ASM Oracle Forums: For those of us who run multiple databases on a box with shared disk for FRA, I am finding the extra layer of ASM and db_recovery_file_dest_size to be a minor inconvenience. The Best Practice white papers I have found so far say that you should use db_recovery_file_dest_size, but they do not specify how you should set it. Currently, we have been setting db_recovery_file_dest_size rather small, as the databases so far are small and even at 3x the database size, the parameter is still significantly smaller than the total disk available in that diskgroup. So, my question; is there any downside to setting db_recovery_file_dest_size equal to the total size of the FRA diskgroup for all databases? Obviously, this means that the amount of free space in the diskgroup may be consumed even if db_recovery_file_dest_size is not yet full (as reflected in the instance V$RECOVERY_FILE_DEST). But is that really a big deal at all? Can we not simply monitor the FRA diskgroup, which we have to do anyway? This eliminates the need to worry about an additional level of disk management. I like to keep things simple.

The question is relevant to folks using other forms of volume management (yes, I know, ASM is "not a volume manager"), but seems germane to the ASM forum because most articles and DBAs that I have talked to are using ASM for FRA.

Most importantly, what ramifications does "over-sizing" db_recovery_file_dest_size have? Aside from the scenario above.

I am really toying with the idea of making the recovery destination super-large and relying on the management of the FRA. To check space issues in the FRA, I can use asmcmd (the "du" output sucks, IMO) or query against the instance itself:
+ASM1_SQL > select FREE_MB, trunc((TOTAL_MB-FREE_MB)/TOTAL_MB,2) "% Space Used" FROM V$ASM_DISKGROUP2 where NAME = 'FRA';FREE_MB % Space Used---------- ------------ 43426 .37 I am still looking for a way to generate a "du"-like report for all databases in the FRA. In the database, I can check the amount of space in the recovery area: SQL> select (SPACE_LIMIT-SPACE_USED)/1024/1024 "Space Left (MB)", trunc(SPACE_USED/SPACE_LIMIT,2) "% Space Used" from V$RECOVERY_FILE_DEST;Space Left (MB) % Space Used--------------- ------------        18624          .04

I would hate to have to login to each database to find out how much recovery space is left before the database hangs.

UPDATE:
Arul made a most informative and excellent response:

Basically, set your db_recovery_file_dest_size appropriately, and you will not have to worry about the ASM diskgroup at all; let Oracle maximize/optimize the space when the parameter set to size that tailored to the needs of the database.

## Wednesday, August 22, 2007

### New blog buddies

For some reason, blogger is not letting me add new elements to the blogroll. And I do not have time to ferret out the issue.

Anyway:
Dan Norris
Jeremy Schneider
Kirk McGowan

Seems like there was one other....

## Tuesday, July 24, 2007

### RAC is not easy

I have a good friend to whom I like to complain a lot (sounds like my good friend is getting the short end of the stick, doesn't it?). With good intentions, he sincerely suggested that RAC is easy. I put in a fence, working 43 hours over a 4-day weekend. Shall I call that easy? It only lasted 4 days. He is still a good friend, he just thinks everything is easy. *grin*

K Gopalakrishnan once said:
Believe me RAC install is very simple and straight forward if you religiuoly complete the pre requisites.

Ok, enough of the griping. My hopes in posting up some of our specific dilemmas is to document what we are doing, since so often I try to google for specific errors and never find anything useful (I am not counting the Chinese sites, because even after Google translates them, I am not sure I see any resolutions apart from the questions).

So, first problem. I am following Metalink note 357261.1, very religiously. That was a mistake, as it is not complete, yet. Anyway. You attempt to remove asm:
srvctl remove asm -n urbdb1,urbdb2

You check to make sure it was removed:
srvctl config asm -n urbdb1
+ASM1 /u01/app/oracle/product/asm

It is not, so try again:
srvctl remove asm -n urbdb1
PRKS-1033 : Failed to remove configuration for ASM instance "+ASM1" on node "urbdb1" from cluster registry, [PRKS-1023 : Failed to remove CRS resource for ASM instance "+ASM1" on node "urbdb1", [CRS-0214: Could not unregister resource 'ora.urbdb1.ASM1.asm'.]]
[PRKS-1023 : Failed to remove CRS resource for ASM instance "+ASM1" on node "urbdb1", [CRS-0214: Could not unregister resource 'ora.urbdb1.ASM1.asm'.]]

What do you do?

Oracle Support has told me that crs_unregister is buggy and not supported. *cough cough* But I am going to attempt it anyway, since Bill Wagman had some luck with it (if you follow the discussion from oracle-l, you will see that Peter McLarty suggested it).

/u01/app/oracle/product/crs/bin: crs_unregister ora.urbdb1.ASM1.asm
CRS-0214: Could not unregister resource 'ora.urbdb1.ASM1.asm'.

/u01/app/oracle/product/crs/bin: oerr crs 214
214, 0, "Could not unregister resource '%s'."
// *Cause: There was an internal error while unregistering the resource.
// *Action: Check the CRS daemon log file.

Grrr... How quaint, check some log file somewhere on your system, and that will solve all your problems. Having no idea where my "CRS daemon log file" actually is, I use RDA to browse around and finally come up with /u01/app/oracle/product/crs/log/urbdb1/crsd/crsd.log. Unfortunately, the CRS daemon log file is not helping me much. What am I looking for?

Update: 5:19 pm
After a day of reading manuals and discussing options with the fine folks on oracle-l, we still have the same problem, albeit now I have quite a few new tools on my belt. Yes, RAC is not easy, I think I have proved that. To be a little more granular, working with the OCR is a pain in the butt.

So, new tools.
• strace: a very low-level OS trace utility. I did not benefit from this, but I was able to show the output to others smarter than I. I used it on srvctl and crs_unregister.
• The "force" flag (-f) of certain commands, like srvctl. I believe it removed something, but I do not know what; I still have my root problem.
• Appendix A of the Clusterware Deployment and Admin Guide: has a ton of information, most of which would probably be helpful under "normal" circumstances. Did I mention we still have our root problem? However, I have to give credit to the authors, for they did a great job. There is a lot of information about log file locations (wish I knew about that earlier), how to debug various components and resources, and some descriptions of the syntax used for commands. I thought the OCR section was quite thin; perhaps I am biased because I am looking for a specific solution.
• SRVM_TRACE=TRUE: This is documented in the above Appendix A, but I point it out because it spews out a bit more information. While not immediately helpful, it seemed like something that I should file away.
• USER_ORA_DEBUG: mentioned one time in the Appendix, I found out that you could crank this all the way to 5. I have no idea what it does or what the appropriate values are - google is not giving much on it, yet.

More to follow. My Support Analyst just requested that I reboot the nodes because he has no idea why the resources exist in the OCR, as ocrdump does not list them.

Update: 11:22 AM, Wednesday
LS Cheng on oracle-l pointed out what ended up being the winning goal.
crs_stop ora.urbdb1.ASM1.asm

I still do not completely understand why this is an issue. Or even how one determines that is the solution. I hope to hear back more from LS Cheng so we can understand how he arrived at that conclusion.

What we ended up doing was restoring ocr to a point before we attempted to follow note 357261.1. Since the services were already down, it was straight-forward to delete the databases, the ASM instance and finally the ASM database. I was actually surprised it worked so well, given all the problems and headaches we had yesterday.

I will add another update when I learn more. Right now we are happy we have a RAC install back in working condition and can move forward with our projects. Oracle Support did not score any points in this round. This is becoming a bad trend.

## Monday, July 23, 2007

After a week of vacation, I came back to work to wrap up some Service Requests with Oracle Support and query for a status on others. During that process, I was made aware of a "new" vehicle for patch delivery, documented in Metalink note 438314.1. Apparently, Oracle is trying something new with NApply CPU patch; it sounds promising, and I sure hope it successfully addresses the issues from past CPU patches and the difficulty with merges, as proudly declared in the note. What I do not quite understand is that there are still going to be Super Merge patches; we have a particular bug that did not make the April CPU, so it was being worked on for the July CPU. It missed that, so we were told it was going in the July Super Merge patch. How does NApply deal with that?

What I am most confused by is the claim in the first set of bullet points:
Elimination of the CPU Merge Patch as a vehicle for patch conflict resolution

What exactly does that mean? Obviously, there will still be merge patches; just read further down in the note and you can see how a customer has to report a conflict, and Oracle will supply a merge patch. That is the documented procedure. Perhaps what Oracle is trying to say is that they will not ship a CPU that has your particular patch merged in, but rather a small little merge fix (aka "bandaid") that will allow the existing/original CPU to be applied. That sounds like a good thing. I remain skeptical and want to see it in action, though. *grin*

And on another note (literally), I see that the draft 10.2.0.4 bug list has been published. 1329 undocumented bugs!! Aside from all the "documented" bugs, we are hit with 1329 bug fixes for which we, as a customer, have no clue. 1329 is a big number. How is it that there are that many bugs in the 3rd patchset of its 2nd major release of its newest version of the flagship software? I am baffled.

I long for the day when Oracle has to post a page on OTN that says:
We apologize for the lack of patchsets in the past year; there simply has not been enough bug fixes/patches to justify a large patchset

And to beat another dead horse, how is it that all those bugs are "undocumented". Isn't there a more appropriate word for "fixes and patches for bugs that we are not willing to share any information whatsoever about"? "Unpublished" would be a better word. And granted, I am not going to read through all the "documented" bugs, let alone 1329. I just have a problem with "undocumented" bugs.

Did anything significant happen in the year 1329?

## Thursday, July 12, 2007

### Kevin Closson's "Manly Men"

Kevin has a number of articles about "Manly Men", or rather, folks who do not know better than to deploy RAC on complex configurations utilizing SANs. Like me. I encourage you to check out his whole series (most recent is Part VII).

After reading his thoughts and corresponding with Kevin, I am compelled to question why we have gone this route.

### I hate creating SRs

I seem to create a lot of them. Oracle [Support] Configuration Manager is supposed to help with this. In my latest note to a Senior Customer Support Manager, it was recommended that I create an SR to resolve the issue where my Configuration Manager is not working correctly.

Just shoot me now.

I'll add more. Later. I am just feeling quite SR weary. What I really want is a really smart secretary who could initiate these for me. That way, the Oracle Support front-line could deal with my front-line, and we would all be happy. Right?

## Tuesday, June 26, 2007

### RAC: The big install

So, we are attempting our first "real" RAC install; not canned, not pre-fabricated, but using only software downloaded from OTN and following Oracle Documentation and various forms of cliff notes. This is one of those things that is really sweet if it works 100%. Otherwise, you are in for a headache. We have a headache.

That RAC class was good for teaching things, but it also perpetuates a false sense of security when things go wrong. And from what I can tell from all the notes and pleas for help out there, things go wrong often. One of the mantras I hear is to follow the documentation exactly! This is all good, but the documentation itself comes in many forms. Do you follow what Oracle has said, or do you pick some expert (K Gopal or Julian Dyke) and follow what they say? Cluster Verify (cluvfy) is also a little misleading; it will not check for 100% compatibility with the installed RPMs. In fact, I even had one Oracle Support analyst tell me that was the DBAs job. That is a lot to swallow. Take a DBA who does not know anything about Linux and tell him to verify that 24 RPMs not only exist, but are compatible with the required RPMs. I tried to write a script for it, but in the end, the only "failsafe" way to do it is by hand. I say "failsafe" because human error plays a large role in these RAC-related problems as well.

It would seem to me that one good way to eliminate, or at least reduce, human error is to automate. Dell IT has taken this to extremes and automates a vast majority of their day-to-day tasks. Checking the RPMs is just a small fraction of something that could easily be automated. What about user equilvalence? What about all those silly root scripts? Or running oracleasm to configure and create disks by hand? What boggles my mind is that 10g RAC does so much that is really cool and automated; when the sun shines, life is good! Why are some basic things left out, but you have some nifty tools like cluvfy that is really slick at verifying a good chunk of your install work?

Ironically, our CRS installation was hunky-dory. The rootpre.ksh was a bit weird (why is it checking for 9i CM??), and double-checking all the paths and homes is about the only thing that slowed us down. Things went south when it was time to install ASM. Our first warning flag was that the swap space was not big enough. Thinking it was a red herring, we ignored the warning. Later on, after the software was installed and the configuration assistants were running, we hit our first major roadblock; link not satisfied on njni10. Not much that seem relevant on google or metalink. Oracle Support told us to attempt the installation again. Now think about this; the analyst assigned to us specializes in NetCA (that is why we filed the SR). This guy tells us to simply re-install ASM. Having had ASM problems in class, I was not exactly happy about that. Remove Oracle Homes, zero out raw disks, make sure no processes are running, and away we go. This time around, ASM cannot see all the disks. So when I tell my support analyst that we have new problems, he has to bring in a database specialist because the original guy does not know anything about ASM. What a joke! On top of that, he "reminds" me to keep the scope of the SR to one issue. GRRR!!! Of course, we are subjected to the usual onslaught of new questions and request for an RDA. I am actively ignoring them. We were able to work around a large number of our problems, but in the end, we want to simply wipe the slate clean and start over.

Deleting everything and wiping the slate clean is not easy. No sir-ee. This is where having root privs come in really handy, because of someone's ultimately wishful thinking, the CRS Oracle Home is installed with root as the owner. By default, oracle does not have any privileges to remove or modify anything in the directory, and only limited privs to execute anything. For instance, running crsctl evokes a "not enough privileges" error. Not to mention the slew of root-owned processes (crs, css, emv) that have to be dealt with.

What fun.

On a separate note, we were supposed to have a webinar with our ERP vendor (SunGard Higher Education, or SHE as some say) on the topic of Oracle RAC. *cough cough* I went with the intention of mildly heckling them, but they had technical difficulties with the virtual presentation. Sounds like even putting the letters R-A-C on something is prone to make it break. *grin*

Seriously, though, I know we will not be moving towards RAC any time soon for our production ERP system, and I am very curious to see how other schools manage it. In a morbid sense, I am also curious if they are buying the line from some sales person about how it will help their system, or some form of HA. RAC looks great on paper, but after scratching the surface as I have, it ain't all that pretty underneath. Don't get me wrong, as I mentioned earlier, it does a lot of cool stuff, and it does it well. But there are two sides to that coin, so it would be wise to keep things in perspective.

## Wednesday, June 20, 2007

### ceterum censeo

No, I am not advocating that Carthage must be destroyed. I think someone already beat me to that anyway.

But I did want to encapsulate what I can "bring home" from the RAC experience I had in Chicago. And I am obligated to include a sumptuous feast at Fogo de Chão. Vegetarians may wish to avert their eyes for the rest of this paragraph. I had heard that the salad bar was really good, stocked with fresh veggies, side dishes galore, deli meats, seafood, etc. I glanced at it once, from my chair on the other side of the room, and I saw a lot of colorful food. The rest of my attention was systematically focused on obtaining delicious cuts of meat from the passing "Gaúcho chefs". However, the, umm.... "chefs" were not really "Brazillian cowboys" at all (obviously), but it certainly sounds more impressive than "waiter". And since that is the worst of the cheese, I could live with that. But the meat! Wow! Slabs of hot, aromatic sirloin, lamb, pork, chicken or sausage passed by our table at least once a minute. They ask how you well done you want it, and they carve it off. And this is not your typical beef buffet; these are high quality cuts that are grilled in ways that Outback can only dream of. After a while, one can narrow the choices of what you like and dismiss other selections. The gaucho guys just keep coming back, again and again. I treated my dad for Father's Day; I decided that is the way to do it. =)

So, is that a good segue to bringing home the "meat" of the RAC class?

First and foremost, RAC is not a panacea. Yes, it scores a lot of cool points, but do not let that fool you into thinking it will solve your problems. The Oracle Sales people are pushing it hard because it is expensive, not because of how well it helps you attain your goals. If anything, RAC is probably most optimal for a niche market, customers who have applications that are well designed for parallelism, or at the very least, DML segregation.

After you swallow that pill, most everything else is rosy. One can work around performance issues (ie, a customer wannabe who thinks he is in the niche market) by decreasing or eliminating bottlenecks. Where will our bottlenecks be? That is probably one of the hardest questions to answer at this stage of the game, because a portion of the application is still being developed. Keep in mind that our initial foray into this field will be via a highly visible, yet low-load, online program. So, here are some items I think we should start with, so as to avoid having to worry about them in the future.

1. As much as practical, maximize disk i/o for redo and controlfiles. Put them on independent, fast, highly available disks (perhaps DMX in RAID10).
2. Provide expansion capabilities for the interconnect, either by allowing more (multiplexing) or swapping in a bigger pipe.
3. In regards to the portions of the overall application that we have direct responsibility for, work hard to focus on making the DML either segregated or parallel. Do not merely copy Banner coding methods, which would have horrifying results in the long run.
4. Be generous with the buffer cache
5. We need to decide how we want to move forward with application failover. Is it good enough to implement TAF? Or do we go with the Cadillac of FAN (using ONS)? Personally, I think FAN is like asking a electromechanical physicist to invent an automated device to rapidly accelerate a 2mm thick, 4" x 4" sheet of nylon towards a living Musca domestica. Some people refer to that as a "fly swatter".

In the context of administration, I think our group has to prepare for the coming paradigm shift. We use Grid Control a little, mostly for monitoring a couple databases here and there. That is going to change with RAC, where all the "easy" tools are distributed in a not-so-user-friendly fashion throughout Enterprise Manager. Not only that, but we are going to have to get used the concept of connecting to an instance vs administrating the database. We will have to learn srvctl, crsctl, ocrconfig, crs_stat, .... you get the picture. RAC is not merely a database with two different memory regions running two different sets of background processes; we have Clusterware, OCFS2 and ASM to monkey with. RAC does not increase the number of headaches by a factor of 2. No, it is more ambitious than that. Try a factor of 10.

I wrote in an earlier entry that we should probably take advantage of Resource Manager. Emphasis on probably. As with any and all new technologies, we have to seriously consider our business needs, and determine if the business needs drive the requirements for said technology. I am of the opinion that any "business need" that is fashioned into a generic statement like "we need 5 9's" should be fed into a shredder, burned, dissolved in acid, boiled in plasma and sent on its merry way to Sol. Ergo, ceterum censeo.

With all the documentation we have (whitepapers from Oracle and Dell, reference manuals, purchased books, recommendations from consultants and special interest groups), I am confident that we will be able to deploy RAC using Best Practices. Deploying our application in such a manner is going to be a different story.

## Sunday, June 17, 2007

### RAC class, the last day

What a week. And what a way to wrap it all up. I had to take some time away from it all; the class, blogging, thinking about it... But I do not want to delay too long, because I want to commit this to another media before I start forgetting about it.

Friday, we covered the last 4 chapters in 4 hours. Or more like 3, if you consider we started at 9:15, took a couple breaks and ended at 12:30. There are a couple factors why we were able to rip through them so fast. You will see.

Chapter 9: Clusterware
Due to all the issues we had earlier in the week in our efforts to remove ASM from the OCR, I was looking forward to this chapter. Why the sections that covered various CRS (Cluster Registry Services) commands were a little light, the combination of us having to dive into this stuff blind and having a great teacher like Andy facilitated my understanding of the material. Plus, the chapter goes over a bit of architecture, which I find very conducive to laying the foundation for the "big" picture.

Andy started off by addressing a fairly common problem of automatic (and frequent) instance restarts. Since the most common root problem is slow disk, one needs to introduce slight timing delays, as shown on 9-38. Basically, for each resource in CRS, you can manipulate the AUTO_START (as), RESTART_ATTEMPTS (ra), and UPTIME_THRESHOLD (ut) parameters. The page suggests that you set these to 2,1,7d (respectively) for the instance, ASM and the database.

To help speedup the interconnect, aside from increasing the bandwidth, one can also increase he tcp/ip packet size.

Since the OCR and Voting Disks are so critical, it is suggested that they be placed on isolated physical disks, and to either mirror them on the backend or multiplex them up front. Also, the CRS automatically backs up the OCR (default location = $CRS_HOME/cdata/$CLUSTER_NAME). Curious that the Voting Disks are not also backed up at the same time. Curious also that the book recommends one use symbolic links for the Voting Disk path, which makes it easier to restore a prior version (since the OCR cannot be modified directly, and it stores the path to the Voting Disk.... how else are you going to change it?).

One of the biggest problems with this scheme is that the OCR has to be synchronized with your software. If you have a database that was recently upgraded and you wish to revert to the downgraded version, you have to also restore the relevant point-in-time version of the OCR to match. That sounds like a major headache.

Andy recommends that one make the OCR 200mb, the Voting Disk 20mb.

The rest of the chapter deals with some specifics involving the OCR and CRS, giving brief examples of using the crs_* APIs. Not much to go on, as mentioned earlier, but at least a taste.

Chapter 10: Clusterware and RAC Diagnosis
The VERY FIRST point in this chapter is to make sure that all your nodes are using synchronized time (ie, NTP). Let me say that again. THE VERY FIRST point in this chapter is to make sure that all your nodes are using synchronized time. Why did I repeat that? In our lab, none of the RAC nodes were setup with NTP. This is a RAC class. There is no NTP. What is wrong with this picture? Several students in the class (us included) were unable to complete the labs on the first day because of this problem. And remember, it takes 6 or 8 ostentatious hours to rebuild the machines. So keep that in mind, NTP makes a difference.

The rest of this extremely short chapter (like 10 minutes) focuses on CLUVFY, the "cluster verify" tool. It is very handy, very versatile, and I see lots of RAC experts out there using it in their documentation. Some other highlights from this chapter include a map of the clusterware main log files (yes, Watson, a map!); we are talking about 13 different log locations. Oh the insanity! There is also a perl-based diagnostics collection script which looks like it might actually be quite useful. It is located at \$CRS_HOME/bin/diagcollection.pl.

Chapter 11: Node Addition and Removal
I was looking forward to this chapter based on the problems we had with removing ASM. Surely, I thought, we were going to cover all those pesky CRS and OCR API commands we attempted to use. Ironically, Andy spent less time on this chapter than chapter 10 due to the fact that the complete chapter is obsoleted by Grid Control Release 3 (the book was written for Release 2). In a way, I was sorta glad; the chapter is simply full of screenshots, which have poor presentation quality if you are marching through them during a lecture. Bores me to death.

The one thing that Andy did say about cleanup operations that I wanted to pass along was that adding a node adds about 250mb to SYSAUX, and removing a node subtracts the same amount. So if you have a 16-node cluster, keep in mind that your SYSAUX is going to be over 4gb in size.

Chapter 12: High Availability
Finally the last chapter. Unfortunately, in an effort to breeze through this chapter, Andy reverted to reading a lot of slides which I find particularly unhelpful (because I can read them myself, thank you very much). Additionally, the whole chapter is a summary of Oracle's published Maximum Availability Architecture. But on with my notes.

As noted in my posts from IOUG, 11g will feature rolling upgrades. One of the big topics being pushed is that you will be able to upgrade from 10gR2 to 11g without downtime. I am sure there are strings attached, but we will have to wait and see. 11g is supposed to "unveiled" this coming July 11th.

The mindset one must have when developing against a RAC is to not lock tables. Obviously, there are times when you are required to lock a table, but care must be taken to do so as infrequently and as quickly as possible. For the rare occasion when you have a really long row (lots of columns, large datatypes), here are some helpful hints:
- Keep frequently used columns at the front of the table definition
- Always define precision and scale

One question I got to thinking of is if we add a node to our Chicago RAC, will the Urbana failover cluster also get a new node. It should. It better! *grin*

The database will use either standard REDO or standby REDO, but never both at the same time. Hence, it is suggested to define both, and on the same disk.

Along the lines of establishing redo logfiles, the limiting parameter (MAXLOGMEMBERS) used to specify a very hard limit. In fact, you could not change the parameter without recreating the controlfile. This has been changed in R2 (or so I am told) and the following parameters are effectively dynamic (what does that mean?):
- MAXLOGFILES
- MAXLOGMEMBERS
- MAXINSTANCES

Also, if you plan to stay away from an RMAN catalog, it would be wise to bump up the MAXLOGHISTORY to 10,000.

Andy pointed out an interesting revelation if you are successfully able to segregate all DML away from query operations; simply point your DML applications to the primary and redirect query users to a logical standby with all the required indexes. A big bonus for dividing the workload and use of the system.

Fast-Start Failover is also covered in the chapter, but Andy whipped through it with the comment "It DOES NOT work!"

In summary of the white papers and publishes Best Practices, page 12-23 has a chart. In light of everything we talked about this week, Andy made some corrections. For instance, in direct opposition to one of the suggestions, "DO NOT use Flashback Database!". Also, do not autotune checkpointing, do it manually.

I am going to close this post here, but I do hope to collate and organize my thoughts on what to "bring home" as we start our own RAC project. Definitely the Best Practices point out some good things to look into. My biggest concern is that we will not have a load to test with, hence some of the finer aspects of Performance Tuning are going to be hidden and we will have to go with what others have said.

Time for some real work. *grin*