Monday, November 12, 2007

Day 3: Understanding and Assisting the CBO

Session 1: Basic Cost Arithmetic
Anyone that has read any of Jonathan's previous works will recognize the information surrounding the costing formulas; he also gives generous credit to Wolfgang Breitling and Alberto Dell'era. What is really great is that all three of these guys are very generous in sharing of their knowledge and experience, wishing to enhance the community by helping others. Just amazing!

One of the first things Jonathan went over was the environment; there are a lot of factors that play a part in calculating costs of various different pieces of Oracle. The system stats are very important in that they determine cpu speed and io latency, which in turn determine how fast basic Oracle operations occur. Another important number is multiblock read count. I found it very interesting that the hidden underscore parameter _db_file_exec_read_count defaults to db_cache_size/processes (if db_file_multiblock_read_count is not set). Processes; so let's say you set the number of processes really high "just because". You can see that your exec read count will be quite small, and for no good reason.

Jonathan also talked about the importance of sizing the memory parameters appropriately. I think the general impression is that you do not want to gimp your system by being too frugal with memory. Obviously, this will affect large operations (sorts, hashes, etc) more than anything else, but those large operations can get real nasty real fast (think multipass for hashes). Two underscore parameters that Jonathan highlighted were _pga_max_size and _smm_max_size (unfortunately, there were not many details on what these two did, or I missed them).

He made a very interesting statement in that truncating a table does not reset the stats. That sounded very strange to me. Is that true in 10g?? If Oracle goes to the trouble to reset the highwater mark, why would the stats not be updated? They merely have to be zeroed out.

We spent a good chunk of time on clustering. Not just in this session, but in others as well. There is quite a serious flaw in how clustering is calculated for data that is scattered. Consider the case where row(n) is in block(mod(n,2)), or in other words, all the odd rows are in block 1, even rows in block 0. To determine clustering, Oracle will walk the data and count each time the block id changes. Since the id changes for each row, Oracle will calculate the clustering factor really really small, when in fact, the clustering is actually pretty good (total of two blocks, half your data is in one block or the other). A low cluster factor translate into a high IO cost. An articially high IO cost may lead to a sub-optimal plan (where a good plan that has a false high IO cost is ignored in favor of a lower-cost other plan).

This also prompted me to learn more about clustering factor.

Related to that, I was convicted several times of a need to identify what our "significant" or "important" data is. That is probably the number one question I returned to again and again. What is the Banner "business important" data? There are several dictionary views and object statistics (including predicate statistics) that I want to investigate further, but I actually do have an overly optimistic hope that our vendor has some idea.

There are a couple other flaws in IO costing that Jonathan pointed out.
Flaw #1: Oracle assumes that index block reads are from disk, not cache. Always.
Flaw #2: Oracle assumes index single-block read is the same cost as table multi-block read. Always.

TO address Flaw #1, we have optimizer_index_cache, which tells us, on average, how many of our index blocks are in cache (expressed as a percentage). Jonathan stressed that this is only relevant for index root and branch blocks, as leaf blocks are not cached. I have a hard time believing leaf blocks are not cached, and that is something else I would want to look into at some point. Perhaps I merely misunderstood him.

For Flaw #2, we have optimizer_index_cost_adj which tells us what percentage of a multiblock read is an index read (probably somewhere in the vicinity of 33%).

However, for both issues, Jonathan suggests that neither be set if (a BIG IF) system stats have been collected and are correct.

Jonathan pointed out a curious rounding error introduced with optimizer_index_cost_adj; the calculated cost will be rounded down, which can potentially lead to the wrong index being chosen. Event 10183 can be used to turn off cost rounding.

On the topic of multiblock reads, Jonathan showed us a chart demonstrating how Oracle scales the parameter away from excessively high numbers. For a value of 4, the adjusted value may be 4.175. But for higher values, say 32, the adjusted value might be 16.407. 128, 40.85. Due to system stats, Jonathan is recommending that most people might want to turn off db_file_multiblock_read_count.

In light of the importance of system stats, it would be good for us to review sys.aux_stats$, and or dbms_stats.get_system_stats().

Since clustering_factor has no meaning with bitmap indexes (think about it), it is overloaded to count the number of entries in the bitmap; "some key values will have multiple rows".

Even though there was only one slide and a short blurb about this, I was struck by the last point of the session. Using first_rows_n optimizer modes is really going to play havoc with explain plans. Costs and execution paths are essentially squeezed to fit into the first row target, much like doing a "where rownum <= n". This fact alone makes it very difficult to interpret exactly what the explain plan is attempting to convey.

Session 2: Join Mechanisms
Understanding Joins is important because Oracle attempts to convert everything into a two table join, including subqueries. This is not necessarily bad, it is just the way Oracle does it.

Jonathan's slides included a few examples of correlated and non-correlated subqueries. Jonathan's observation is that most "bad" join decisions are made because of cardinality being off, be it off by one or off by millions. Being off by one can go a long way. Which, in my mind, makes Wolfgang's Cardinality Feedback all the more important. Also, "off by one" can be really critical when you consider that perhaps the stats are just a tad old and new rows have come into the table, or have been deleted. That could be the difference between a "good" join and a "bad" join. Scary, eh?

There are a number of slides that talk about Nested Loop joins, but I am going to skip over to Merge joins. Actually, skipping all the way to One Pass sorts. There is an Urban Legend that all in-memory sorts are faster than disk sorts. Jonathan prooved this is not always the case. Up until, Oracle used binary insertion trees to maintain a map for merges; the bigger the memory, the larger this tree could potentially become (smaller memory would force a flush to disk). Because of such large sizes of binary trees, the CPU requirements shot up exponentially to maintain the index. Hence, with those versions, more memory actually made those merge operations far worse.

Jonathan had a caution about 9i 10053 trace events; the "Max Area Size" listing is completely wrong and useless, for it ignores the Workarea_size_policy parameter.

And more caveats. When Oracle executes asynchronous writes, the wait time is not recorded (because it is asynchronous). You have to keep an eye on direct_path_temp to see if the numbers are stacking up.

He introduced two trace events, 10032 and 10033. The former dumps statistics about sorts, the second traces IO for sorts to disk. Since file sizes are small for event 10033, it is not impractical to set it at the system level when diagnosing a sort issue.

Multipass sorts are extremely bad. Avoid them!! Setting sort_area_size large enough to avoid multipass sorts is highly recommended. He quotes Steve Adam's as saying you should be able to get by with a sort_area_size of 45mb to sort 12gb.

Hash joins are a bit interesting. Among other things, Jonathan talked about how the workarea_size_policy comes into play. If set to manual, each hash queue will get hash_area_size/2 memory. What happens if you have more than 2 queues? That's right, you have queues whose sum of memory exceeds hash_area_size. The worst part is that each queue gets this memory regardless if it is needed or not. Say you want to hash 2k of data. You still get hash_area_size/2 for each queue. Wastage!

Setting workarea_size_policy to automatic allows the queues to only grab what they need.

Jonathan also spend some time on Trace Event 10104, "One of the most useful traces in the Oracle suite". It gives a comprehensive break down of hashes, and can be used with Trace 10046 for even greater detail. For multipasses, look for terms like how much memory is available (Memory for slots, in bytes) vs Estimated Build size. Also "Total number of partitions" vs "Number of partitions that fit in memory". That might be your first clue. There is also a stat for number of rows iterated; the fewer iterations the better.

In the end, Hashes may scale more linearly, if AND ONLY IF you do not hit an exception, and there are a number of those. Nested Loops grow faster, but tend to be smoother especially if the index/data is buffered.

Session 3: Selectivity and Hints
The rules governing selectivity start off relatively basic (although still a little bit of math involved), but quickly get complicated in my opinion. The boundary conditions really throw you. For join selectivity, one thing to be aware of is that Oracle assumes predicate independence. Jonathan has bantered this in his blog (where I first read about it). For example, consider two columns, 'month' and 'number_of_days' which tells the number of days in the month. If you use "where month between 10 and 12 and number_of_days = 30", Oracle assumes that any of those months can have 30 days. We know that is not true.

Jonathan had a great example calculating cardinality in a sample query; I'll not repeat it here for the time being.

One gotcha to remember (among many) is that when the optimizer generates numbers for you, it is not necessarily against the sql query you provided. In fact, you might as well assume it is NOT the same query; the numbers are for the interally optimized plan, of which only snippets are available to you via the 10053 trace.

We also covered transitive closure. Jonathan has a bit to say in his book and blog, and the bottmline is that sometimes Oracle will remove a predicate that it thinks is redundant (ie, a = b, b = c, therefore a = c). Lucky you. This can often lead to Cartesian joins, or alternative access paths being used (remember, generally speaking, access paths should have higher priority than filter predicates).

There are also a number of sanity checks to be aware of. One that Jonathan demonstrated is a case where Oracle will not choose ndv values from opposite tables when doing a multi-column join. I am still reading those slides, as it is taking me a little while to comprehend the ramifications. However, one of the issues seems to be that merely making a cosmetic change to your code opens it up to the possibility of hitting one of these sanity checks, and has the potential to throw the optimizer a huge wrench. Just remember that there are no certainties.

Jonathan also had some very strong words about hints. In fact, he even goes so far to say that "A 'hint' is a directive that the optimizer MUST accept and act upon." Now, as Jonathan pointed out, sometimes in an act of obedience, Oracle will ignore the hint. *grin* Yes, sounds contradictory. If you give a full(t) and index(t) hint, by "obedience" Oracle will consider the plans that both hints dictate (yes, BOTH hints), and will give you the cheapest plan. That is the key point. Oracle uses a Cost-Based Optimizer, and will always generate numbers to determine the cheapest plan. If you use hints, you want to use enough hints to trim out all other plans that the optimizer might possibly choose. The 10053 trace is handy in figuring that out.

He gave two examples to showcase this. One with a noparallel hint on the table; Oracle found a cheap plan using parallel on the index (think of a naughty two-year old). In another case, say you specify parallel on a table, but the plan is serial. That is because the serial plan is cheaper.

In another example, Jonathan shows the use_nl hint, and specifies two tables /*+ use_nl(t1 t2) */. This does *NOT* mean nest loop t1 and t2 with t1 as the first table. No no no. It means /*+ use_nl(t1) use_nl(t2) */. Yes indeed. So it tells Oracle to use a nested loop on t1 and a nested loop on t2, but it does not specify an order. In the example it seems to work because Oracle only came up with two plans, and the one we expected just happened to be the cheaper one. "Sometimes you get lucky". Sometimes, even though you did something wrong, you got the results you were expecting to see. Be careful about that.

As always, be wary of "exceptions". There is a hint (and underscore parameter) that specifically says "ignore hints." Go scratch your head over that one.

The ORDERED hint is funny in that Oracle seems to apply it in the end game of optimization. Jonathan specifically gave us a Hash example in which the Ordered hint was obeyed, but the plan was in the oppositive order. Apparently. Watch out for subqueries because remember that Oracle likes to unnest them, and might screw up your seemingly simple order. In 10g, we can use the LEADING hint instead. Still have to watch out for query block names that are dynamically generated due to optimizations; it is possible to specify non-pre-existing query blocks if you know how Oracle is going to name them. If you do not, your order may be a little different.

10g also makes it easier to logical name the hint operands. You can say "index(t1(id1))" or "index(@sub2 t4@sub2(" Obviously, those are not physical names, but labels. Jonathan has observed that it is easier to provide a negative hint than a positive one.

Bottom line, in order to use hints well:
  • Set the join order
  • Join method for n-1 tables
  • Access path for every table
  • Average two hints per table to do it well

Session 4: Telling the Truth
Last day, last session. =) Need I say more?

Some interesting things about this session. Jonathan gets into some strange anomalies and how they are avoided if you give more, seemingly irrelevant, information. For instance, defining columns (esp. mandatory columns) as "not null" helps the optimizer tremendously when using not null predicates.

The stats play a big role in "telling the truth". Old stats are just as bad as bad stats. However, Oracle does provide a way to synthesize stats if needed; I found this whole concept quite intriguing when reading his book and papers earlier. And it comes back to what is your "interesting" data? What data does the application/business like? Oracle has several rules for dealing with "normal" data, but it is entirely possible that you are interested in anything but. Which makes it important to identify the "important" data, and take advantage of the tools that Oracle has given us to help convery that to the optimizer. Stats, and possibly histograms.

Jonathan went back to the example for clustering factor. It is entirely possible to have data that Oracle thinks is scattered, when in fact it is merely grouped weird. That is a case of "lying" to the optimizer.

For the most part, the default stats work. Oracle CBO assumes a normal, boring distribution of data, so if that is the case, you are all set to go. When the data is grouped into "weird, wacky and stretchy bits", the stats and histograms may have holes. Sometimes is still does a fairly good job to compensate, and sometimes you just get unlucky. =)

He had an interesting blurb for histograms in OLTP. He is suggested that instead of letting the normal stats collect histograms, the front-end should be aware of "important" data, and code for it appropriately. Can you imagine Banner doing that? Neither can I.

I will have to wrap it up with that. I will try to over these notes from the Days with Jonathan Lewis, but I need a break.


mdinh said...

Thanks for the blog!

mathewbutler said...

Comprehensive as the earlier notes. A bit long for a blog post. Though I'm not complaining - the best info on these three days that I've found on the net.


Charles Schultz said...

Thanks, Matt. Yes, they are a bit long and ramble, but I could not see a better way to 1) convey what I learned, 2) preserve some of my thoughts for future reference. Plus, I am not a professional writer, so my sentence structures are sometimes choppy. But I figure those who do not like it will merely go away. *grin*

Anonymous said...

Hi Charles ,

Did Jonathan mention the platform in which he develop the tests? Im my Windows XP laptop and Oracle I got the next results:

SQL> alter system set db_file_multiblock_read_count=0;

SQL> select indx from x$ksppi where ksppinm ='_db_file_exec_read_count';


SQL> select KSPPSTVL from x$ksppcv where indx=589;



SQL> select current_size from v$buffer_pool;


SQL> show parameter processes

------------------------------------ ----------- ---------------------
processes integer 150

Is it a bug in Windows?


Charles Schultz said...

I apologize for the delay. I am not exactly sure how to answer that. I will first admit ignorance about that parameter. However, even on my own Solaris platform, I never see the formula working out so neat. Would be a good question to pass along to Jonathan. =)

BTW, I believe he uses a windows platform when he does his classes. Not sure if that is his "main" platform or not. And this would obviously be before even came out.