Friday, June 17, 2011

Bug 11858963: optimization goes wrong with FIRST_ROWS_K (11g)?

At the beginning of March, I noticed some very odd things in a 10053 trace of a problem query I was working on. I also made some comments on Kerry Osborn's blog related to this matter. Oracle Support turned this into a new bug (11858963), unfortunately an aberration of Fix 4887636. I was told that this bug will not be fixed in 11gR1 (as 11.1.0.7 is the terminal release), but it will be included in future 11gR2 patches.

If you have access to SRs, you can follow the history in SR 3-314198695. For those that cannot, here is a short summary.

We had a query that suffered severe performance degradation after upgrading from 10.2.0.4 to 11.1.0.7. I attempted to use SQLT but initially run into problems with the different versions of SQLT, so I did the next best thing and looked at the 10053 traces directly. After a bit of digging, I noticed several cases where the estimated cardinality was completely off. For example:


First K Rows: non adjusted N = 1916086.00, sq fil. factor = 1.000000
First K Rows: K = 10.00, N = 1916086.00
First K Rows: old pf = 0.1443463, new pf = 0.0000052
Access path analysis for FRRGRNL
***************************************
SINGLE TABLE ACCESS PATH (First K Rows)
Single Table Cardinality Estimation for FRRGRNL[FRRGRNL] 
Table: FRRGRNL Alias: FRRGRNL
Card: Original: 10.000000 Rounded: 10 Computed: 10.00 Non Adjusted: 10.00



So, the idea behind FIRST_ROWS_K is that you want the entire query to be optimized (Jonathan Lewis would spell it with an "s") for the retrieval of the first K rows. Makes sense, sounds like a good idea. The problem I had with this initial finding is that every single rowsource was being reduced to having a cardinality of K. That is just wrong. Why is it wrong? Let's say you have a table with, um, 1916086 rows. Would you want the optimizer to pretend it has 10 rows and make it the driver of a Nested Loop? Not me. Or likewise, would you want the optimizer to think "Hey, look at that, 10 rows, I'll use an index lookup". Why would you want FIRST_ROWS_K to completely obliterate ALL your cardinalities?

I realize I am exposing some of my naivete above. Mauro, my Support Analyst corrected some of my false thinking with the following statement:

The tables are scaled under First K Rows during the different calculations (before the final join order is identified) but I cannot explain any further how / when / why.
Keep in mind that the CBO tweak -> cost -> decide (CBQT is an example)
Unfortunately we cannot discuss of the CBO algorithms / behaviora in more details, they are internal materials.
Regarding the plans yes, they are different, the "bad plan" is generated with FIRST_ROWS_10 in 11g

The "good" plan is generated in 10.2.0.4 (no matter which optimizer_mode you specify, FIRST_ROWS_10 is ignored because of the limitation) or in 11g when you disable 4887636 (that basically reverts the optimizer_mode to ALL_ROWS).
Basically the good plan has never been generated under FIRST_ROWS_10 since because of 4887636 FIRST_ROWS_10 has never been used before



I still need to wrap my head around "the limitation" in 10.2.0.4 and how we never used FIRST_ROWS_K for this particular query, but I believe that is exactly what Fix 4887636 was supposed to be addressing.

Here are some of the technical details from Bug 1185896:

]]potential performance degradation in fkr mode
]]with fix to bug4887636 enabled, if top query block
]]has single row aggregation
REDISCOVERY INFORMATION:
fkr mode, top query block contains blocking construct (i.e, single row aggregation). Plan improves with 4887636 turned off
WORKAROUND:
_fix_control='4887636:off'

I assume fkr mode is FIRST_ROWS_K, shortened to F(irst)KR(ows). The term "blocking construct" is most interesting - why would a single row aggregation be labeled as a "block construct"?

Also, this was my first introduction to turning a specific fix off. That in itself is kinda cool.

2 comments:

SQLTXPLAIN said...

Whenever you run into a SQL performance issue after an upgrade, and if you get back the pre-upgrade performance using OFE, then you may want to run your issue through SQLT XPLORE. This is one of the advanced modules in SQLT that basically toggles all CBO parameters including Fix Control and records all plans discovered during this process. You may find there you pre-upgrade plan together with specific parameters or fix control that produce the same plan (in place of OFE).

Charles Schultz said...

Thanks, that is awesome! I am going to have to learn more about XPLORE and XGRAM, these seem like very interesting tools.