Monday, February 16, 2009

A performance tuning example using 10046, SQLTXPLAIN and 10053

In my tribute to Carlos Sierra, I implied (well, rather, I just said) that I had more to write. Well, here it is.

I was alerted to a "slow performing" job in our ERP (the one out in Massachusetts). My first response is to turn on a 10046 trace; 99% of the time this is extremely effective. The other 1% of the time, I either hit that strange issue where the event itself changes the query plan (I hate that!) or the job does too much work client-side for the trace to be of much use ("Yeah, umm... you only used .01 seconds of CPU time for your 5-hour job").

I have a script I use to create a trigger that turns on event 10046 (and others). If Blogger had a better way to post links, I would put it up. But the guts are pretty easy to figure out - I accept parameters, then basically build a logon trigger whose body does the following:
if (USER = '$user') then
  -- make sure the trace file is not truncated
  execute immediate 'alter session set max_dump_file_size = UNLIMITED';
  -- tag the trace file name so it is easy to spot
  execute immediate 'alter session set tracefile_identifier = ''$user''';
  -- execute immediate 'alter session set statistics_level = all';
  -- $trace and $level are the script parameters (e.g. 10046 and 12)
  execute immediate 'alter session set events = ''$trace trace name context forever, level $level''';
end if;

If the session is already connected (hmm... that was redundant), I like to do the following:
set feed off verify off
set lines 112
select 'EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => '||sid||', serial_num => '||serial#||', waits => TRUE, binds => TRUE);' "Turn Trace on for user" from v$session where username = '&&username';
select 'EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => '||sid||', serial_num => '||serial#||');' "Turn trace off for user" from v$session where username = '&&username';
set feed on verify on

I have never seen a really good reason to use only level 4 (bind values) or only level 8 (wait events); I would use level 1 if all I want is the SQL, but for the most part, level 12 (binds and waits) all the way, baby!

Just to see what I can see, I run the resulting trace file through tkprof with sort options (sort=prsela,exeela,fchela - elapsed time for parse, execute and fetch), which puts the queries with the highest run times at the top of the tkprof output file. Lo and behold, I get a few solid candidates with this particular example.

SSVSECT order by ssbsect_term_code, ssbsect_subj_code, ssbsect_crse_numb,

call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.27       0.27          0          0          0          0
Execute      1      0.00       0.00          0          0          0          0
Fetch        1    996.36    1970.42         88     956167     176632          0
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total        3    996.63    1970.70         88     956167     176632          0

I was told that this usually takes closer to 5 minutes (not 33 minutes). Stranger yet, every run of this query with valid predicates came back significantly faster (sub-second), which tells me that predicates are being pushed into the view (a good thing).
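To illustrate what I mean by predicate pushing, here is a simplified, hypothetical stand-in for the view (the real SSVSECT has joins and VPD on top):

```sql
-- Hypothetical, simplified version of the view:
CREATE OR REPLACE VIEW ssvsect AS
  SELECT s.*
    FROM ssbsect s;

-- Given a selective predicate on the view...
SELECT * FROM ssvsect WHERE ssbsect_term_code = '200910';

-- ...the optimizer can push it into the view and optimize it as if
-- the predicate had been written against the base table directly:
SELECT * FROM ssbsect s WHERE s.ssbsect_term_code = '200910';
```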

Just as a litmus test, I ran this query with a privileged account (SYS), and noticed that the query plan was indeed much better - so at least I knew there was a better plan. And then I realized that we use FGAC policies against the underlying tables to enforce VPD. Great. Tuning FGAC is no fun at all. Because I like to have the complete query in front of me, I got the view source (formatted in TOAD), then ran the query with the original user id and event 10730 turned on (captures VPD predicates). I was then able to reconstruct the query that finally got passed into the parser.
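For reference, this is how I turn the predicate capture on and off for a session; event 10730 dumps the predicates returned by the VPD policy functions to the trace file:

```sql
-- dump VPD/FGAC policy predicates to the trace file
alter session set events '10730 trace name context forever, level 1';

-- ... run the problem query here ...

alter session set events '10730 trace name context off';
```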

Trying to get the original query to use a better plan, I started playing around with hints and statistics (in independent tests). The hints are a pain because, as Jonathan Lewis will often point out, you want to use a ton of 'em (not just one or two). Which got me to thinking about the 10053 trace and how it gives you the hints to reconstruct a query plan in a nice format (easier than reading other_xml out of v$sql_plan). I also stumbled upon some new index statistics that gave me the "better" query plan. As I attempted to explain which statistics did the trick, I turned to SQLTXPLAIN to compare the queries for me, and it did a wonderful job (as stated in my previous post)! It helped me narrow the problem down to the fact that the combination of more rows and fewer leaf blocks had altered the clustering factor enough to be noticeable.
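For the curious, the sql_id below is just a placeholder. The 10053 event traces the optimizer's costing decisions for the next hard parse, and DBMS_XPLAN can pull the full outline hint set out of the cursor cache without wading through other_xml by hand:

```sql
-- trace the optimizer's decisions for the next hard parse
alter session set events '10053 trace name context forever, level 1';

-- ... parse/execute the statement here ...

alter session set events '10053 trace name context off';

-- or pull the complete set of outline hints from the cursor cache
select * from table(
  dbms_xplan.display_cursor('&sql_id', 0, 'ADVANCED'));
```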

"Now wait", you say, "how do you get more rows and fewer leaf blocks?!?" Easy. If you rebuild your index and do not collect stats for a few months on a busy table... voila! *evil grin* Ok, I need to include the fact that we recently went through a character set conversion which required us to completely rebuild the database; at the time, we also decided to import all the old stats, which is much faster than generating new ones. Unfortunately, we overlooked the fact that indexes generate fresh stats when you build them, so importing old stats over them is a bad idea (in general) for indexes.
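The fix itself is simple enough - gather fresh statistics on the rebuilt index instead of importing the stale ones (the owner below is a placeholder, not from the trace):

```sql
begin
  dbms_stats.gather_index_stats(
    ownname => '&index_owner',   -- placeholder: schema that owns the index
    indname => 'PK_SSBSECT');
end;
/
```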

So, we have an altered clustering factor. How did that affect the query plan to go from 5 minutes to 33 minutes? To be honest, I do not have all the answers yet. I am still researching the "why" behind this. What I have gathered so far is that the index selectivities have changed dramatically (these came out of the 10053 trace):

OLD/SLOW - uses PK index
Index: PK_SSBSECT Col#: 53 1 2
LVLS: 2 #LB: 4104 #DK: 649804 LB/K: 1.00 DB/K: 1.00 CLUF: 30917.00

Access Path: index (IndexOnly)
resc_io: 3.00 resc_cpu: 21564
ix_sel: 7.6946e-07 ix_sel_with_filters: 7.6946e-07
Cost: 3.01 Resp: 3.01 Degree: 1

NEW/FAST - uses table scan instead
Index: PK_SSBSECT Col#: 53 1 2
LVLS: 2 #LB: 3072 #DK: 689157 LB/K: 1.00 DB/K: 1.00 CLUF: 34646.00

Access Path: index (IndexOnly)
resc_io: 2294.00 resc_cpu: 119163783
ix_sel: 0.74604 ix_sel_with_filters: 0.74604
Cost: 2326.68 Resp: 2326.68 Degree: 1

So, CLUF increased by 12%, LB decreased by a whopping 25%, and DK went up by 5.7%. So how did I end up with an ix_sel that differs by six orders of magnitude? I mean, I see why a tablescan was preferred! =) But what the heck?
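As a back-of-the-envelope check, the often-quoted approximation for an index-only access path (cost = blevel + ceil(ix_sel * leaf_blocks)) reproduces both resc_io figures above exactly, which at least confirms that the swing in cost is driven almost entirely by ix_sel, not by the leaf block or clustering factor changes:

```sql
select 2 + ceil(7.6946e-07 * 4104) as old_resc_io,  -- 3
       2 + ceil(0.74604   * 3072) as new_resc_io    -- 2294
  from dual;
```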

Did I mention I am eagerly awaiting that 10053 parser? =)

1 comment:

Charles Schultz said...

Oracle Support says:
Bug 6503543 - Suboptimal plan due to bad cardinality estimate for RANGE join predicates

However, I am not fully convinced. The query is executing much faster *because* of the bug. =)