Thursday, February 19, 2009

Complaint about Enterprise Manager (version 10g)

First off, let me say that EM has some really cool stuff going on! This post will be largely negative and critical, but I do not want to discount the fact that the developers of this tool have done a lot of hard work to make some things very neat and slick. Real-time data for performance monitoring at various levels is quite nice. While the interface is kooky (very!), I have to give credit for the massive conglomeration of features that are packed into this "free" tool. And most of them work to boot! *grin*

Ok, so my first gripe is just administrating the EM itself. The logfiles are a complete mess. There is no one log file that summarizes all the activity. If you get an error in the GUI, it will most likely say "see log file for further details". Which log file? Granted, most of them are in one place (depending on if you are using DBControl or GC), but not all of them. Worse, it is very common to find a logfile that repeats what you saw on the GUI and it refers to itself for further details! How is that helpful? I want one file that can either tell me where exactly the problem is, or at least point to the file where the real further details are. Same for the config files. There are xml fiels sprinkled throughout, and .properties. Since EM is comprised of so many different components, there are many different log files and many different configuration files. But nothing ties them all together. Say I want to alter the ports of the components (I have tried this, and it is not pretty). I had to touch about 5 or 6 different files, each with multiple entries for various port numbers. And then there are those ports that are defined by global variables, with no indication where those are set. If one wanted a headache, a lemon-encased gold brick to the skull would be more appealing.

What next.... The GUI itself is extremely cluttered and the opposite of intuitive. I understand that some tools take a while to learn how to navigate (ie, TOAD - still catching on), but there are some functions I use quite often and I always find myself in the wrong place. Here is a tricky example, how about "Historical SQL". Do you mean the sql that EM has generated on your behalf, or AWR SQL? What about "features" that do not work quite well. We have tried Data Masking and Change Capture, both of which hung and finally died without any visible productive work. And trying to debug that?? See above paragraph. =) Or how about that infamous "remove datafile" option in the Tablespace screen? Cool, I did not know you could remove a datafile from a tablespace, this must be new. Yeah, try it and get back to me on that. I have had the opportunity to Queisce my database by accident; if you flip through the Maintenance Window management screens (just click ok or next), you can freeze your database because one of the options to specify the type of the current maintenance window defaults to Quiesce, even if that is not the current type. Handy, eh?

I am curious what the 11g EM will be like. I briefly looked at the 11g Database Control, but not enough to make any kind of judgement. Would it be foolish to even hope that the administration side of things has been made easier?

'nuff for now.

Monday, February 16, 2009

A tuning performance example using 10046, SQLTXPLAIN and 10053

In my tribute to Carlos Sierra, I implied (well, rather, I just said) that I had more to write. Well, here it is.

I was alerted to a "slow performing" job in our ERP (the one out in Massachusetts). My first response is to turn on a 10046 trace; 99% of the time, this is extremely effective. The other 1% I either hit that strange issue where the event itself will change the query plan (I hate that!) or the job does too much stuff client-side for me to be of much use ("Yeah, umm... you only used .01 seconds of CPU time for your 5-hour job").

I have a script I use to create a trigger that turns on event 10046 (and others). If blogger had a better way to post links, I would put it up. But the guts are pretty easy to figure out - I accept parameters, then basically build a query that does the following:
if (USER = '$user') then
execute immediate 'alter session set max_dump_file_size = UNLIMITED';
execute immediate 'alter session set tracefile_identifier = ''$user''';
-- execute immediate 'alter session set statistics_level = all';
execute immediate 'alter session set events = ''$trace trace name context forever, level $level''';
end if;

If the session is already connected (hmm... that was redundant), I like to do the following:
set feed off verify off
set lines 112
select 'EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => '||sid||', serial_num => '||serial#||', waits => TRUE, binds => TRUE);' "Turn Trace on for user" from v$session where username = '&&username';
select 'EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => '||sid||', serial_num => '||serial#||');' "Turn trace off for user" from v$session where username = '&&username';
set feed on verify on

I have never seen a really good reason to use only level 4 or only level 8; I would use only level 1 if all I want is the SQL, but for the most part, level 12 all the way, baby!

Just to see what I can see, I run the resulting trace file through tkprof with sort options (sort=prsela,exeela,fchela), which puts the queries with the highest run times at the top of the tkprof output file. Lo and behold, I get a few solid candidates with this particular example.

SSVSECT order by ssbsect_term_code, ssbsect_subj_code, ssbsect_crse_numb,

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.27 0.27 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 996.36 1970.42 88 956167 176632 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 996.63 1970.70 88 956167 176632 0

I was told that this usually takes closer to 5 minutes (not 33 minutes). Stranger yet, all examples of this query with valid predicates trimmed the run-time down significantly (sub-second), which tells me that predicates are being pushed into the view (a good thing).

Just as a litmus test, I ran this query with a privileged account (SYS), and noticed that the query plan was indeed much better - so at least I knew there was a better plan. And then I realized that we use FGAC policies against the underlying tables to enforce VPD. Great. Tuning FGAC is no fun at all. Because I like to have the complete query in front of me, I got the view source (formatted in TOAD), then ran the query with the original user id and event 10730 turned on (captures VPD predicates). I was then able to reconstruct the query that finally got passed into the parser.

Trying to get the original query to use a better plan, I started playing around with hints and statistics (in independent tests). The hints are a pain because, as Jonathan Lewis will often point out, you want to use a ton of 'em (not just one or two). Which got me to thinking about the 10053 trace and how that gives you the hints to reconstruct a query plan in a nice format (easier than reading other_xml out of v$sql_plan). I also stumbled upon some new index statistics that gave me the "better" query plan. As I attempted to explain which statistics did the trick, I turned to SQLXPLAIN to compare the queries for me, and it did a wonderful job (as stated in previous post)! It helped me to narrow down the problem to the fact that the the combination of more rows and fewer leaf blocks had altered the clustering factor enough to be noticeable.

"Now wait", you say, "how do you get more rows and fewer leaf blocks?!?" Easy. If you rebuild your index and do not collect stats for a few months on a busy table... viola! *evil grin* Ok, I need to include the fact that we recently went through a character set conversion which required us to completely rebuild the database; at the time, we also decided to import all the old stats, which is much faster than generating new ones. Unfortunately, we overlooked the fact that indexes will generate new stats when you build them, so importing old stats is a bad idea (in general) for indexes.

So, we have an altered clustering factor. How did that affect the query plan to go from 5 minutes to 33 minutes? To be honest, I do not have all the answers yet. I am still researching the "why" behind this. What I have gathered so far is that the index selectivities have changed dramatically (these came out of the 10053 trace):

OLD/SLOW - uses PK index
Index: PK_SSBSECT Col#: 53 1 2
LVLS: 2 #LB: 4104 #DK: 649804 LB/K: 1.00 DB/K: 1.00 CLUF: 30917.00

Access Path: index (IndexOnly)
resc_io: 3.00 resc_cpu: 21564
ix_sel: 7.6946e-07 ix_sel_with_filters: 7.6946e-07
Cost: 3.01 Resp: 3.01 Degree: 1

NEW/FAST - use table scan instead
Index: PK_SSBSECT Col#: 53 1 2
LVLS: 2 #LB: 3072 #DK: 689157 LB/K: 1.00 DB/K: 1.00 CLUF: 34646.00

Access Path: index (IndexOnly)
resc_io: 2294.00 resc_cpu: 119163783
ix_sel: 0.74604 ix_sel_with_filters: 0.74604
Cost: 2326.68 Resp: 2326.68 Degree: 1

So, CLUF increased by 12% and LB decreased by a whopping 25% and DK up by 5.7%. So how did I end up with a ix_sel that was different by 10^6? I mean, I see why a tablescan was preferred! =) But what the heck?

Did I mention I am eagerly awaiting for that 10053 parser? =)

Friday, February 13, 2009

Kudos to Carlos Sierra

I have been playing more and more with Oracle's SQLTXPLAIN; or rather, let me give the full credit to Carlos Sierra, who has written and fully supports this nifty little gem.

Jonathan Lewis get me hooked on dbms_xplan, and I like it a lot! Just the other day I picked up +PEEKED_BINDS as an option from various posters (Tanel Poder, Greg Rahn, Jonathan Lewis) and started using it right away. And I never really tackled SQLTXPLAIN just because.... I had used the earlier version (TRCANLYZR...or something). Until recently.

Carlos has really done a bit of quality work on this piece; he has provided excellent documentation and some good examples. He has shown a wonderful amount of foresight in generating various traces (10053, 10046) and packages functionality like generating testcases, comprehensive web reports and sql plan comparisons. The last of those I had an opportunity to explore today, and it helped me find a problem with some indexes (that will be a seperate post after the dust settles). I just really like the presentation and functionality of this tool. The "main" html report is very tasteful with expand/collaspe java snippets, go flow, and tons of useful information. The comparison did a great job at highlighting areas of concern, and his "Observations" often make me think about something, even if only to affirm. I like the flexibility of grabbing a sql_id, sql hash or plan_hash.

True, it is not perfect yet. Very close, though. I do not like how the install scripts ask for a "schema user" or "application user". To work around that, I create a new role (SQLTXPLAIN_ROLE) and provide that as my "application_user". Then, whenever I want to run SQLTXPLAIN, I just grant/revoke that role to the real application userid. Also, altering the parameters are a little kludgy. I love that Mr. Sierra has made the parameters available and has provided documentation for them (a pleasant surprise), but it would be nice to specify them at run time in a more "normal" fashion. Like, having a flag to turn off the table exports. And what about using datapump instead? =)

One other thing that I am still working on is to get it to work as a client-side piece. After working through a few kinks (ie, my bads) with Sierra, the only portion that bothers me a little is that it keeps asking me for the password for exporting tables. For the most part, I want to skip that.

Anyway, this tool is AWESOME! I love it! I will still use dbms_xplan a bit (afterall, this tool does as well), since it is much lighter weight. But I am definitely adding SQLTXPLAIN to my toolbox and telling others about it. I have also had a few exchanges with Carlos, and he is definitely on the ball.

I am extremely curious about one thing yet. There are a number of interesting tables in the SQLTXPLAIN schema. What is SQLT$_10053_PARSE for? I have never seen it populated with any information. Is there a tool that parses the 10053? Can I have it?!? =)

Now I wonder what other tools BDE has hidden away....