Monday, May 16, 2011

alert.log appears not be updated

After a few days of spinning my wheels and subjecting the poor recipients of oracle-l to multiple posts, I have identified an issue in Oracle code that I believe needs to be looked at.

First, some background.
We are running Oracle EE 11.1.0.7 on Solaris 10. We also have a job that occasionally bzips (compresses) the alert.log. The logic in the job is supposed to check if the file is actively being written to before zapping it, but by pure chance (so it would seem), in this particular case the alert.log was still open by the database when the file was scorched. This led to the appearance of the alert.log not receiving any more updates from the database. We attempted to bounce the database which had no discernible effect. I also changed the diagnostic_dest, which caused us to go from slightly strange to absolutely bizarre, and what opens the door for the rest of this post.


What I found
After changing diagnostic_dest several times, posting on oracle-l, the Oracle Community forums and playing tag with an Oracle Support Analyst, and doing lots of truss commands against sqlplus, I started to focus on this result from truss:
access("./alert.log", F_OK)              = 0

Now, you may notice that this "access" command is saying that the file in question ("./alert.log") is legit. This caused no small amount of head-scratching. I got the same results no matter which directory I ran the commands from. In my system, I only had two files with this name, one in $ORACLE_HOME/dbs and one in $DIAG/trace. Neither were actively updated by the database. It was not clear to me, at first, that Oracle was finding one of these log files. Especially since it never did anything with it. I searched file descriptors in /proc/*/fd and found nothing. I even grepped keywords from all text files looking for strings that should show up in this particular alert.log.

For the life of me, I could not figure out what directory ./alert.log was in. When I compared to other databases, this same access always returned Err#2 ENOENT. So I knew this must be key, but not sure exactly how. On a whim, I decided to delete the alert.log in $ORACLE_HOME/dbs. Lo and behold, the problem seemed to go away magically.

The BUG
So here is the root problem, in my opinion. The Oracle code line is looking for $ORACLE_HOME/dbs/alert.log, but completely fails to write to the file if it is found. Instead, the branch simply exits out. How is that helpful?

In retrospect....
I believe when I changed diagnostic_dest to a non-existing directory, Oracle automatically created alert.log in $ORACLE_HOME/dbs. I guess I learned a few things from this. :) Also, I learned a few tidbits along the way. One can use KSDWRT to write messages to the alert.log. Dan Morgan's library (still hosted by PSOUG) shows this. Also learned a little more about truss and dtrace as I was researching this issue.

Now the hard part; convincing Oracle that this is a problem and needs to be corrected.

3 comments:

Charles Schultz said...

FYI - this is very repeatable. Just touch a alert_$SID.log in $ORACLE_HOME/dbs and viola, your alert.log is as good as dead. :)

Works in 11.1.0.7 and 11.2.0.1

Charles Schultz said...

Reproduces in 11.2.0.2 on Solaris. Does not reproduce on 10.2.0.2 or 10.2.0.4. Another user is saying it does not reproduce on 11.2.0.2 HP-UX.

Charles Schultz said...

Simple way to Do-It-Yourself - if the "access" line in the truss file has a return code of 0 and the alert_MYDB.log file is never opened, you have hit the bug:


/u01/app/oracle: dbs
/u01/app/oracle/product/11.2.0.2/dbs: ll ale*
ale*: No such file or directory
/u01/app/oracle/product/11.2.0.2/dbs: db
MYDB
/u01/app/oracle/product/11.2.0.2
/u01/app/oracle/product/11.2.0.2/dbs: touch alert_MYDB.log
/u01/app/oracle/product/11.2.0.2/dbs: truss -aefo test.txt sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Mon May 16 16:07:34 2011

Copyright (c) 1982, 2010, Oracle. All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> show parameter timed_statistics
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
timed_statistics boolean TRUE

SQL> alter system set timed_statistics = true;

System altered.

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
/u01/app/oracle/product/11.2.0.2/dbs: grep alert_MYDB.log test.txt
4713: access("./alert_MYDB.log", F_OK) = 0
/u01/app/oracle/product/11.2.0.2/dbs: rm alert_MYDB.log