After a few days of spinning my wheels and subjecting the poor recipients of oracle-l to multiple posts, I have identified an issue in Oracle code that I believe needs to be looked at.
First, some background.
We are running Oracle EE 18.104.22.168 on Solaris 10. We also have a job that occasionally bzips (compresses) the alert.log. The logic in the job is supposed to check if the file is actively being written to before zapping it, but by pure chance (so it would seem), in this particular case the alert.log was still open by the database when the file was scorched. This led to the appearance of the alert.log not receiving any more updates from the database. We attempted to bounce the database which had no discernible effect. I also changed the diagnostic_dest, which caused us to go from slightly strange to absolutely bizarre, and what opens the door for the rest of this post.
What I found
After changing diagnostic_dest several times, posting on oracle-l, the Oracle Community forums and playing tag with an Oracle Support Analyst, and doing lots of truss commands against sqlplus, I started to focus on this result from truss:
access("./alert.log", F_OK) = 0
Now, you may notice that this "access" command is saying that the file in question ("./alert.log") is legit. This caused no small amount of head-scratching. I got the same results no matter which directory I ran the commands from. In my system, I only had two files with this name, one in $ORACLE_HOME/dbs and one in $DIAG/trace. Neither were actively updated by the database. It was not clear to me, at first, that Oracle was finding one of these log files. Especially since it never did anything with it. I searched file descriptors in /proc/*/fd and found nothing. I even grepped keywords from all text files looking for strings that should show up in this particular alert.log.
For the life of me, I could not figure out what directory ./alert.log was in. When I compared to other databases, this same access always returned Err#2 ENOENT. So I knew this must be key, but not sure exactly how. On a whim, I decided to delete the alert.log in $ORACLE_HOME/dbs. Lo and behold, the problem seemed to go away magically.
So here is the root problem, in my opinion. The Oracle code line is looking for $ORACLE_HOME/dbs/alert.log, but completely fails to write to the file if it is found. Instead, the branch simply exits out. How is that helpful?
I believe when I changed diagnostic_dest to a non-existing directory, Oracle automatically created alert.log in $ORACLE_HOME/dbs. I guess I learned a few things from this. :) Also, I learned a few tidbits along the way. One can use KSDWRT to write messages to the alert.log. Dan Morgan's library (still hosted by PSOUG) shows this. Also learned a little more about truss and dtrace as I was researching this issue.
Now the hard part; convincing Oracle that this is a problem and needs to be corrected.