Tuesday, September 25, 2007

RAC: "How not to design an application"

We have a new Luminis database that was being tested for an upgrade (4.01 or something like that). We found that we were constantly maxing out db_recovery_file_dest_size (database hangs, look in alert.log). The following is what I sent to my team members.

In RAC class, Andy warned us about the redo logs and how they can be a source of contention for some very basic things. Let us use LUMDEV as an object lesson.

Luminis has this little heartbeat function (I do not know the official name or the official purpose, but this is my observation) that updates 1 row of 1 table; a table with only 1 row. It does this continually. Better yet, this is execute not from one session, but from 7 sessions! So you have 7 sessions banging against this 1 row of this 1 table all the time. Obviously, the data in that row is not meant to stick around a long time.

Usually, in a normal database this is not such a big deal. Might cause some waits because of an exclusive transaction, but nothing that causes the rest of the database to wait or contend. In RAC, the world changes drastically.

Andy taught us that each instance masters its own list of global cache locks. These locks exist for every single block in a database (you can imagine what a nightmare it is for large databases). The list of master block locks are split up evenly between all participating nodes (and must be remastered when a node is added or deleted). These locks are often upgraded and downgraded (exclusive and shared), and often sent across the private interconnect (fancy word for NIC) to other instances. When an exclusive lock is requested on a different node than what is already holding the exclusive lock, the lock must first be downgraded, then re-upgraded for the requesting instance. Every time a lock is downgraded like this, the redo log buffer must be flushed.

See where this is going? You have 1 table with 1 row, meaning one block. Because the application is repeatedly requesting an exclusive lock from both instances, the global cache lock is constantly downgraded and upgraded. This produces many many many many many many archived logs for all those log buffer flushes. LUMDEV can easily consume over 18gb of redo space on this one operation alone.

Crazy, eh?

On a similar thread, I am not all that crazy about the dual management of both the ASM FRA diskgroup and db_recovery_file_dest_size. Here is what I posted to the ASM Oracle Forums:

For those of us who run multiple databases on a box with shared disk for FRA, I am finding the extra layer of ASM and db_recovery_file_dest_size to be a minor inconvenience. The Best Practice white papers I have found so far say that you should use db_recovery_file_dest_size, but they do not specify how you should set it. Currently, we have been setting db_recovery_file_dest_size rather small, as the databases so far are small and even at 3x the database size, the parameter is still significantly smaller than the total disk available in that diskgroup.

So, my question; is there any downside to setting db_recovery_file_dest_size equal to the total size of the FRA diskgroup for all databases? Obviously, this means that the amount of free space in the diskgroup may be consumed even if db_recovery_file_dest_size is not yet full (as reflected in the instance V$RECOVERY_FILE_DEST). But is that really a big deal at all? Can we not simply monitor the FRA diskgroup, which we have to do anyway? This eliminates the need to worry about an additional level of disk management. I like to keep things simple.

The question is relevant to folks using other forms of volume management (yes, I know, ASM is "not a volume manager"), but seems germane to the ASM forum because most articles and DBAs that I have talked to are using ASM for FRA.

Most importantly, what ramifications does "over-sizing" db_recovery_file_dest_size have? Aside from the scenario above.

I am really toying with the idea of making the recovery destination super-large and relying on the management of the FRA. To check space issues in the FRA, I can use asmcmd (the "du" output sucks, IMO) or query against the instance itself:

2 where NAME = 'FRA';

FREE_MB % Space Used
---------- ------------
43426 .37

I am still looking for a way to generate a "du"-like report for all databases in the FRA.

In the database, I can check the amount of space in the recovery area:

SQL> select (SPACE_LIMIT-SPACE_USED)/1024/1024 "Space Left (MB)", trunc(SPACE_USED/SPACE_LIMIT,2) "% Space Used" from V$RECOVERY_FILE_DEST;

Space Left (MB) % Space Used
--------------- ------------
18624 .04

I would hate to have to login to each database to find out how much recovery space is left before the database hangs.

Arul made a most informative and excellent response:

Basically, set your db_recovery_file_dest_size appropriately, and you will not have to worry about the ASM diskgroup at all; let Oracle maximize/optimize the space when the parameter set to size that tailored to the needs of the database.