Tuesday, June 26, 2007

RAC: The big install

So, we are attempting our first "real" RAC install; not canned, not pre-fabricated, but using only software downloaded from OTN and following Oracle Documentation and various forms of cliff notes. This is one of those things that is really sweet if it works 100%. Otherwise, you are in for a headache. We have a headache.

That RAC class was good for teaching things, but it also perpetuates a false sense of security when things go wrong. And from what I can tell from all the notes and pleas for help out there, things go wrong often. One of the mantras I hear is to follow the documentation exactly! This is all good, but the documentation itself comes in many forms. Do you follow what Oracle has said, or do you pick some expert (K Gopal or Julian Dyke) and follow what they say? Cluster Verify (cluvfy) is also a little misleading; it will not check for 100% compatibility with the installed RPMs. In fact, I even had one Oracle Support analyst tell me that was the DBA's job. That is a lot to swallow. Take a DBA who does not know anything about Linux and tell him to verify that 24 RPMs not only exist, but are compatible with the required RPMs. I tried to write a script for it, but in the end, the only "failsafe" way to do it is by hand. I say "failsafe" because human error plays a large role in these RAC-related problems as well.
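
For what it is worth, here is the flavor of what I was trying to script. The package names and minimum versions below are placeholders, not the real list of 24, and the "compatible version" comparison is exactly where this approach falls down:

    #!/bin/bash
    # Sketch only: placeholder packages, not the actual required RPM list.
    REQUIRED="binutils:2.15.92 gcc:3.4.6 libaio:0.3.105"
    for entry in $REQUIRED; do
        name=${entry%%:*}
        minver=${entry##*:}
        if rpm -q "$name" >/dev/null 2>&1; then
            instver=$(rpm -q --qf '%{VERSION}\n' "$name" | head -1)
            # The hard part: "newer or equal" is not a simple string compare,
            # which is why I ended up checking these by hand.
            echo "FOUND:   $name $instver (need >= $minver)"
        else
            echo "MISSING: $name (need >= $minver)"
        fi
    done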

It would seem to me that one good way to eliminate, or at least reduce, human error is to automate. Dell IT has taken this to extremes and automates a vast majority of their day-to-day tasks. Checking the RPMs is just a small fraction of something that could easily be automated. What about user equivalence? What about all those silly root scripts? Or running oracleasm to configure and create disks by hand? What boggles my mind is that 10g RAC does so much that is really cool and automated; when the sun shines, life is good! Why are some basic things left out, when you have a nifty tool like cluvfy that is really slick at verifying a good chunk of your install work?
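
Even something as dumb as this would beat typing oracleasm commands by hand on each node (the device names here are made up; yours will differ):

    #!/bin/bash
    # Run as root on one node; sketch only, with hypothetical device names.
    /etc/init.d/oracleasm configure   # still prompts interactively, sadly
    i=1
    for dev in /dev/sdb1 /dev/sdc1 /dev/sdd1; do
        /etc/init.d/oracleasm createdisk "VOL$i" "$dev"
        i=$((i+1))
    done
    # On the remaining nodes, just pick up the disks:
    #   /etc/init.d/oracleasm scandisks
    #   /etc/init.d/oracleasm listdisks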

Ironically, our CRS installation was hunky-dory. The rootpre.ksh was a bit weird (why is it checking for 9i CM??), and double-checking all the paths and homes is about the only thing that slowed us down. Things went south when it was time to install ASM. Our first warning flag was that the swap space was not big enough. Thinking it was a red herring, we ignored the warning. Later on, after the software was installed and the configuration assistants were running, we hit our first major roadblock: link not satisfied on njni10. Not much that seemed relevant on Google or Metalink. Oracle Support told us to attempt the installation again. Now think about this: the analyst assigned to us specializes in NetCA (that is why we filed the SR). This guy tells us to simply re-install ASM. Having had ASM problems in class, I was not exactly happy about that. Remove Oracle Homes, zero out raw disks, make sure no processes are running, and away we go. This time around, ASM cannot see all the disks. So when I tell my support analyst that we have new problems, he has to bring in a database specialist because the original guy does not know anything about ASM. What a joke! On top of that, he "reminds" me to keep the scope of the SR to one issue. GRRR!!! Of course, we are subjected to the usual onslaught of new questions and requests for an RDA. I am actively ignoring them. We were able to work around a large number of our problems, but in the end, we want to simply wipe the slate clean and start over.

Deleting everything and wiping the slate clean is not easy. No sir-ee. This is where having root privs comes in really handy; because of someone's ultimately wishful thinking, the CRS Oracle Home is installed with root as the owner. By default, oracle does not have any privileges to remove or modify anything in the directory, and only limited privs to execute anything. For instance, running crsctl throws a "not enough privileges" error. Not to mention the slew of root-owned processes (crsd, cssd, evmd) that have to be dealt with.

What fun.

On a separate note, we were supposed to have a webinar with our ERP vendor (SunGard Higher Education, or SHE as some say) on the topic of Oracle RAC. *cough cough* I went with the intention of mildly heckling them, but they had technical difficulties with the virtual presentation. Sounds like even putting the letters R-A-C on something is prone to make it break. *grin*

Seriously, though, I know we will not be moving towards RAC any time soon for our production ERP system, and I am very curious to see how other schools manage it. In a morbid sense, I am also curious if they are buying the line from some sales person about how it will help their system, or some form of HA. RAC looks great on paper, but after scratching the surface as I have, it ain't all that pretty underneath. Don't get me wrong, as I mentioned earlier, it does a lot of cool stuff, and it does it well. But there are two sides to that coin, so it would be wise to keep things in perspective.

Wednesday, June 20, 2007

ceterum censeo

No, I am not advocating that Carthage must be destroyed. I think someone already beat me to that anyway.

But I did want to encapsulate what I can "bring home" from the RAC experience I had in Chicago. And I am obligated to include a sumptuous feast at Fogo de Chão. Vegetarians may wish to avert their eyes for the rest of this paragraph. I had heard that the salad bar was really good, stocked with fresh veggies, side dishes galore, deli meats, seafood, etc. I glanced at it once, from my chair on the other side of the room, and I saw a lot of colorful food. The rest of my attention was systematically focused on obtaining delicious cuts of meat from the passing "Gaúcho chefs". However, the, umm.... "chefs" were not really "Brazilian cowboys" at all (obviously), but it certainly sounds more impressive than "waiter". And since that is the worst of the cheese, I could live with that. But the meat! Wow! Slabs of hot, aromatic sirloin, lamb, pork, chicken or sausage passed by our table at least once a minute. They ask how well done you want it, and they carve it off. And this is not your typical beef buffet; these are high quality cuts that are grilled in ways that Outback can only dream of. After a while, one can narrow the choices of what you like and dismiss other selections. The gaucho guys just keep coming back, again and again. I treated my dad for Father's Day; I decided that is the way to do it. =)

So, is that a good segue to bringing home the "meat" of the RAC class?

First and foremost, RAC is not a panacea. Yes, it scores a lot of cool points, but do not let that fool you into thinking it will solve your problems. The Oracle Sales people are pushing it hard because it is expensive, not because of how well it helps you attain your goals. If anything, RAC is probably best suited to a niche market: customers who have applications that are well designed for parallelism, or at the very least, DML segregation.

After you swallow that pill, most everything else is rosy. One can work around performance issues (ie, a customer wannabe who thinks he is in the niche market) by decreasing or eliminating bottlenecks. Where will our bottlenecks be? That is probably one of the hardest questions to answer at this stage of the game, because a portion of the application is still being developed. Keep in mind that our initial foray into this field will be via a highly visible, yet low-load, online program. So, here are some items I think we should start with, so as to avoid having to worry about them in the future.

  1. As much as practical, maximize disk i/o for redo and controlfiles. Put them on independent, fast, highly available disks (perhaps DMX in RAID10).
  2. Provide expansion capabilities for the interconnect, either by allowing more (multiplexing) or swapping in a bigger pipe.
  3. In regards to the portions of the overall application that we have direct responsibility for, work hard to focus on making the DML either segregated or parallel. Do not merely copy Banner coding methods, which would have horrifying results in the long run.
  4. Be generous with the buffer cache.
  5. We need to decide how we want to move forward with application failover. Is it good enough to implement TAF? Or do we go with the Cadillac of FAN (using ONS)? Personally, I think FAN is like asking an electromechanical physicist to invent an automated device to rapidly accelerate a 2mm thick, 4" x 4" sheet of nylon towards a living Musca domestica. Some people refer to that as a "fly swatter".

In the context of administration, I think our group has to prepare for the coming paradigm shift. We use Grid Control a little, mostly for monitoring a couple databases here and there. That is going to change with RAC, where all the "easy" tools are distributed in a not-so-user-friendly fashion throughout Enterprise Manager. Not only that, but we are going to have to get used to the concept of connecting to an instance vs administering the database. We will have to learn srvctl, crsctl, ocrconfig, crs_stat, .... you get the picture. RAC is not merely a database with two different memory regions running two different sets of background processes; we have Clusterware, OCFS2 and ASM to monkey with. RAC does not increase the number of headaches by a factor of 2. No, it is more ambitious than that. Try a factor of 10.
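
A small sampler of the new vocabulary, just to give a taste (the database and node names are hypothetical, obviously):

    crsctl check crs                      # is the clusterware stack healthy?
    crs_stat -t                           # terse table of all registered resources
    srvctl status database -d RACDB       # all instances of a clustered database
    srvctl status nodeapps -n node1       # VIP, GSD, ONS and listener on one node
    ocrconfig -showbackup                 # where/when CRS last backed up the OCR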

I wrote in an earlier entry that we should probably take advantage of Resource Manager. Emphasis on probably. As with any and all new technologies, we have to seriously consider our business needs, and determine if the business needs drive the requirements for said technology. I am of the opinion that any "business need" that is fashioned into a generic statement like "we need 5 9's" should be fed into a shredder, burned, dissolved in acid, boiled in plasma and sent on its merry way to Sol. Ergo, ceterum censeo.

With all the documentation we have (whitepapers from Oracle and Dell, reference manuals, purchased books, recommendations from consultants and special interest groups), I am confident that we will be able to deploy RAC using Best Practices. Deploying our application in such a manner is going to be a different story.

Sunday, June 17, 2007

RAC class, the last day

What a week. And what a way to wrap it all up. I had to take some time away from it all; the class, blogging, thinking about it... But I do not want to delay too long, because I want to commit this to another medium before I start forgetting about it.

Friday, we covered the last 4 chapters in 4 hours. Or more like 3, if you consider we started at 9:15, took a couple breaks and ended at 12:30. There are a couple reasons we were able to rip through them so fast. You will see.

Chapter 9: Clusterware
Due to all the issues we had earlier in the week in our efforts to remove ASM from the OCR, I was looking forward to this chapter. While the sections that covered various CRS (Cluster Ready Services) commands were a little light, the combination of us having to dive into this stuff blind and having a great teacher like Andy facilitated my understanding of the material. Plus, the chapter goes over a bit of architecture, which I find very conducive to laying the foundation for the "big" picture.

Andy started off by addressing a fairly common problem of automatic (and frequent) instance restarts. Since the most common root problem is slow disk, one needs to introduce slight timing delays, as shown on page 9-38. Basically, for each resource in CRS, you can manipulate the AUTO_START (as), RESTART_ATTEMPTS (ra), and UPTIME_THRESHOLD (ut) parameters. The page suggests that you set these to 2, 1 and 7d (respectively) for the instance, ASM and the database.
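
If I read the page right, the change is made with crs_register against each resource. A hedged sketch, with hypothetical resource names for a database called RACDB (the same would apply to the ASM resource, e.g. ora.node1.ASM1.asm):

    crs_stat -p ora.RACDB.RACDB1.inst    # dump the current profile/attributes
    crs_register ora.RACDB.RACDB1.inst -update -o as=2,ra=1,ut=7d
    crs_register ora.RACDB.db -update -o as=2,ra=1,ut=7d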

To help speed up the interconnect, aside from increasing the bandwidth, one can also increase the TCP/IP packet size.

Since the OCR and Voting Disks are so critical, it is suggested that they be placed on isolated physical disks, and to either mirror them on the backend or multiplex them up front. Also, the CRS automatically backs up the OCR (default location = $CRS_HOME/cdata/$CLUSTER_NAME). Curious that the Voting Disks are not also backed up at the same time. Curious also that the book recommends one use symbolic links for the Voting Disk path, which makes it easier to restore a prior version (since the OCR cannot be modified directly, and it stores the path to the Voting Disk.... how else are you going to change it?).
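
For reference, poking at the backups looks something like this (the paths and raw device are examples only):

    ocrconfig -showbackup                      # list the automatic OCR backups
    ocrconfig -export /backup/ocr_manual.dmp   # logical export of the OCR
    crsctl query css votedisk                  # where the voting disk(s) live
    # Voting disks you back up yourself, e.g. with dd (device is hypothetical):
    dd if=/dev/raw/raw2 of=/backup/votedisk.bak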

One of the biggest problems with this scheme is that the OCR has to be synchronized with your software. If you have a database that was recently upgraded and you wish to revert to the downgraded version, you have to also restore the relevant point-in-time version of the OCR to match. That sounds like a major headache.

Andy recommends that one make the OCR 200MB and the Voting Disk 20MB.

The rest of the chapter deals with some specifics involving the OCR and CRS, giving brief examples of using the crs_* APIs. Not much to go on, as mentioned earlier, but at least a taste.

Chapter 10: Clusterware and RAC Diagnosis
The VERY FIRST point in this chapter is to make sure that all your nodes are using synchronized time (ie, NTP). Let me say that again. THE VERY FIRST point in this chapter is to make sure that all your nodes are using synchronized time. Why did I repeat that? In our lab, none of the RAC nodes were set up with NTP. This is a RAC class. There is no NTP. What is wrong with this picture? Several students in the class (us included) were unable to complete the labs on the first day because of this problem. And remember, it takes 6 or 8 excruciating hours to rebuild the machines. So keep that in mind: NTP makes a difference.
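
Lesson learned; before anything else, I will be running something like this (node names made up):

    for n in node1 node2; do
        echo "== $n =="
        ssh "$n" /usr/sbin/ntpq -p || echo "NTP is not happy on $n"
    done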

The rest of this extremely short chapter (like 10 minutes) focuses on CLUVFY, the "cluster verify" tool. It is very handy, very versatile, and I see lots of RAC experts out there using it in their documentation. Some other highlights from this chapter include a map of the clusterware main log files (yes, Watson, a map!); we are talking about 13 different log locations. Oh the insanity! There is also a perl-based diagnostics collection script which looks like it might actually be quite useful. It is located at $CRS_HOME/bin/diagcollection.pl.
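
For the record, the invocations look roughly like this (node names are hypothetical, and I am quoting the diagcollection flag from memory):

    cluvfy stage -post crsinst -n node1,node2 -verbose   # sanity-check the CRS install
    # Gather all 13-odd log locations into one bundle (run as root):
    perl $CRS_HOME/bin/diagcollection.pl -collect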

Chapter 11: Node Addition and Removal
I was looking forward to this chapter based on the problems we had with removing ASM. Surely, I thought, we were going to cover all those pesky CRS and OCR API commands we attempted to use. Ironically, Andy spent less time on this chapter than chapter 10 due to the fact that the complete chapter is obsoleted by Grid Control Release 3 (the book was written for Release 2). In a way, I was sorta glad; the chapter is simply full of screenshots, which have poor presentation quality if you are marching through them during a lecture. Bores me to death.

The one thing that Andy did say about cleanup operations that I wanted to pass along was that adding a node adds about 250MB to SYSAUX, and removing a node subtracts the same amount. So if you have a 16-node cluster, keep in mind that your SYSAUX is going to be over 4GB in size.

Chapter 12: High Availability
Finally the last chapter. Unfortunately, in an effort to breeze through this chapter, Andy reverted to reading a lot of slides which I find particularly unhelpful (because I can read them myself, thank you very much). Additionally, the whole chapter is a summary of Oracle's published Maximum Availability Architecture. But on with my notes.

As noted in my posts from IOUG, 11g will feature rolling upgrades. One of the big topics being pushed is that you will be able to upgrade from 10gR2 to 11g without downtime. I am sure there are strings attached, but we will have to wait and see. 11g is supposed to be "unveiled" this coming July 11th.

The mindset one must have when developing against a RAC is to not lock tables. Obviously, there are times when you are required to lock a table, but care must be taken to do so as infrequently and as quickly as possible. For the rare occasion when you have a really long row (lots of columns, large datatypes), here are some helpful hints:
- Keep frequently used columns at the front of the table definition
- Always define precision and scale

One question I got to thinking of: if we add a node to our Chicago RAC, will the Urbana failover cluster also get a new node? It should. It better! *grin*

The database will use either standard REDO or standby REDO, but never both at the same time. Hence, it is suggested to define both, and on the same disk.

Along the lines of establishing redo logfiles, the limiting parameter (MAXLOGMEMBERS) used to specify a very hard limit. In fact, you could not change the parameter without recreating the controlfile. This has been changed in R2 (or so I am told) and the following parameters are effectively dynamic (what does that mean?):
- MAXLOGFILES
- MAXLOGMEMBERS
- MAXINSTANCES

Also, if you plan to stay away from an RMAN catalog, it would be wise to bump up the MAXLOGHISTORY to 10,000.

Andy pointed out an interesting revelation: if you are successfully able to segregate all DML away from query operations, simply point your DML applications to the primary and redirect query users to a logical standby with all the required indexes. A big bonus for dividing the workload and use of the system.

Fast-Start Failover is also covered in the chapter, but Andy whipped through it with the comment "It DOES NOT work!"

To summarize the white papers and published Best Practices, page 12-23 has a chart. In light of everything we talked about this week, Andy made some corrections. For instance, in direct opposition to one of the suggestions, "DO NOT use Flashback Database!". Also, do not autotune checkpointing; do it manually.

I am going to close this post here, but I do hope to collate and organize my thoughts on what to "bring home" as we start our own RAC project. Definitely the Best Practices point out some good things to look into. My biggest concern is that we will not have a load to test with, hence some of the finer aspects of Performance Tuning are going to be hidden and we will have to go with what others have said.

Time for some real work. *grin*

Thursday, June 14, 2007

RAC class, day 4

We covered a lot of ground today, starting off with some really great stuff but ending the day with rather boring material.

Andy re-emphasized the need to be very careful about DML on RAC nodes, because Block Transfer and Cache Fusion can really bring down your performance. Some general strategies are to keep all DML on one node (and master the block locks for those associated tables on that node). Or you can intelligently design parallel DML across more than one node (remembering to include partitioned tables, even if the tables are a measly 20k rows). Also, when a truncate operation is performed, it must clear out all relevant blocks from all nodes.

So, Global Cache Services (GCS) is the big wait we need to watch out for. There are four distinct problems with GCS, each with its own solution:
- Interconnect: increase bandwidth with more interconnects and/or a bigger pipe
- Downgrading Global Cache locks: commit more often
- Flushes to redo: increase the size and number of redo logs
- Building and shipping blocks via LMS: increase the number of LMS processes (accomplished with an underscore parameter)

The LMD process requests locks for a block
The LCK process manages block locks
The LMS process builds and ships the block

You can see interconnect waits in v$sqlstats.
v$segment_statistics has RAC-relevant groups:
- Global Cache Services (GCS)
- Global Enqueue Services (GES)
- Messages Sent
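
Something like this should surface the hot segments and the SQL paying cluster waits (statistic and column names as I understand them for 10gR2):

    sqlplus -s / as sysdba <<'EOF'
    -- Segments with global cache activity (sketch; 10gR2 statistic names)
    SELECT owner, object_name, statistic_name, value
      FROM v$segment_statistics
     WHERE statistic_name LIKE 'gc%' AND value > 0
     ORDER BY value DESC;

    -- SQL ranked by time spent on cluster waits
    SELECT sql_id, cluster_wait_time
      FROM v$sqlstats
     ORDER BY cluster_wait_time DESC;
    EOF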


We spent a bit of time covering Services. I mentioned this on day 1; this isn't Kansas anymore, Dorothy. I am glad Andy spent so much time going over Services. However, he also spent a lot of time going over Resource Manager, which is when my attention started to slip a bit. But let's go over Services first.

Services are a grouping of sessions that do the same kind of work.

There are some really good attributes listed in the book (page 7-11). I can post them later, but there is a bit there so I am saving that for another day. Besides, I do not have the book here with me. One can create services in DBCA, EMGC or even on the command line if one is so compelled. Although I advise against the CLI for first-timers. I found the lab to be very helpful in introducing services; however, even though I got to create a service and connect to it, I still have a hard time visualizing how we would use them in "real life." I think this is why Andy tied it to Resource Manager. I do know that you can aggregate statistics to a service (in fact, if you do some Performance monitoring via EMGC, you can group by Services rather easily), but using Services with the intent of governing resource usage seems to be the key thing in a RAC environment.
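
The command-line version, for the brave (database, service and instance names all hypothetical):

    srvctl add service -d RACDB -s batch_svc -r RACDB1 -a RACDB2
    srvctl start service -d RACDB -s batch_svc
    srvctl status service -d RACDB -s batch_svc

The -r instances are "preferred" and the -a instances are "available", which is where the failover story starts to tie in.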

Andy walked us through the evolution of Resource Manager. Personally, my experience with Resource Manager has all been negative, even after using it on the OCM. I simply found it a bother and impractical. I guess if you do not use it often and have no business reason to use it, it does seem like an extraneous feature. So, anyway, Resource Manager allows one to cap the resource usage for your sessions, or group of sessions. You can manage cpu usage, estimated query time and UNDO space. With 10g, you can also limit idle time, including and specifically idle time spent blocking another session. We should probably take advantage of that at AITS. *grin* Yes, I was able to find one practical, down-to-earth use for this beast.
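
A hedged sketch of that one practical use, limiting idle blockers; the plan name is invented, the numbers are just examples, and the directive parameters are in seconds if I remember the API right:

    sqlplus -s / as sysdba <<'EOF'
    BEGIN
      DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
      DBMS_RESOURCE_MANAGER.CREATE_PLAN(
        plan => 'IDLE_LIMIT_PLAN', comment => 'sketch: boot idle blockers');
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan => 'IDLE_LIMIT_PLAN', group_or_subplan => 'OTHER_GROUPS',
        comment => 'everyone else',
        max_idle_time         => 3600,  -- an hour of idling, period
        max_idle_blocker_time => 300);  -- 5 minutes if you are blocking someone
      DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
      DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
    END;
    /
    -- then: ALTER SYSTEM SET resource_manager_plan = 'IDLE_LIMIT_PLAN';
    EOF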

However, we started discussing the rest of chapter 7 (services) and I was very distracted. First off, there is just too much junk in Chapter 7; too many slides showing you pl/sql code that Andy simply read. How useful is that? It was almost as if he was bored with this as well. *grin* Additionally, I got to reflecting on all this stuff. Oracle is encouraging us to manage the storage (ASM), the OS, database and application tier (Grid Control), everything to do with users and connections (Resource Manager).... and, as Andy said, "EVERYTHING!". Why? Why does Oracle want to push DBAs to have control of all this stuff? I am not comfortable with it. For one, how in the world can you expect anyone to be good at everything? Secondly, there are folks who already do a good job at a lot of those tasks, so why cross-train a database specialist in a new field? Is Oracle so greedy and power-hungry that they want everyone to think like they do? I do not. I am a DBA, and a DBA I want to remain.

Do not get me wrong; I think it helps to step over the line a little bit, but in both directions. It helps for the DBA to know some about the storage, and helps for the storage guy to know a little about the database. But I do not agree with the trend to consolidate job roles. Bad idea.

This was my big complaint for the rest of the day. We jumped into TAF (Transparent Application Failover), which was really cool, followed closely by FAN (Fast Application Notification) and ONS (Oracle Notification Services), both based on Advanced Queueing and both of which I despise. Again, why should the DBA take over the responsibility of guaranteeing that the application is always available? I am not arguing that the DBA should be clueless and ignorant, but rather that there should be no expectation that only the DBA has responsibility for those strategic goals.

So, let's go over something worthy of a class report. *grin*
Transparent Application Failover

First, a huge warning for those of us running 1 listener to serve lots of databases with lots of concurrent connections. The Oracle listener can only handle 110 connections per second. Ergo, Andy says
Everyone should have more than 1 listener for Production Databases!
He also suggests that if you do use more than 1 listener, it would be wise to avoid the default port of 1521, which can be very confusing. 10g databases will detect all listeners anyway, so do not worry about LOCAL_LISTENER.

To start at the basic end of TAF, one can enable Client load-balancing by using tnsnames.ora parameters LOAD_BALANCE and FAILOVER. Using at least 2 listeners, you can use these parameters to load balance and failover between the two. This works even if the listeners are on different nodes, or pointing to different instances.
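
Putting those pieces together, a tnsnames.ora entry might look like this (host names, service name and port are all invented; note the non-default port, per Andy's tip above):

    RACDB =
      (DESCRIPTION =
        (LOAD_BALANCE = ON)
        (FAILOVER = ON)
        (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1522))
        (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1522))
        (CONNECT_DATA =
          (SERVICE_NAME = RACDB)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
        )
      )

The FAILOVER_MODE piece is the TAF part; LOAD_BALANCE and FAILOVER alone just spread and retry the initial connections.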

Next, one can enable connection load balancing via REMOTE_LISTENER. And finally for the really good stuff. When a connection is made (create session), the dedicated server process registers against a table and stores information in the HEAP of the Shared Pool. In a RAC environment, the Shared Pool HEAP is automagically conveyed to other nodes via the interconnect. Meaning that remote nodes are always kept up to date with session information like execution status. So what happens if the node crashes? Since the session information persists in the HEAP of a remote node, the session can simply fail over to another node. How cool is that!?! It almost sounds too good to be true. I guess it is possible to be too good; perhaps I missed some critical details, or misunderstood what Andy was saying. Need to research that a little more.

We talked about some other tnsnames parameters like PRECONNECT, which can establish a shadow session on a secondary instance (ie, Standby). If the Primary fails, the failover occurs faster because the sessions already exist. The cost of doing this is that it takes longer to create the initial session.

We spent a bit of time talking about FAN and ONS, which then merged into a talk about the Scheduler and more on the Resource Manager. My brain started to check out, as per my rant above. To quote Andy (literally), "Blah blah blah blah."

One other thing I want to bring back to work is another question. How are we planning to keep our middle-tier Highly Available, specifically for Global Campus? I think we have the database nailed, especially (but not only) because of RAC. What about that application? =)

Wednesday, June 13, 2007

RAC class, day 3

We had another long, grueling day. I am the kind of person who loves labs, so since we did a ton of lecture today, I found it to be quite burdensome and hard to maintain my mental stamina. What kept me on edge, honestly, was the problem my partner and I were working on. As mentioned in a previous comment, we installed our ASM in the wrong directory and attempted to correct it by removing it via the OUI. Big mistake. That led us down a 2-day path of learning and excavation. *grin* This afternoon, I discovered Metalink Note 357261.1, which turned out to be a lifesaver. I had to try it a few times before we finally got it right, but we did get it right eventually. It was a mess.

So, on to the material we covered. Andy gave us a lot of good stuff; no, not a lot, TONS of good stuff! Too much for me to go over in one sitting. I took a lot of notes, so I will type as much as I can, but I am quite tired.

We spent a majority of the day talking about internals. In retrospect, it is easy to find oneself being treated to one of Andy's many expositions about Oracle internals; he is long-winded after all, and he does not jabber about innocuous, irrelevant drivel. First up was supposedly about ASM, but it turned out to be more in the context of disk i/o. He repeated something I had heard at Dell, but Andy went into such great detail that you cannot help but to utterly believe him.

Your hard drive is a spinning disk, and as everyone knows, the outer edge of the disk moves faster than the inner edge. Typically, the outer edge of the disk is filled first. One of Andy's findings was that if you have data on the inner edge and have to move the disk head that much, you can slow down your disk access by as much as 4 times! Solution: do not put anything on the inner edge. You can take that one step further and limit your disk use exclusively to the outer edge, vastly minimizing seek time and disk latency. So for a significant speed-up in i/o timings, one can give up 60% of their storage and only use the outer 40%. I am going to have to head over to Kevin Closson's site some time and read what he has said about that. I know this is an old idea that has been given a lot of airtime, but is it practical? Andy did qualify his statements and specified that this strategy is not practical for a majority of the datafiles; you would probably only want to use this on something like REDO logs. Something small and requiring high speeds.

Additionally, 2k inodes can typically consume 7% of the disk. Consider using larger inodes, like 16k, which can reduce "wasted space" down to 1%.

I also got to thinking, why do disk manufacturers not make separate heads for separate zones (outer, middle, inner)? Surely I am not the first to think of this, hence there must be a good reason this is not done. Google.....

We talked about the ASM architecture, focusing on the RBAL and ARB processes. We also got into BIGFILEs, the encoded block id # and ramifications. I will get into those another day.

We took a lengthy diversion into RMAN, learning about the history of the Recovery Manager and working up to the current version (10g, no 11g secrets unfortunately). Andy mentioned, among other things, how Oracle is replicating significant portions of data between the controlfile and the recovery catalog, paving a path for obviating the catalog. However, the catalog still is quite handy when using Data Guard, so it is not going to die quickly. Talked about 10g features like Block Change Tracking, Block Compression and Flash Recovery Area. One thing I did not know was that one can use the Flash Recovery Area (FRA) as a hot failover in case the primary datafiles become corrupted. For example, if your System tablespace becomes corrupt (a recovery scenario that requires the database to go down), one can opt to fail over to the FRA instead. How cool is that!?!

After that, we dug into RAC Performance Tuning. Again, we covered a lot of ground and I can only skim the treetops here, now. Talked about the impact of remastering global cache locks, and talked more in detail about what role the LMON and SMON processes play in instance recovery. Even though 10g introduced FAST_START_MTTR_TARGET, this parameter is not used (or rather, it is ignored) in RAC; there is an alternative with an underscore parameter: _FAST_START_INSTANCE_RECOVERY_TARGET.

RAC tuning follows these steps:
  1. Application tuning; not merely rewriting the sql, but more importantly segregating the dml intelligently to reduce block transfer
  2. Resize (read: double) and tune the Buffer Cache
  3. Reduce Full Table Scans in OLTP
  4. Use ASSM
  5. Increase the size of caches for sequences
  6. Use partitioning to disperse traffic
  7. Avoid hard parses
  8. Minimize locking
  9. Remove unselective indexes

We talked a lot about blocks and extents. Specifically data blocks and segments under manual or automatic control. Andy was adamant about explaining the underlying architecture (aka, internals) because, as he says, "If you know how it works, you can figure everything else out." I believe Andy is on a mission to explain exactly how badly Block Transfer (via Cache Fusion) can really bring down your interconnect, which is the lifeline of your RAC. Keep in mind that for a very simple scenario, if you have a dirty block that needs to be shipped to another node, you have to also ship two undo blocks (undo header and the undo block itself). So that is a minimum of 3 blocks per request. Now, what if you are accessing a row that has an index? Or 10 indexes? With a depth of 6? So now we are talking about an extra 10 * 6 blocks (a branch header block for each branch level for each index). Now, what if you are making an update to that row? Now you have to send all the associated undo for those index blocks. What if you are updating a 10 million row table? How big is your interconnect?

ASSM, or bitmap blocks, and multiple freelists help reduce the massive impact, but their contribution can be marginal. For example, using freelist groups is good, but with a lot of dml, freelists can migrate back to the header block, which is where you do not want them. ASSM also has a downside in pure Data Warehouses, because master block locks can increase exponentially.

We also went over ITL Waits. Andy calls these a poison pill (he has found a few poison pills, by the way). They are rather wicked because 1) they do not throw an error, 2) few tools can recognize them, and 3) they can really bite. Apparently, ADDM can report them. You also want to watch out for Row Migration (checked from CHAIN_CNT); migrated rows maintain a persistent hold on the ITL slot (as opposed to releasing it). And chained rows need 2 undo blocks.
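
To Andy's point that ADDM can report them, you can also go fishing yourself. A sketch, using the statistic name as I noted it:

    sqlplus -s / as sysdba <<'EOF'
    -- Segments that have actually hit ITL waits
    SELECT owner, object_name, value AS itl_waits
      FROM v$segment_statistics
     WHERE statistic_name = 'ITL waits' AND value > 0
     ORDER BY value DESC;

    -- CHAIN_CNT is only populated after ANALYZE ... COMPUTE STATISTICS
    SELECT owner, table_name, chain_cnt
      FROM dba_tables
     WHERE chain_cnt > 0;
    EOF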

Lastly, indexes. Skewed indexes in RAC are bad (look up a couple paragraphs about the interconnect freight train; moving all those index branch blocks). You can use reverse key indexes to relieve ordered, monotonically increasing keys like sequences. Or, you can cache a huge chunk of sequence numbers per node, which helps to disperse the range of updates to indexes. One last trick is to isolate all DML to one node so as to significantly reduce Block Transfer. Did we say Block Transfer can be the death of your system?
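
In SQL terms, those two tricks look like this (table, index and sequence names are made up):

    sqlplus -s / as sysdba <<'EOF'
    -- Reverse the key bytes so monotonically increasing values scatter
    -- across leaf blocks instead of piling up on the right-hand edge
    CREATE INDEX orders_id_rix ON orders (order_id) REVERSE;

    -- Or hand each instance a big, disjoint range of sequence numbers
    ALTER SEQUENCE order_seq CACHE 1000;
    EOF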

That is all I have time for tonight. I hope to go back at some point and hit the "lesser" points. We did have more famous Andy pictures. I guess I could say, "Wish you were there."

Tuesday, June 12, 2007

RAC class, day 2

We had a long day today, so this will be significantly shorter than yesterday's update. The good news is that we had a ton of hands-on experience today; almost 80% pure labs and very little lecture. Just the way I like it. =) We installed and configured Clusterware, ASM, the Agent and the database software (but not the database, yet). Amazing stuff.

Yesterday's news about Grid Control R3 was extremely exciting, but when I got home I was disappointed. I specifically asked if he meant 10.2.0.3, and Andy specifically said "No, Release 3". Well, he referred us to oracle.com/enterprise_manager, and what do you see? 10.2.0.3. I installed 10.2.0.3 into our Production Grid Control 2 weeks ago. Oh well. At least it is encouraging to hear that the version we have installed is receiving so much good press. In fact, Andy is of the opinion that Release 3 is the only version worth getting. Not quite sure what that means.

Since we were heavy into labs today, I did not take many notes. Also, we are spending an extra hour at class these next couple days, not to mention that I am going in early tomorrow. Bottom line, short post so I can get to bed. *grin*

But here are some tidbits. This will probably be old hat for those who already have experience with ASM, and keeping in mind that we are working with Grid Control Release 2 (some of the things we did today are packaged much better in Release 3, or so I am told).

Order of installation:
- clusterware
- ASM
- Oracle Software
- Management Agent
- Database

It is possible to use sqlplus to shut down all instances in a cluster, but unwise (use srvctl or GC instead). Shutting down each node individually is like an instance failure; read yesterday's post about remastering locks.
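
For the record, the sanctioned way looks something like this (database and instance names hypothetical):

    srvctl stop instance -d RACDB -i RACDB1    # bounce one instance
    srvctl stop database -d RACDB              # or the whole clustered database
    srvctl start database -d RACDB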

Since a BIGFILE tablespace only allows one datafile, and ASM takes the pain out of managing huge datafiles, might as well use BIGFILEs.

As we were setting up user equivalence (a fancy way of saying ssh key-pairs), I got to thinking "Why is this so complicated, so prone to error?" There is so much more that Oracle could do to automate the preparatory steps, even if only to provide a script to do it. Given all the automation that Oracle Grid Control does (heck, you can provision a Gold Image onto bare metal!!), I am surprised these little things slip through the cracks sometimes.
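
Even a quick-and-dirty sketch like this would remove most of the fat-finger factor (node names invented; run as oracle):

    #!/bin/bash
    NODES="node1 node2"
    [ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
    for n in $NODES; do
        # one password prompt per node, then never again
        ssh "$n" 'mkdir -p ~/.ssh; chmod 700 ~/.ssh; cat >> ~/.ssh/authorized_keys; chmod 600 ~/.ssh/authorized_keys' \
            < ~/.ssh/id_rsa.pub
    done
    for n in $NODES; do ssh -o BatchMode=yes "$n" hostname; done   # verify: no prompts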

RAC redo logs experience significantly more i/o than non-clustered databases. Andy recommends that the redo logs be placed in physically isolated diskgroups. While you are at it, why not make them raw (since you are never going to back them up or recover them anyway, right?).

Back to Undo segments. Still on the thought that Undo segments are nothing more than public rollback segments in an LMT, Andy went to lengths to demonstrate how to eliminate "Snapshot too old" errors. To boil it down and steal his thunder, the simple solution is to double the size of your UNDO tablespace. Or quadruple it. However, the longer answer involved more of Andy's famous pictures, and I learned a lot about Undo segments and how transactions use them. More later (hopefully).

As we were wrapping up, Andy talked about how srvctl needs the name of the database as a parameter. This is because your OCR (Oracle Cluster Registry) is a solo act (there is only one), and the OCR has information for all the clustered databases on the node. However, I am thinking to myself, "Hmm... our development hosts have 40 or 50 databases, but sqlplus does not need the database name". Srvctl really came back with a vengeance; it is going to take us a while to get used to it. I think it is conceptually cool, but it adds an unwelcome layer of complexity.

Ciao

Monday, June 11, 2007

RAC Admin class, Day 1

Wow, for a first day, Andy sure packed a punch. I am up in Chicago taking the 10g Real Application Cluster Administration class with Andy Fortunak. Andy is great, and I refuse to take any more DBA classes from other instructors unless they are on my short list (Andy Fortunak, Sue Jang or Rick Pandya). I am not going to go over Andy's bio at this point, but he really knows his stuff and has been teaching for quite some time.

So, I was one of 12 folks in the classroom, each of us occupying one computer. We were given 3 class books (or rather, powerpoint printouts with embellished notes); two of the books are rather thick, the third is merely the last 3 appendices (of 5). Actually, to put things in a little perspective, I learned a little later during the day that one of the fat books is 550 pages of lab material. Amazing! Full solutions with screenshots and step by step instructions. 550 pages!!

Andy is long-winded. Usually, this is a good thing, as he starts talking about the nitty-gritty details, stuff you will never hear from a sales person, a talking head or any Support Engineer. Probably not from any other instructor, either. But it took us about 30 minutes to get around to general introductions. The class is 5 days long, with 12 sections and the intro. Despite the book's "recommended" curriculum, Andy plowed ahead with his own agenda. Typical Andy Fortunak. We spent the entire day (until well past the "normal" closing time of 5 pm) talking through the Introduction. And boy oh boy, what an intro.

First off, myth 1: RAC will make your system run faster.
Truth: RAC has the potential to make certain things run faster (especially in a warehouse), but more than likely, your application will run slower. Or maybe you will not notice any performance difference at all. It depends.

Services are big in RAC. Service Names are just the tip of the iceberg. A majority of 10g RAC Services are applications (ie, code, pl/sql, java, etc). While little used, Resource Manager is meant to be used with Services and can be quite powerful when wielded correctly.

Later on (later in the week), we will go over Backup & Recovery for the OCR file; it is that important.

Ignore Enqueue Services (GES), tune Cache Services (GCS).

A sidenote about Banner and RAC; BAD IDEA!! At least, until the fine folks at SGHE can rewrite the application to take advantage of parallelized operations. Otherwise, performance is going to suck and not worth the investment. If High Availability is a requirement, there are other alternatives with much better cost-benefit ratios. I cannot support the idea that the U of I be the ones on the cutting edge in terms of RAC. SunGard really needs to lead the charge in this, and they need to prove their leadership with large loads.

The Keep pool is important for RAC. More later.

RAC does not guarantee 100% uptime. If 1 node goes down, the database enters a brown-out period while the Lock Monitor (LMON) process broadcasts to all surviving nodes that they must quiesce while the surviving nodes remaster all global block locks. That is fancy-speak for saying that a lot of work has to be done to redistribute the work that the now-dead node was doing. It takes a lot of time and resources to do that.

Automatic checkpointing (ie, fast_start_mttr_target = 0) in RAC is not good. You want better control over the checkpointing process because of all the extra events that trigger checkpoints and redo log buffer flushes in RAC.

Global Hash Partitioned Indexes can be good for non-partitioned tables with a sequential key.

Myth: UNDO is the next evolutionary step in Rollbacks.
Truth: UNDO = Public Rollback segments in a LMT (Locally Managed Tablespace)

RAC traditionally attempts to meet two objectives: scale-out and speedup. Scale-out is usually obvious; you add more hardware to handle more load. Speedup is more elusive; under optimal situations, you may see a max of 69% faster operations.

Andy drew many diagrams on the whiteboard which I will not replicate here. He went to lengths to demonstrate how the degree of parallelism can quickly consume resources. This is mostly due to the fact that you get (2 * parallel degree) query slaves, each of which inherits specifications like Sort Area Size from the PGA. If your sort area size is 1MB, and you set parallel degree at 4, you are not consuming a mere 1MB for sorting, but rather 9MB (1MB for the PGA + 8MB for the query slaves). Likewise, you also exponentially exacerbate the load on the CPU for each degree of parallelism. You start to get the picture that RAC is about parallelizing work. Tread here carefully. RAC solves memory and CPU bottlenecks by distributing the load among several nodes.

Global Temporary Tablespaces (or Temporary Tablespace Groups) are also another big thing. RAC has the ability to automatically associate an affinity between a Temp Tablespace in a Group and a participating instance in the node if you have at least one "Global" Temporary Tablespace per node. Also, since Temp tablespaces never need recovery (recovery is impossible), Andy suggests that you never mirror the disks (striping, a la RAID 0, is still a good thing, though).

Andy dropped a hint that Enterprise Manager R3 is out, and highly recommends using it over R2. Clarification: the 10.3.0.x version of Grid Control, not 10.2.0.3 of the database. He is supposed to provide a link for that in the near future.

Expect the SYSAUX tablespace to grow very quickly. This is because each instance maintains its own copy of the AWR, which is stored in SYSAUX. The default retention is 7 days; imagine what happens if you want to save 32 days.
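
For reference, the retention knob is a single call; the 32-day figure is just my example, and note the procedure takes minutes:

    sqlplus -s / as sysdba <<'EOF'
    -- 32 days * 24 hours * 60 minutes = 46080
    EXEC DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(retention => 46080);
    EOF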

Like Temp Tablespace Groups, Andy also recommends one UNDO tablespace per node. Allegedly, there is a parameter that aligns each UNDO with a specific node.

As opposed to the well-known practice of multiplexing controlfiles 3 ways, Andy suggests that controlfiles be multiplexed 2 ways over mirrored disk (RAID 1, not unlike a REDO setup). This is mostly for recoverability purposes; RAC is going to hit the controlfile even harder.

Andy says:
Oracle is moving away from Recovery Catalog like nobody's business

Apparently, Oracle is focusing more on the controlfile to keep track of RMAN. However, the one exception is that the catalog can still be very useful with Data Guard, due to the fact that the standby controlfiles are different than the primary's.

RAC Redo logs can be configured differently for each instance. I have yet to find out why (why it is possible, and why you would want to in the first place).

Andy was very concerned about the interconnects, Cache Fusion and Block Transfer. He is of the opinion that you need to spend the big bucks on the interconnect hardware. Interesting how Dell did not have quite the same opinion. It will be interesting to see how things bear out for us, but I suspect that with our small load, we will be fine with the "commodity" Gigabit NICs for the meantime.

Andy was also quite adamant about not using the Flashback Logs. I think he will follow that up with more details later in the week.

We spent a good amount of time on Lock Manager. At startup, RAC propagates a complete list of all blocks and assigns a lock for each one (yes, every single one) to each participating node. Each lock starts with NULL, and there is quite an interesting scheme to upgrade and downgrade block lock levels. What it comes down to is that downgrading a lock takes a long time. Andy went to pains to demonstrate this with an extremely detailed picture (he is known for such pictures). Also typical of Andy, he provided a number of "extra" slides that he had developed over the years (some dating back to 8i OPS), and gave some good reference information for the various lock states. Essentially, it is a very expensive operation to downgrade a lock because the node that masters the lock must flush the redo log buffer. So if Node 2 requests a block lock for a block that is already in Exclusive mode on Node 1, Node 1 must first downgrade the lock causing the log buffer to be flushed, then upgrade the lock mode for Node 2, and finally transmit the block and associated Undo information (2 additional blocks) to Node 2. You can see why an OLTP system would have a heyday with that. And as mentioned earlier, when a node crashes, all those mastered block locks must be redistributed amongst the surviving nodes.

And that was only what I wrote down. There is a lot more on the whiteboard, in the class book and in the supplemental slides.

Tomorrow we will dive into a full day of labs. Andy has promised that it will be challenging and fraught with sand traps; if we make a mistake, it takes about 7 hours to rebuild the computer. Don't ask me why; I have always been amazed at how disruptive and slow these computer rebuilds are for Oracle classrooms. But back to tomorrow. We are paired up such that each pair of computers will become one node in a two-node cluster. One person will be Node 1, who, it is rumored, will do more work. Node 2 must be done in a synchronized, serial fashion. I guess we will learn more. My partner is Bob, a Sys Admin from Northern Trust. I figured between the two of us, we have all the OS and DBA skills covered. =)

The rest of the week promises to be grueling. We will be starting at 9 and marching all the way to 6 (at the earliest). Hopefully we will have some reprieve on Friday. I am excited about the lab section tomorrow; we will be doing some raw install and configurations, starting with naked hardware. I hope to pass this experience on to my colleagues who are eagerly awaiting hardware of our own.