So, we are attempting our first "real" RAC install; not canned, not pre-fabricated, but using only software downloaded from OTN and following Oracle Documentation and various forms of cliff notes. This is one of those things that is really sweet if it works 100%. Otherwise, you are in for a headache. We have a headache.
That RAC class was good for teaching things, but it also perpetuates a false sense of security about what happens when things go wrong. And from what I can tell from all the notes and pleas for help out there, things go wrong often. One of the mantras I hear is to follow the documentation exactly! That is all well and good, but the documentation itself comes in many forms. Do you follow what Oracle has said, or do you pick some expert (K Gopal or Julian Dyke) and follow what they say? Cluster Verify (cluvfy) is also a little misleading; it will not check for 100% compatibility with the installed RPMs. In fact, I even had one Oracle Support analyst tell me that was the DBA's job. That is a lot to swallow. Take a DBA who does not know anything about Linux and tell him to verify that 24 RPMs not only exist, but are at versions compatible with the required ones. I tried to write a script for it, but in the end, the only "failsafe" way to do it is by hand. I say "failsafe" because human error plays a large role in these RAC-related problems as well.
It would seem to me that one good way to eliminate, or at least reduce, human error is to automate. Dell IT has taken this to extremes and automates the vast majority of their day-to-day tasks. Checking the RPMs is just a small fraction of what could easily be automated. What about user equivalence? What about all those silly root scripts? Or running oracleasm to configure and create disks by hand? What boggles my mind is that 10g RAC does so much that is really cool and automated; when the sun shines, life is good! Why are some basic things left out, when you have nifty tools like cluvfy that are really slick at verifying a good chunk of your install work?
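As a rough illustration of the RPM check, here is a minimal sketch in Python. It is not the script mentioned above and it is not anything Oracle ships. The package names and minimum versions in REQUIRED are placeholders for illustration only (take the real list from the install guide for your platform), the function names are my own, and the version comparison is a simplified stand-in for rpm's real ordering rules: it ignores epochs, architectures, and special characters such as the tilde.

```python
import re
import subprocess

# Placeholder requirements: package name -> minimum "version-release".
# These values are illustrative only, NOT Oracle's actual list.
REQUIRED = {
    "binutils": "2.15.92.0.2-13",
    "glibc": "2.3.4-2.9",
}


def _segments(text):
    """Split a version or release string into numeric and alphabetic chunks."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.findall(r"\d+|[A-Za-z]+", text)]


def compare_versions(a, b):
    """Simplified comparison (assumption: no epochs, no tildes).

    Returns -1, 0, or 1 when a is older than, equal to, or newer than b.
    """
    sa, sb = _segments(a), _segments(b)
    for x, y in zip(sa, sb):
        if x == y:
            continue
        if isinstance(x, int) and isinstance(y, int):
            return -1 if x < y else 1
        if isinstance(x, int):
            return 1   # a numeric chunk sorts newer than an alphabetic one
        if isinstance(y, int):
            return -1
        return -1 if x < y else 1
    # All shared chunks match: the string with more chunks is newer.
    return (len(sa) > len(sb)) - (len(sa) < len(sb))


def compare_evr(a, b):
    """Compare 'version-release' strings: version first, then release."""
    va, _, ra = a.partition("-")
    vb, _, rb = b.partition("-")
    return compare_versions(va, vb) or compare_versions(ra, rb)


def parse_rpm_output(text):
    """Parse 'name version-release' lines into {name: [version-release, ...]}."""
    installed = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            installed.setdefault(parts[0], []).append(parts[1])
    return installed


def installed_rpms():
    """Ask rpm for every installed package (requires rpm on the PATH)."""
    result = subprocess.run(
        ["rpm", "-qa", "--queryformat", "%{NAME} %{VERSION}-%{RELEASE}\n"],
        capture_output=True, text=True, check=True)
    return parse_rpm_output(result.stdout)


def check_requirements(installed, required):
    """Return one message per package that is missing or too old."""
    problems = []
    for name, minimum in sorted(required.items()):
        versions = installed.get(name)
        if not versions:
            problems.append(f"{name}: missing (need >= {minimum})")
        elif all(compare_evr(v, minimum) < 0 for v in versions):
            problems.append(
                f"{name}: found {', '.join(versions)}, need >= {minimum}")
    return problems
```

On each node you would call check_requirements(installed_rpms(), REQUIRED) and print whatever comes back. An empty list only means nothing in your own list was flagged; it says nothing about 32-bit versus 64-bit variants, so a by-hand pass is still worth doing.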
Ironically, our CRS installation was hunky-dory. The rootpre.ksh was a bit weird (why is it checking for 9i CM??), and double-checking all the paths and homes was about the only thing that slowed us down. Things went south when it was time to install ASM. Our first warning flag was that the swap space was not big enough. Thinking it was a red herring, we ignored the warning. Later on, after the software was installed and the configuration assistants were running, we hit our first major roadblock: link not satisfied on njni10. Not much that seemed relevant turned up on Google or MetaLink. Oracle Support told us to attempt the installation again. Now think about this; the analyst assigned to us specializes in NetCA (that is why we filed the SR). This guy tells us to simply re-install ASM. Having had ASM problems in class, I was not exactly happy about that. Remove Oracle Homes, zero out raw disks, make sure no processes are running, and away we go. This time around, ASM cannot see all the disks. So when I tell my support analyst that we have new problems, he has to bring in a database specialist because the original guy does not know anything about ASM. What a joke! On top of that, he "reminds" me to keep the scope of the SR to one issue. GRRR!!! Of course, we are subjected to the usual onslaught of new questions and requests for an RDA. I am actively ignoring them. We were able to work around a large number of our problems, but in the end, we simply want to wipe the slate clean and start over.
Deleting everything and wiping the slate clean is not easy. No sir-ee. This is where having root privs comes in really handy, because thanks to someone's ultimately wishful thinking, the CRS Oracle Home is installed with root as the owner. By default, oracle does not have any privileges to remove or modify anything in the directory, and only limited privs to execute anything. For instance, running crsctl as oracle draws a "not enough privileges" error. Not to mention the slew of root-owned processes (crs, css, evm) that have to be dealt with.
On a separate note, we were supposed to have a webinar with our ERP vendor (SunGard Higher Education, or SHE as some say) on the topic of Oracle RAC. *cough cough* I went with the intention of mildly heckling them, but they had technical difficulties with the virtual presentation. Sounds like even putting the letters R-A-C on something is prone to make it break. *grin*
Seriously, though, I know we will not be moving towards RAC any time soon for our production ERP system, and I am very curious to see how other schools manage it. In a morbid sense, I am also curious whether they are buying a line from some salesperson about how it will help their system or give them some form of HA. RAC looks great on paper, but after scratching the surface as I have, it ain't all that pretty underneath. Don't get me wrong; as I mentioned earlier, it does a lot of cool stuff, and it does it well. But there are two sides to that coin, so it would be wise to keep things in perspective.