Tuesday, July 24, 2007

RAC is not easy

I have a good friend to whom I like to complain a lot (sounds like my good friend is getting the short end of the stick, doesn't it?). With good intentions, he sincerely suggested that RAC is easy. I put in a fence, working 43 hours over a 4-day weekend. Shall I call that easy? It only lasted 4 days. He is still a good friend, he just thinks everything is easy. *grin*

K Gopalakrishnan once said:
Believe me RAC install is very simple and straight forward if you religiously complete the pre requisites.

OK, enough of the griping. My hope in posting some of our specific dilemmas is to document what we are doing, since so often I google for a specific error and never find anything useful (I am not counting the Chinese sites because, even after Google translates them, I am not sure I see any resolutions, only the questions).

So, first problem. I am following Metalink note 357261.1 very religiously. That was a mistake, as the note is not yet complete. Anyway, you attempt to remove ASM:
srvctl remove asm -n urbdb1,urbdb2

You check to make sure it was removed:
srvctl config asm -n urbdb1
+ASM1 /u01/app/oracle/product/asm

It was not, so you try again:
srvctl remove asm -n urbdb1
PRKS-1033 : Failed to remove configuration for ASM instance "+ASM1" on node "urbdb1" from cluster registry, [PRKS-1023 : Failed to remove CRS resource for ASM instance "+ASM1" on node "urbdb1", [CRS-0214: Could not unregister resource 'ora.urbdb1.ASM1.asm'.]]
[PRKS-1023 : Failed to remove CRS resource for ASM instance "+ASM1" on node "urbdb1", [CRS-0214: Could not unregister resource 'ora.urbdb1.ASM1.asm'.]]

What do you do?

Oracle Support has told me that crs_unregister is buggy and not supported. *cough cough* But I am going to attempt it anyway, since Bill Wagman had some luck with it (if you follow the discussion from oracle-l, you will see that Peter McLarty suggested it).

/u01/app/oracle/product/crs/bin: crs_unregister ora.urbdb1.ASM1.asm
CRS-0214: Could not unregister resource 'ora.urbdb1.ASM1.asm'.

/u01/app/oracle/product/crs/bin: oerr crs 214
214, 0, "Could not unregister resource '%s'."
// *Cause: There was an internal error while unregistering the resource.
// *Action: Check the CRS daemon log file.

Grrr... How quaint, check some log file somewhere on your system, and that will solve all your problems. Having no idea where my "CRS daemon log file" actually is, I use RDA to browse around and finally come up with /u01/app/oracle/product/crs/log/urbdb1/crsd/crsd.log. Unfortunately, the CRS daemon log file is not helping me much. What am I looking for?
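One starting point, assuming the resource name from the error actually shows up in the log, is to pull out only the lines that mention it. This is a minimal sketch: the function name and the 20-line limit are my own illustrative choices, and it makes no claim about what crsd.log will contain.

```shell
# Sketch: show the most recent lines of a log file that mention one CRS resource.
# resource_log_lines is a hypothetical helper name, not an Oracle tool.
resource_log_lines() {
    logfile=$1
    resource=$2
    # -F treats the resource name as a fixed string, so its dots match literally.
    # -n prefixes each hit with its line number in the log.
    grep -nF -- "$resource" "$logfile" | tail -n 20
}

# Hypothetical usage with the path and resource named in this post:
#   resource_log_lines /u01/app/oracle/product/crs/log/urbdb1/crsd/crsd.log ora.urbdb1.ASM1.asm
```

The line numbers make it easier to open the log at the right place and read the surrounding entries.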

Update: 5:19 pm
After a day of reading manuals and discussing options with the fine folks on oracle-l, we still have the same problem, although I now have quite a few new tools on my belt. Yes, RAC is not easy; I think I have proved that. To be a little more granular, working with the OCR is a pain in the butt.

So, new tools.
  • strace: a very low-level OS trace utility. I did not benefit from this, but I was able to show the output to others smarter than I. I used it on srvctl and crs_unregister.
  • The "force" flag (-f) of certain commands, like srvctl. I believe it removed something, but I do not know what; I still have my root problem.
  • Appendix A of the Clusterware Deployment and Admin Guide: has a ton of information, most of which would probably be helpful under "normal" circumstances. Did I mention we still have our root problem? However, I have to give credit to the authors, for they did a great job. There is a lot of information about log file locations (wish I knew about that earlier), how to debug various components and resources, and some descriptions of the syntax used for commands. I thought the OCR section was quite thin; perhaps I am biased because I am looking for a specific solution.
  • SRVM_TRACE=TRUE: This is documented in the above Appendix A, but I point it out because it spews out a bit more information. While not immediately helpful, it seemed like something that I should file away.
  • USER_ORA_DEBUG: mentioned only once in the Appendix. I found out that you can crank this all the way to 5. I have no idea what it does or what the appropriate values are; Google is not giving much on it yet.
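For the record, the tracing tools in the list above can be combined roughly as follows. This is a sketch, not the exact commands used that day: the helper name and the /tmp output paths are hypothetical, and only SRVM_TRACE=TRUE, strace, and the srvctl command come from the post.

```shell
# Sketch: run one command with SRVM_TRACE=TRUE and capture everything it
# prints (stdout and stderr) in a file that can be read later or sent to others.
# run_with_srvm_trace is a hypothetical helper name.
run_with_srvm_trace() {
    outfile=$1
    shift
    # The variable is set for this one command only, so it does not
    # leak into the rest of the shell session.
    SRVM_TRACE=TRUE "$@" > "$outfile" 2>&1
}

# Hypothetical usage on a cluster node:
#   run_with_srvm_trace /tmp/srvctl_remove.out srvctl remove asm -n urbdb1
#
# The strace equivalent; -f follows child processes and -o writes the
# trace to a file instead of the terminal:
#   strace -f -o /tmp/srvctl_remove.strace srvctl remove asm -n urbdb1
```

Capturing to a file matters mostly because the output is long, and it is easier to hand a file to someone on oracle-l than a scrollback buffer.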

More to follow. My Support Analyst just requested that I reboot the nodes because he has no idea why the resources exist in the OCR, as ocrdump does not list them.

Update: 11:22 AM, Wednesday
LS Cheng on oracle-l pointed out what ended up being the winning goal.
crs_stop ora.urbdb1.ASM1.asm

I still do not completely understand why this is an issue, or even how one determines that this is the solution. I hope to hear back from LS Cheng so we can understand how he arrived at that conclusion.
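One reading of the suggestion, and it is only an assumption on my part, is that the resource has to be stopped before it can be removed from the registry. Under that assumption the order of operations would look like the sketch below. The function name is hypothetical; crs_stop and srvctl remove asm are the commands already shown in this post.

```shell
# Sketch: stop the CRS resource first, then retry the removal that failed.
# Assumption (not confirmed in this post): the removal fails while the
# resource is still running. stop_then_remove_asm is a hypothetical name.
stop_then_remove_asm() {
    resource=$1
    node=$2
    # Stop the resource; give up without touching the configuration if this fails.
    crs_stop "$resource" || return 1
    # Retry the srvctl removal for the node.
    srvctl remove asm -n "$node"
}

# Hypothetical usage:
#   stop_then_remove_asm ora.urbdb1.ASM1.asm urbdb1
```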

What we ended up doing was restoring the OCR to a point before we attempted to follow note 357261.1. Since the services were already down, it was straightforward to delete the databases, the ASM instance and finally the ASM database. I was actually surprised it worked so well, given all the problems and headaches we had yesterday.
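For anyone landing here with the same problem, the restore can be sketched roughly as below. Treat it as an outline under stated assumptions, not the exact steps we ran: it assumes you are root, that Clusterware is stopped on every node first, and that the backup path is taken from the output of ocrconfig -showbackup. The function name is hypothetical.

```shell
# Sketch: restore the OCR from a chosen backup file.
# Assumptions: run as root, with Clusterware stopped on all nodes beforehand.
# restore_ocr_from_backup is a hypothetical helper name.
restore_ocr_from_backup() {
    backup=$1
    # Refuse to continue if the named backup file is not readable.
    [ -r "$backup" ] || { echo "backup not readable: $backup" >&2; return 1; }
    ocrconfig -restore "$backup"
}

# Hypothetical usage:
#   ocrconfig -showbackup      # list the automatic backups and where they live
#   restore_ocr_from_backup <path reported by -showbackup>
```

The readability check is there only to avoid handing ocrconfig a mistyped path; it says nothing about whether the backup itself is from the right point in time.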

I will add another update when I learn more. Right now we are happy we have a RAC install back in working condition and can move forward with our projects. Oracle Support did not score any points in this round. This is becoming a bad trend.


Anonymous said...

I saw an ad for currency trading that said:

"The Euro is Easy to Trade"

but it did have this nice disclaimer:

Trading involves significant risk of loss and may not be suitable for all investors.

by analogy RAC is easy to install, but may not be suitable for all SA/DBAs/shops.

as one of the perpetuators of the easy tagline, maybe I'll start using a disclaimer as well like "RAC is easy .. with NFS"

Charles Schultz said...

- RAC is easy... if you managed clustered software in a previous life

- RAC is easy... if your environment is exactly like the documentation you are following

- RAC is easy... if someone else is doing it

- RAC is easy... once you get it installed and configured

Anonymous said...

trapped too here :)


Anonymous said...

Genius. Thanks for sharing this.

valiantvimal said...

Hi, I too got trapped here. But happily restored from OCR backup and now I am able to remove unneeded resource. It seems it is the only possible way to do what we desire.