Thursday, December 04, 2008

Upgrading clusterware/ASM from 10g to 11g - a complaint

I have a RAC setup with three environments: DEV, BCP and Production. Starting with my development cluster, I attempted to upgrade the clusterware and ASM from 10g to 11g. That was quite painful. I will not spend much time complaining about the documentation here (and there is a lot I could say *grin*). But allow me to take a little tangent.

I collected all my "constructive criticisms" and suggestions into what I hoped was a cogent document and fired it off to Oracle Support and Erik Peterson of the RAC SIG. Erik forwarded my note to a document manager, whose response was overwhelmingly encouraging and very understanding. The Oracle Analyst in Support who picked up my case was also quick to understand my frustrations. The two groups were able to collaborate and deliver a nutshell document that was supposed to supplant/supplement the online documentation.

For the clusterware upgrade, they hit it on the nose. I was able to upgrade the clusterware in my BCP environment with no problem. Well.... I admit, I am still very confused about the CRS and ASM software owners. There seems to be a push to use different users (in both cases) other than the normal "oracle" user. There is not much justification in the documentation, IMO; rather a bland "for security reasons" statement.

The ASM upgrade was a little trickier. All the documentation that I have, including the new nutshell cookbook, indicates that ASM can be upgraded in a rolling fashion. That is a lie! *grin* I have a new SR open that is exploring this little nuance (or not so little, depending on how much time you have spent trying to wrap your head around this). Apparently, ASM Rolling Upgrades are available in 11gR2. Oh..... well, guess I'll just go download 11gR2 then. Wait - it is not available yet?!? So the documentation lists steps that are not applicable to any version that is currently available? Did I read that right?

So confusing. The good news is that I was able to complete the BCP upgrade with less pain. Since the documentation strongly suggests that one create a new ASM user, there are a ton of manual steps to follow if you choose to go that route. However, smart DBA that I am *sly grin*, I kept the same user and was able to use the DBUA to go from 10g to 11g (10.2.0.3 to 11.1.0.6) - that was sweet. Why would I want to switch users and go through many manual processes? What is the rationale for that? The patchset upgrade from 11.1.0.6 to 11.1.0.7 was less friendly. As I mentioned above, all the documentation states that you should do a rolling upgrade. This is not possible. So I hacked my way through; I had to shutdown abort my instances, which left shared memory segments hanging around that I had to manually kill off. Not a big deal, but the error messages can be very confusing (e.g., ORA-12547: TNS:lost contact).
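For anyone who lands in the same spot after a shutdown abort, here is a minimal Python sketch of how leftover shared memory segments might be identified. It assumes Linux-style `ipcs -m` output, an OS software owner named "oracle", and that only segments with zero attached processes are treated as candidates. The function names and the sample listing are hypothetical, made up for illustration and not taken from a real system.

```python
# Hypothetical sample of Linux `ipcs -m` output (not from a real system).
SAMPLE_IPCS = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 65536      oracle     640        4096       0
0x5a3c1b2c 98305      oracle     640        2147483648 12
0x00000000 131074     root       644        80         2
"""


def orphaned_shm_ids(ipcs_output, owner="oracle"):
    """Return shmids owned by `owner` that have no attached processes.

    Assumes the Linux column order: key shmid owner perms bytes nattch [status].
    """
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; banner and header rows do not.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        if fields[2] == owner and fields[5] == "0":
            ids.append(int(fields[1]))
    return ids


def removal_commands(shm_ids):
    """Build (but do not run) one `ipcrm -m` command per segment id."""
    return [f"ipcrm -m {shm_id}" for shm_id in shm_ids]


if __name__ == "__main__":
    for command in removal_commands(orphaned_shm_ids(SAMPLE_IPCS)):
        print(command)
```

The sketch only prints the commands rather than running them, so the list can be reviewed first. On a live host the input would come from running `ipcs -m` as the software owner, and segments that still show attached processes deserve a closer look before anything is removed.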

I am looking forward to the document changes that will result from my contacts with Oracle.

13 comments:

Dan Norris said...

While I don't want you to make your postings into hate-fests, I do think it'd be helpful to post a little more detail about the errors you encountered so that others finding the same errors could find your solutions/workarounds to them (even if you were following incorrect documentation when you got the error). I'm glad you engaged Erik, too, he's awesome.

I've had several occasions to interface with the team that writes Oracle documentation, particularly database documentation. The people I know in that group have been at Oracle quite a long time and desperately want to do a good job and make the documentation better. I'm not surprised that you got good feedback from them either--they're probably starving for direct feedback like the kind that you likely gave.

Charles Schultz said...

Good point. And I really, really hope this is not a hate-fest. =) I encountered problems, and I hope others do not. Here is what I sent to them:

I document below the problems we encountered, but to summarize:

• The documentation is quite confusing, and in some cases, just plain wrong
• An Oracle-provided cookbook would be greatly appreciated


I started off with the documentation that came with the clusterware. We wanted to do a rolling upgrade, so we started with Appendix B. B.1, backup software, no problem. B.2, restrictions met, no problem. B.3, verify system with CLUVFY. Well, the document does not suggest which parameters to use. And then it says to see the “Oracle Database Upgrade Guide” for more information. What is that? Where is it? Why is it relevant? Note that I did find the book online.

So we turned to the online documentation and found a section that outlines more info about the utility, and took a stab at what we thought was needed. Check.

Section B.4 is where we really fell apart. This section is extremely confusing and poorly written. We are doing a ROLLING UPGRADE, yet the documentation states in the very first sentence “Shut down any existing Oracle Database instances on each node”. It further says “To upgrade using Oracle Clusterware, you must shut down all Oracle Database instances on all cluster nodes before modifying the Oracle software.” Section B.5.2 perpetuates this confusion. To make it worse, section B.6 is the step that should come first – in fact, I cannot fathom why step B.5 is even listed since this is the shipped version of the 11g clusterware, and no patching will ever be done using this particular software. That being said, section B.6 is extremely light on details and not complete in any sense of the word.

So, moving right along. Section B.6.4 suggests that we run the preupdate.sh script on each node we want to upgrade. However, it does not say when. Again, this is a ROLLING UPGRADE, and we do not want to shut down all the nodes at the same time. We decided to follow the actions on each node ONE AT A TIME. Also, this section does not offer any hints about what parameters to use for preupdate.sh. Going back to section B.5.3.5 we can find a very perplexing explanation about the crs_user. Again, we took matters into our own hands and named the crs user the same as our oracle software user “oracle”.

The GUI was fine, with the exception that we did one node at a time. Again, the whole “rolling” phase is very confusing and it was not clear to us if we wanted to deploy the software to both nodes at the same time.

The rootupgrade script of B.6.6 could be clarified a bit better (especially in terms of B.5.4.6); in previous installations, the script actually is $CRSHOME/install/rootupgrade – that should be explicitly stated in both subsections.

B.6.8 segues into an ASM upgrade which is helpful, but again there is no Database Upgrade Guide in the included documentation (not even for the 11.1.0.6 install package).

Jumping next to patch up to 11.1.0.7, we started with section B.5.4 in the clusterware documentation. While that is mostly ok, there are some steps that are obviously a bit weird. For example, B.5.4.5 says that one should run preupdate.sh after the patch runs. Why? Why would someone run a “pre” script after, and why run that script in particular that shuts down the CRS? We correlated the steps with the information in the 11.1.0.7 readme file – rolling upgrades are not clarified much over there either (see 7.7.2.1 Rolling Upgrade). The basic thought seems to be “shutdown all services on one node, install the patch, repeat for another node”. This sounds good on paper, but this is not the way the Oracle Universal Installer works; it will install the software on all nodes (the user does not have a choice which nodes can be patched), and promptly asks the user to run the root script after the patch has been deployed on all nodes. At what point is node 2 supposed to come down? Does it come down after running root.sh on node 1 and starting all services over there? The lack of lucidity is a bit frustrating.


The ASM upgrade was just as bad. There is a lot of PR that suggests that one can upgrade/patch ASM in a rolling fashion. However, when one starts to read the documentation, one finds out that rolling ASM upgrades are only available after one upgrades to 11g. So it finally became clear to us that we would have to take down the entire cluster to upgrade ASM from 10.2.0.3 to 11.1.0.6.

Following the online documentation (starting with the “Cluster ASM Upgrade” section of the Database Upgrade Guide), we first did the Install as suggested. However, it was not clear to us why we needed a 3rd (let alone a 4th or 5th) Oracle Home. This seems to be for the purpose of having a different owner, but the justification for a different owner is lacking. So we limited ourselves to the 2nd Home. When starting, the very first thing the documentation says is to use the “Pre-Upgrade Information Tool” (utlu111i.sql). This tool does not work in databases that are not open for read/write (i.e., ASM databases). That was obviously an error in the documentation. We also hit Bug 6197966, and so we shut everything down in the middle of the install. These should be fixed in all shipped versions of the software.

The DBUA (following OUI) bombed horribly. I am hoping to have the time to raise that issue with Oracle Support. We opted to do the ASM upgrade manually which seemed to proceed reasonably well. The “Cluster ASM Upgrade” documentation suggests that we run EMCP (step 7). What is EMCP? I have no idea, and we did not run it.

The last step was to apply the 11.1.0.7 patch to ASM. The included documentation says absolutely nothing, except to reference the online documentation. We attempted to follow the directions for “Using ASM Rolling Upgrades” in the Storage Admin Guide. The first step is:
ALTER SYSTEM START ROLLING MIGRATION TO '11.1.0.7.0';

The directions next say:
“After the rolling upgrade has been started, you can shut down each ASM instance and perform the software upgrade. On start up, the updated ASM instance can rejoin the cluster. When you have migrated all of the nodes in your clustered ASM environment to the latest software version, you can end the rolling upgrade mode.”

That is very unclear. We set instance 1 to start the rolling migration, and attempted the patch install via OUI. The patch does not allow one to specify a node, so all nodes were patched at the same time. When we attempted to shut down and restart ASM on node 1, we received a message that the instance could not start because it was in a rolling upgrade mode. When we attempted to connect to node 2, we were given a TNS connection denied error. We were essentially stuck, with no way to move forward. After failing to find any helpful documentation for our problems, I crashed ASM on both nodes and restarted them with no problems (now both at 11.1.0.7).
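As a reading aid, the order of operations that the quoted directions seem to intend can be laid out as a short Python sketch. It is only an interpretation: it assumes the software can be upgraded one node at a time (which the OUI patch install described above did not allow), the node names are placeholders, the function name `rolling_upgrade_plan` is hypothetical, and the closing `ALTER SYSTEM STOP ROLLING MIGRATION` statement is not quoted in this note but is what "end the rolling upgrade mode" is understood to mean.

```python
def rolling_upgrade_plan(nodes, target_version):
    """Return the ordered steps of an ASM rolling upgrade, one node at a time.

    This only builds a checklist; it does not connect to or change anything.
    """
    if not nodes:
        raise ValueError("at least one node is required")

    # Step 1: put the ASM cluster into rolling migration mode (run once).
    steps = [f"ALTER SYSTEM START ROLLING MIGRATION TO '{target_version}';"]

    # Step 2: cycle through the nodes so that only one is down at any time.
    for node in nodes:
        steps.append(f"[{node}] shut down the ASM instance")
        steps.append(f"[{node}] upgrade the ASM software on this node only")
        steps.append(f"[{node}] start the ASM instance so it rejoins the cluster")

    # Step 3: leave rolling migration mode once every node is upgraded.
    # Assumption: this statement is not in the quoted directions.
    steps.append("ALTER SYSTEM STOP ROLLING MIGRATION;")
    return steps


if __name__ == "__main__":
    for step in rolling_upgrade_plan(["node1", "node2"], "11.1.0.7.0"):
        print(step)
```

Seen this way, the sticking point is the middle step: the procedure depends on upgrading a single node while the others stay up, and that is exactly the choice the installer did not offer.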

I have two more environments to upgrade, but I require that these issues be resolved before I attempt this upgrade again. If you have read the note this far, I would appreciate any feedback you might have.

Thanks for your time,

References
Appendix B of the Clusterware Installation Guide included with 11g Linux clusterware (not attached):
file:///C:/Documents%20and%20Settings/sac/Desktop/clusterware_doc/doc/install.111/b28263/procstop.htm#BABEHGJG

Clusterware Administration and Deployment Guide – Cluster Verification Utility:
http://download.oracle.com/docs/cd/B28359_01/rac.111/b28255/cvu.htm#CWADD530

11.1.0.7 Readme notes (not attached):
file:///C:/Documents%20and%20Settings/sac/Desktop/README.html

Cluster ASM Upgrade Guide:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28300/afterup.htm#CEGHJHBA

ASM upgrade bug 6197966:
http://download.oracle.com/docs/cd/B28359_01/readmes.111/b28280/toc.htm

ASM Manual upgrade:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28300/upgrade.htm#BABHJIFJ

ASM Rolling upgrades:
http://download.oracle.com/docs/cd/B28359_01/server.111/b31107/asminst.htm#CHDIIGIE

A 2007 whitepaper that came up in a Google hit:
http://www.oracle.com/technology/products/database/clusterware/pdf/TWP_Clusterware_11g.pdf


3rd-party links
Upgrading the CRS from 10g to 11g:
http://jarneil.wordpress.com/2008/01/31/upgrading-to-oracle-11g-clusterware/

Upgrading the CRS and DBMS:
http://www.colestock.com/blogs/2007/09/upgrade-notes-from-10g-rac-to-11g-10203.html

Anonymous said...

Hi Charles,

Very informative post. Actually so interesting I DID read it, including the comments, up to its last sentence. We have a lot of two-site ASM RAC clusters, for which I am planning an 11g ASM-only upgrade soon to take advantage of the incremental DG rebuilds. Please let your readers know about any further development. I'll also try to keep you informed if we hit other issues.

Cheers

Christian Bilien

Charles Schultz said...

Thank you, Sir Bilien. =)

I assume your CRS is already 11g, since ASM 11g depends on CRS 11g. Right?

Going from 10g to 11g is actually quite easy with the DBUA. Unless you hit a snag like I did the first time around. If you like, I can forward you the "new" documentation notes from Oracle; I would love to post them here, but blogger is not quite as robust as WordPress, I am afraid. Unfortunately.

I would also like a discussion of why we need CRS_USER and ASM_USER. Maybe I'll post on oracle-l. Dan Norris is giving a presentation in 30 minutes about RAC for beginners, so I'll try to bug him there.

Dan Norris said...

The 3 OS users/software owners (CRS, ASM, RDBMS) are intended to enable true segregation of duties. The CRS guy should be able to manage CRS, but not ASM or RDBMS, the ASM guy to manage ASM but not CRS or RDBMS and RDBMS guy to do DB stuff, but not the other stuff.

In my testing, it felt a lot like this was someone's good idea that hadn't been quite followed through completely. Sort of reminded me of checkpoint_process in v7.3...it was recommended to turn it on, but if it broke, they could just tell you to turn it off :). I'm hoping that the separation will be cleaner under 11.2 (whenever that comes).

I would also say that separating the CRS owner from the other two is much cleaner than trying to separate ASM owner from RDBMS owner.

"See" you in 30 mins :).

Charles Schultz said...

Thanks, Dan.

So, who is my CRS guy? Who is my ASM guy? Personally, I want my CRS guy to be my Unix sysadmin, and my ASM guy to be my Storage Admin. But since this is Oracle software, they are leery of touching it with a 10-foot pole. Worse, my storage guy is not going to touch ASM with a 100-foot pole. =) If Oracle wants my storage guy to handle ASM, it needs to look like other Storage solutions (hhmmm... maybe EMC?). And sqlplus and/or asmcmd?? Ha!

When I talked to Kirk McGowan, he indicated that there was internal confusion about this separation of duties as well. I look forward to the day when it is ironed out.

Dan Norris said...

I agree, the "Oracle" moniker makes it harder to push these into the sysadmin or storage admin list of duties.

I have told the ASM guys for a long time that if they want to get past the "we're a Veritas shop" or "we can't manage ASM", I think it would help to eliminate sqlplus from the ASM tools and use only asmcmd to do all tasks. If a storage guy (or gal) sees sqlplus as required to do storage management, it'll fail for sure.

I think we're all trying to figure out the roles and responsibilities and I too look forward to sorting it out. Oh, and with Exadata now, it's just gotten about 1000% worse! :)

Charles Schultz said...

I am sure Christian Bilien, who is a storage guy at least part of the time, would be able to weigh in on that topic. =)

Did you mention asmcmd? We have a storage admin guy up in the Roosevelt Road Building on the corner of Roosevelt and Halsted. Oh wait, you are not downtown are you..... Anyway, I would love to witness you attempting to sell asmcmd to him. *grin*

Erik Peterson and Nitin mentioned EM at the end of today's session. While EM is better than asmcmd in some respects, there are still a host of dependencies that make EM a questionable solution for ASM management. I know Oracle is pushing it (which is one more reason to question it), but the fact of the matter is that Oracle provided tools are vastly different (in my observation) than the tools that storage admins usually use.

Dan Norris said...

Let me be more clear--I'm not trying to push anyone on asmcmd. However, I am saying that while asmcmd may not be anyone's favorite, it at least has a fighting chance of becoming a storage admin management tool (for non-DBAs). It is my opinion that while asmcmd may not thrill anyone, sqlplus as the primary management interface almost guarantees that only DBAs will manage ASM forever and leaves almost 0% chance that storage admins will participate.

Martin Berger said...

Charles,
can you give some more info about BUG:6197966? Unfortunately, MetaLink hides this one.
Has Oracle agreed there is a (docu-)bug about the inability to upgrade online within 11gR1?
Somewhat curious,
Martin

Charles Schultz said...

Martin, the 11g documentation actually has something to say about it:
bug 6197966.

Anonymous said...

Charles Schultz of OraJourn issues a complaint regarding upgrading clusterware/ASM from 10g to 11g.

Anonymous said...

Hi Charles,

I had to wait Friday evening before posting my reply.
- I'd be grateful if you could get the "new" documentation for me
- CRS 11g/ASM11g: I went too fast. I was implying that we would do both.
- CRS vs ASM vs DB operations: this used to be a touchy subject at my site. Because it is Oracle, and because DB servers are dedicated to the databases, we all agreed that the DBA would handle both the ASM and the cluster layer. This was possible because
1) the DBAs anyway are very much involved in storage design, even if the ASM is not used.
2) the DBAs could prove they were able to manage and troubleshoot the cluster layer.

Incidentally, there was a similar discussion about the new DBA tasks (DBA 2.0) on Jonathan Lewis' blog earlier in the month:
http://jonathanlewis.wordpress.com/2008/12/04/dba-20/

Cheers

Christian