Wednesday, August 27, 2008

Character Sets

I have been tasked with presenting on the topic of converting the character set for our Banner system databases. We have been through the exercise several times, which has been quite an eye-opening experience in itself (The DataPump Index work-around, dbms_pclxutil.build_part_index, creating tablespaces). Essentially, the vendor has decided in all its god-like wisdom to force all customers to start using a multibyte Unicode character set that will support international characters (AL32UTF8) when they upgrade to the latest version 8. Yea, not very practical for us at a Big Ten school.

Anyway, I have taken this thing to heart and have been learning a lot about codepoints and various facts about character sets. We will be converting from WE8ISO8859P1 to AL32UTF8, a single-byte ISO 8859-1 West European somewhat basic (eg, simple) codepoint to the latest'n'greatest multibyte grandaddy of Universal character sets. The big issue is making sure that all the data stays the same, even if physically it is stored a little different. Along the way, I have had to come to terms with phrases such as "Lossy Data" and "Truncation Data". April Sims has been a great help in answering some questions and getting me to focus my searches. In the end, Metalink note 252352.1 laid out an example that finally opened my eyes for Lossy Data.

I was having a hard time understanding why one must go through WE8MSWIN1252 first to resolve lossy data. And what was meant by the notation in Metalink Note 260192.1:
if you have a WE8ISO8859P1 database and lossy then changing your WE8ISO8859P1 to WE8MSWIN1252 will most likly solve you lossy [sic]


Why? Well, the Euro example in note 252352.1 is great. To paraphrase, the Euro is stored as codepoint 128 in WE8ISO8859P1. In this case, that character set essentially has an empty hole there, no previous definitions for that codepoint. So if one were to attempt to select it in some kind of "native" WE8ISO8859P1 environment, there would be nothing. So how did it get there? It is passed in from a different character set. And if using that same character set on the client side, you can easily retrieve the correct symbol.

Since WE8WSWIN1252 is a superset of WE8ISO8859P1 (meaning it has all the same codepoints for existing symbols, and then some), and it is most likely that a client in the US inserted data with a US Microsoft character set, the codepoint will be mapped to a valid symbol. Since that codepoint is "valid" in WE8MSWIN1252, it can then be converted to any other character set that also supports that symbol.

Where this falls apart (the "likly" part) is if the client using a "foreign" character set inserts a symbol that is mapped to a completely different symbol (via the codepoint) in our target client character set. In that case, you are screwed. =) Unless you would want to convert that data manually for each case of incompatible character sets. Not I.

Truncation data is a little easier to grasp. Metalink note 119119.1 mentions the English pound sign (£):
The pound sign for example is indeed 163 ( A3 in hexadecimal) in the WE8ISO8859P1 and WE8MSWIN1252 charactersets, but in AL32UTF8 the pound sign is code 49827 (C2 A3 in hex).


In our case, the fix is handled by setting nls_length_semantics=CHAR (opposed to the default BYTE). We also found out through experimentation that nls_length_semantics is completely ignored for SYS and SYSTEM objects; in fact, the data dictionary is always with BYTE semantics. Interesting, eh?

Metalink Notes:
  • 260192.1 Changing WE8ISO8859P1/ WE8ISO8859P15 or WE8MSWIN1252 to (AL32)UTF8
  • 225938.1 Database Character Set Healthcheck
  • 252352.1 Euro Symbol Turns up as Upside-Down Questionmark
  • 279068.1 Usage of Character Set Scanner Utility (csscan) version 1.1
  • 137127.1 Character Sets, Code Pages, Fonts and the NLS_LANG Value
  • 241047.1 The Priority of NLS Parameters Explained
  • 158577.1 NLS_LANG Explained (How does Client-Server Character Conversion Work?)
  • 179133.1 The correct NLS_LANG in a Windows Environment
  • 119119.1 AL32UTF8 / UTF8 (Unicode) Database Character Set Implications

Others:
Oracle® Database Globalization Support Guide 10g Release 2 (10.2)

Friday, August 01, 2008

Software Configuration Manager

I learned that Oracle Support has a couple Oracle blogs up:
http://blogs.oracle.com/supportportal/
http://blogs.oracle.com/Support/

I think I was just checking on them when I noticed that the supportPortal was being a little more aggressive looking for more feedback. Over the years, I have participated in many of the Survey's that Oracle sends you after you close an SR, and have made no small number of comments (both good and bad) to a growing number of contacts I have established within Oracle, so this seemed like yet another opportunity to voice my opinion. I think I do that much more when I get the impression Oracle is actually listening; I don't share the same thoughts here because, well.... I highly doubt Oracle is reading this. =) No offense.

So let me blur the lines a little. First, I think Oracle is doing a much better job at collecting client feedback. However, the more critical part is what they are doing with that feedback. Unfortunately, for us end users, those comments seem to disappear into a black box, never to surface again (much like Enhancement Requests). I chatted with some really nice folks at Oracle (Kristal and Cherryl) and they both reassured me that Oracle is taking our comments seriously, and in fact hinted that the ongoing evolution of Metalink 3 is largely due to user feedback, and even newer things are coming down the pike.

Taking the invitation from the SupportPortal blog seriously, I called up for an appointment and got hooked up right away. I think the lady running the feedback session was working at home (heard a small child in the background), and she had some setup work to accomplish before we got down to it. But 90 minutes later I had made numerous observations about many aspects of the Software Configuration Manager and Metalink 3 in general. It was refreshing (and exhausting) to not only state some things I either liked or did not, but also to demonstrate via a recorded web session. Very satisfying; I recommend this to anyone who deals with metalink on a frequent basis.

So, what did I say? In general, I like the eye-candy of the new interface, but it is also a little slow and sometimes downright confusing. For instance, there is no way to bookmark bugs. The options to configure the layout are a nice thought, but still a little limited; it reminds me of how limited blogger is compared to WordPress, if that helps at all. Don't get me wrong, I think it is quite handy to configure the layout, but I really look forward to improvements. The key point, Software Configuration Manager, is a great concept. Unfortunately, I have some political roadblocks to fully utilizing SCM with my employer, which makes it a little frustrating when attempting to comment on the SCM because it is not SCM's fault at all. =) We require that we either have separate CSI numbers, or implement some sort of access to control such that not all our end-users have equal access to a site-wide license. But if we ignore that for a little while, I think SCM is awesome and has a lot of potential. Just the fact that it helps speed up SRs significantly is a big bonus. I have heard some misgivings about allowing Oracle to retrieve information in such a fashion, even if only passively, but I personally do not have an issue with it.

I do not like how some of the questions are completely inane. For instance, 99% of my SRs are for the Oracle Server - Enterprise Edition. 99% of the time I forget what exactly my "Product" is called and have to hunt for it amongst a list of several hundred product options, most of which we are not even licensed for. Why? What about a list of 10 most recent products for my CSI? What about a list that only contains products I am licensed for? And when attempting to find a category that somewhat matches the subject of an SR, I hate having to select one from the list of 30 or so. I want a searchable list. And a fast search at that; the LOVs in the patch pages really suck.

I concluded by saying that it would be vitally important to me see how Oracle is using this feedback. I have repeated this message to my Oracle contacts and with nearly all feedback I give to Oracle. I long to see what difference I make. Don't you?

Kudos to the teams involved in the feedback gathering operations. And BIG PROPS to Cheryl, Joey and Kristal (all of Oracle).