Wednesday, August 27, 2008

Character Sets

I have been tasked with presenting on the topic of converting the character set for our Banner system databases. We have been through the exercise several times, which has been quite an eye-opening experience in itself (The DataPump Index work-around, dbms_pclxutil.build_part_index, creating tablespaces). Essentially, the vendor has decided in all its god-like wisdom to force all customers to start using a multibyte Unicode character set that will support international characters (AL32UTF8) when they upgrade to the latest version 8. Yea, not very practical for us at a Big Ten school.

Anyway, I have taken this thing to heart and have been learning a lot about codepoints and various facts about character sets. We will be converting from WE8ISO8859P1 to AL32UTF8, a single-byte ISO 8859-1 West European somewhat basic (eg, simple) codepoint to the latest'n'greatest multibyte grandaddy of Universal character sets. The big issue is making sure that all the data stays the same, even if physically it is stored a little different. Along the way, I have had to come to terms with phrases such as "Lossy Data" and "Truncation Data". April Sims has been a great help in answering some questions and getting me to focus my searches. In the end, Metalink note 252352.1 laid out an example that finally opened my eyes for Lossy Data.

I was having a hard time understanding why one must go through WE8MSWIN1252 first to resolve lossy data. And what was meant by the notation in Metalink Note 260192.1:
if you have a WE8ISO8859P1 database and lossy then changing your WE8ISO8859P1 to WE8MSWIN1252 will most likly solve you lossy [sic]

Why? Well, the Euro example in note 252352.1 is great. To paraphrase, the Euro is stored as codepoint 128 in WE8ISO8859P1. In this case, that character set essentially has an empty hole there, no previous definitions for that codepoint. So if one were to attempt to select it in some kind of "native" WE8ISO8859P1 environment, there would be nothing. So how did it get there? It is passed in from a different character set. And if using that same character set on the client side, you can easily retrieve the correct symbol.

Since WE8WSWIN1252 is a superset of WE8ISO8859P1 (meaning it has all the same codepoints for existing symbols, and then some), and it is most likely that a client in the US inserted data with a US Microsoft character set, the codepoint will be mapped to a valid symbol. Since that codepoint is "valid" in WE8MSWIN1252, it can then be converted to any other character set that also supports that symbol.

Where this falls apart (the "likly" part) is if the client using a "foreign" character set inserts a symbol that is mapped to a completely different symbol (via the codepoint) in our target client character set. In that case, you are screwed. =) Unless you would want to convert that data manually for each case of incompatible character sets. Not I.

Truncation data is a little easier to grasp. Metalink note 119119.1 mentions the English pound sign (£):
The pound sign for example is indeed 163 ( A3 in hexadecimal) in the WE8ISO8859P1 and WE8MSWIN1252 charactersets, but in AL32UTF8 the pound sign is code 49827 (C2 A3 in hex).

In our case, the fix is handled by setting nls_length_semantics=CHAR (opposed to the default BYTE). We also found out through experimentation that nls_length_semantics is completely ignored for SYS and SYSTEM objects; in fact, the data dictionary is always with BYTE semantics. Interesting, eh?

Metalink Notes:
  • 260192.1 Changing WE8ISO8859P1/ WE8ISO8859P15 or WE8MSWIN1252 to (AL32)UTF8
  • 225938.1 Database Character Set Healthcheck
  • 252352.1 Euro Symbol Turns up as Upside-Down Questionmark
  • 279068.1 Usage of Character Set Scanner Utility (csscan) version 1.1
  • 137127.1 Character Sets, Code Pages, Fonts and the NLS_LANG Value
  • 241047.1 The Priority of NLS Parameters Explained
  • 158577.1 NLS_LANG Explained (How does Client-Server Character Conversion Work?)
  • 179133.1 The correct NLS_LANG in a Windows Environment
  • 119119.1 AL32UTF8 / UTF8 (Unicode) Database Character Set Implications

Oracle® Database Globalization Support Guide 10g Release 2 (10.2)


David Miller said...

Wow, I was just impressed at the volume of information from your post. I stumbled upon it by accident. Great stuff, even though I didn't understand all of it.

Charles Schultz said...

If there is anything that is unclear, let me know; the last thing I want to do is muddy the waters. =)

Glad you enjoyed it. I wish there was a way for me to post my powerpoint presentation, as it has some good examples as well.

Interesting stuff.

Anonymous said...

Wow, almost 7 years later I am stumbling on this in an Oracle, I am using NLS_CHARACTERSET=WE8ISO8859P1. The application team wants to be able to handle "Hawaiian" and other characters....what CHARACTERSET should I move to?