Re: Character set conversion and database size

From: Kenny Payton <k3nnyp@xxxxxxxxx>
To: peter.schauss@xxxxxxx
Date: Tue, 18 Aug 2015 19:44:40 -0400

I would suggest wading into this carefully, there are lots of considerations
and planning to do. A couple years back we converted 6 we8iso8859p1 databases
to utf8 that were each 20-25T in size at the time and were a nightmarish mix of
lossy and convertible data. We had to convert two in each maintenance window
and had a number of issues to deal with. I will say we deviated dramatically
from the supported approach and it’s not something I would advise to anyone if
it could be avoided. Here are a couple things to think about from my
experiences.

Read up on csscan and csalter
csscan detects problematic characters
csalter changes the characterset
Read up on the two categories of problematic characters
Lossy - characters that are not valid to your characterset. This data needs to
be fixed and/or deleted from your database prior to changing the character set.
These characters are unknown to your characterset. The code points do not
exist. Believe it or not, this is easier than you think. One thing I didn’t
think much about before this project was if you have a client with the same
characterset settings as your database the database does not need to do any
conversion and also doesn’t bother with any validation. If your application is
loading other characterset ( possibly unicode/utf8 ) data via this method your
database loads it on in.
We had a large amount of utf8 data in our database and left it in place. After
the conversion the data represented fine since we were converting to utf8. The
tricky part was tricking the alter into thinking the scan was clean.
I forget the option but there is an option during the scan to provide
guesstimates for your lossy data to give you an idea of what character sets
they could be part of.
We found the majority of our non-utf8 lossy data was we8mswin1252 characters.
We converted this data with an internal package we stumbled across
SYS_OP_CSCONV.
Convertible - these are characters that exist in the current characterset and
also exist in the new characterset but have different code points. This data
needs to be exported, deleted and later re-imported into the database after the
conversion has occurred.
Take a look at the Data Migration Assistant for Unicode ( DMU ) It was in its
infancy and pretty buggy when we were doing our migrations but it might be
better now.
http://www.oracle.com/technetwork/database/database-technologies/globalization/dmu/overview/index.html

<http://www.oracle.com/technetwork/database/database-technologies/globalization/dmu/overview/index.html>

We had to get pretty creative to meet our change windows with millions of rows
of lossy and convertible data and csalter's requirement to have a clean scan
prior to changing the characterset. We actually hacked this part. That was
not a fun part but seemed pretty much unavoidable with our constraints.
Ultimately leveraged a temporary physical standby, a fairly long and complex
series of steps and some inhouse developed perl processes for mass parallel
processing.

Hopefully your conversion is painless but if you run into some stumbling blocks
you can sometimes overcome them with a little creativity and dedication.

Good luck

Kenny

On Aug 14, 2015, at 3:08 PM, Schauss, Peter (ESS) <peter.schauss@xxxxxxx>
wrote:

Oracle 12.1.0.2 on Red Hat Linux.

IT management is pushing us to migrate our databases to the Unicode character
set ( NLS_CHARACTERSET= WE8MSWIN1252 to AL32UTF8). One of my internal
customers is concerned that conversion to multibyte characters will
dramatically increase the size of the database. I know that for VARCHAR2s
Oracle only allocates storage for the actual characters stored in the string
regardless maximum defined size of the column. Does the same idea apply a
VARCHAR2 column whose contents can be coded as single bytes or does "Hello
world!" actually take twice as much space in Unicode as it does in the
default character set?

Thanks,
Peter
--
//www.freelists.org/webpage/oracle-l

References:
- Character set conversion and database size
  - From: Schauss, Peter (ESS)

Re: Character set conversion and database size

Other related posts: