Re: Character set conversion and database size

  • From: Mark Brinsmead <mark.brinsmead@xxxxxxxxx>
  • To: neil_chandler@xxxxxxxxxxx
  • Date: Fri, 14 Aug 2015 18:22:25 -0600

A few other considerations...


- Variable-length charactersets (e.g., AL32UTF8) can often avoid
excessive growth when most of your data comes from a US7ASCII characterset,
BUT...
- As already mentioned, unusual characters -- e.g., Chinese or
Japanese -- can take 2 to 4 bytes to store, and this can lead to
surprises.
- There is a performance penalty to be paid. Finding the 375th
character in a string with fixed-length characters is a simple array
access, but finding the 375th character with variable-length characters
requires you to read (and examine) ALL 374 preceding characters. In
extreme cases, where you do a lot of text processing, this can lead to an
explosive growth in CPU consumption and elapsed time.


- When using UTF at the database level, CLOB data is *always* stored as
fixed-length characters (AL16UTF16), precisely to avoid the explosive
growth in CPU consumption described above. As a result, your CLOB data
will ALWAYS double in size when you migrate to UTF. If you have a lot of
CLOB data, your database will grow regardless of whether you (try to) use
fixed-length or variable-length characters.
- When migrating, it is completely possible (and even likely) that the
data in your database will occasionally contain values outside of your
expected characterset. (This happens, for example, when clients "lie" to
the database, and say they are using the database characterset when really
they are not.) When this happens, you can get a lot of garbage when you
try to convert charactersets.
One of the worst cases I ever encountered was an application that was
storing 64-bit (binary!) IEEE floating point numbers in columns declared as
VARCHAR2(8). This works (at runtime) when all clients insist that they are
using the database characterset (and thus avoid characterset conversions in
the TNS layer), but it is an assured disaster when a characterset
conversion is performed.
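
If you want a rough way to spot this sort of trouble before you convert
(the table and column names below are made up, and this is only a sketch,
not a substitute for a proper characterset scan), DUMP() shows the bytes
Oracle has actually stored, a CONVERT() round trip flags values that will
not convert cleanly, and DBMS_LOB.GETLENGTH() gives a ballpark for the
CLOB doubling mentioned above:

  -- Inspect the raw bytes actually stored (format 1016 = hex plus characterset name)
  SELECT some_col, DUMP(some_col, 1016) AS raw_bytes
  FROM   some_table
  WHERE  ROWNUM <= 10;

  -- Rough check for values that do not round-trip cleanly from the current
  -- characterset (WE8MSWIN1252 here) through AL32UTF8 and back.
  -- Anything returned deserves a closer look before you convert.
  SELECT some_pk, some_col
  FROM   some_table
  WHERE  some_col <> CONVERT(CONVERT(some_col, 'AL32UTF8', 'WE8MSWIN1252'),
                             'WE8MSWIN1252', 'AL32UTF8');

  -- Ballpark for the CLOB growth: in an AL32UTF8 database CLOBs are stored
  -- as AL16UTF16, i.e. roughly 2 bytes per character.
  SELECT SUM(DBMS_LOB.GETLENGTH(some_clob)) * 2 AS approx_clob_bytes_after
  FROM   some_table_with_clobs;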

Overall, migrating to UTF is probably a good idea. As we move forward, we
will probably find ever more need to be able to store Chinese, Japanese, or
Korean characters. Even if we don't need to do it now, it is nice to know
our databases are ready for it when the time comes. Sadly, though, the
migration process is rarely anywhere near as simple as it ought to be.



On Fri, Aug 14, 2015 at 5:07 PM, Neil Chandler <neil_chandler@xxxxxxxxxxx>
wrote:

Peter,

AL32UTF8 stores characters in a variable number of bytes, between 1 and 4
depending upon the character being stored.

If all you ever store are standard English letters, the DB size should be
the same. However, there are a couple of considerations which could cause
problems, depending upon your setting of NLS_LENGTH_SEMANTICS:

The default is BYTE, meaning a VARCHAR2(1) has 1 byte available to store
the character. Storing characters which take more bytes can cause
errors. For example, WE8 character sets use 1 byte to store a British Pound
symbol "£" or a Euro "€", whereas AL32UTF8 takes 2 bytes for "£" and 3
bytes for "€". You will get a value-too-large error (ORA-12899) if you try
to store a "£" in a VARCHAR2(1 BYTE) column.

"Hello World!" will take 12 bytes in both WE8 and AL32UTF8
"Hello World£" will take 12 bytes in WE8 but 13 bytes in AL32UTF8

If you change NLS_LENGTH_SEMANTICS to CHAR (or create the tables
explicitly with the CHAR option on each column), the database will create
the underlying definition for the column to have 4 bytes per character
instead of 1, meaning a VARCHAR2(1) has 4 bytes available to it. This
cannot exceed the column maximums, so a VARCHAR2(1000) has 4000 bytes
available but a VARCHAR2(4000) only has 4000 bytes available to it.
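
For example (again just a sketch, assuming an AL32UTF8 database with the
standard 4000-byte VARCHAR2 limit):

  CREATE TABLE t_char_demo
  ( c1    VARCHAR2(1 CHAR)     -- 1 character, 4 bytes in the underlying definition
  , c1000 VARCHAR2(1000 CHAR)  -- 1000 characters, 4000 bytes
  , c4000 VARCHAR2(4000 CHAR)  -- still capped at 4000 bytes
  );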

There is the potential for performance problems here. If the number of
bytes in a table's definition exceeds the Oracle block size, Oracle will
not perform set-based operations efficiently. For example, a FORALL insert
will revert to a row-by-row, slow-by-slow insert. Having
NLS_LENGTH_SEMANTICS of CHAR means all of your VARCHAR2 columns will be
defined 4 times larger. A VARCHAR2(255 CHAR) has an underlying definition
of 1020 bytes. If you have 8 such columns in a table with a standard 8k
block size, Oracle will assume you will get chained rows due to the
potentially large row size and go row-by-row.
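
A rough way to spot the tables at risk (this sums the declared column
widths only, not the actual stored row lengths, and the schema name is
hypothetical):

  -- Tables whose *defined* maximum row width exceeds a nominal 8k block
  SELECT owner, table_name, SUM(data_length) AS defined_row_bytes
  FROM   dba_tab_columns
  WHERE  owner = 'MYAPP'        -- hypothetical schema
  GROUP  BY owner, table_name
  HAVING SUM(data_length) > 8192
  ORDER  BY defined_row_bytes DESC;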

You should look at the DATA_LENGTH, CHAR_LENGTH and CHAR_USED columns in
DBA_TAB_COLUMNS.
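
Something like this shows all three side by side (using the hypothetical
T_CHAR_DEMO table from the sketch above; CHAR_USED is 'C' for CHAR
semantics and 'B' for BYTE):

  SELECT column_name, data_length, char_length, char_used
  FROM   dba_tab_columns
  WHERE  table_name = 'T_CHAR_DEMO';

  -- COLUMN_NAME  DATA_LENGTH  CHAR_LENGTH  CHAR_USED
  -- C1                     4            1  C
  -- C1000               4000         1000  C
  -- C4000               4000         4000  C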

AL16UTF16 stores almost all characters in a fixed 2-byte format
(supplementary characters take 4 bytes), which is better to use if you
mainly store characters which have 3- and 4-byte representations in UTF8,
like simplified Chinese, but it will double the size of an English-based
database.

regards

Neil Chandler
*Hey, let’s be careful out there.*


Oracle 12.1.0.2 on Red Hat Linux.

IT management is pushing us to migrate our databases to the Unicode
character set (NLS_CHARACTERSET from WE8MSWIN1252 to AL32UTF8). One of my
internal customers is concerned that conversion to multibyte characters
will dramatically increase the size of the database. I know that for
VARCHAR2s Oracle only allocates storage for the actual characters stored in
the string, regardless of the maximum defined size of the column. Does the
same idea apply to a VARCHAR2 column whose contents can be encoded as
single bytes, or does "Hello world!" actually take twice as much space in
Unicode as it does in the default character set?

Thanks,
Peter


