Re: Problems with UTF8

From: "Martin Berger" <martin.a.berger@xxxxxxxxx>
To: gokulkumar.gopal@xxxxxxxxx
Date: Tue, 14 Oct 2008 08:01:31 +0200
Hi Gokul,
I'm afraid, yo missed one part. Let me explain the (possible)
character conversion chain (and how oracle acts on its part):
In most environments you have more than 2 involved modules. e.g.
PC <-> Terminal on DB-Server <-> sqlplus <-> Database
all of them might use different character sets, some of them try to do
a more or less correct conversion, others do not.

I do not know much about the conversion between PC and Terminal, but
let's face the part Terminal <-> sqlplus <-> DB:

Oracle supports different character sets. It also supports different
character sets on client (sqlplus) and server. in such a case one of
them tries to convert the characters. If the client can do it, he is
supposed to do so (for performance reasons), if he can't (for any
reason) the server process is doing so.
From Oracles point of view it's much easier if the client and DB have
the same character set setting. So NO conversion and NO formal check
(if the character is valid [1]) is done. It's often hard to find such
situations: let's assume there is a Terminal with ISO8859P1 character
set, sqlplus and DB has both AL32UTF8. you try to insert a Paragraph
(§ - UNICODE: 0xA7, UTF8: 0xC2A7) the terminal generates 0xA7. As
there is no conversion  between terminal and sqlplus, sqlplus just
takes this character and passes it to the DB. DB and sqlplus have same
character set, therefore no one converts here. If the statement is an
insert, you inserted a wrong character into the DB, and no one raised
an error (!). To make it worse, if you select from this table, it's
all the same way back, you will see a § on the terminal, and also that
way no one knows, something goes wrong.

To check if there is such a problem, you can use the dump() function.

In this particular case, you can run

SELECT dump('AíB',1016) from dual;

and get

DUMP('AíB',1016)
----------------------------------------------
Typ=1 Len=4 CharacterSet=AL32UTF8: 41,c3,ad,42


you see 4 bytes: A => 0x41, í => 0xC3AD, B => 0x 42

If you get any other results, it's time to investigate (in an AL32UTF8 DB!).

Now to your solution:

> We finally got this working using NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1 and 
> LANG=en_US.ISO8859-1

You still does not mention the part PC <-> Terminal. If you are using
PuTTY or some similar terminal, check its settings. I'm pretty sure it
is also ISO8859-1. :-)

I hope I could help you understanding this issue.

best regards,
 Martin

[1] in multibyte character sets, there are ranges which are NOT
matched to characters and therefore are invalid. e.g. there is NO
character 0x80: UNICODE 0x7F is matched to UTF8 0x7F, UNICODE 0x80 is
matched to UTF8 0xC280
You can use http://berx.at/tools/UNICODE.HTML to convert around a
little bit. it's not perfect (it accepts also false' codes) but it
only generates correct ones.

--
Martin Berger
http://berxblog.blogspot.com
--
//www.freelists.org/webpage/oracle-l
References:
- Problems with UTF8
  - From: gokulkumar gopal
Re: Problems with UTF8

Other related posts: