dbTalk Databases Forums  

Oracle Bug? RPAD of Japanese (kanji) character in Oracle 10gR2 UTF8database

comp.databases.oracle.tools comp.databases.oracle.tools


Discuss Oracle Bug? RPAD of Japanese (kanji) character in Oracle 10gR2 UTF8database in the comp.databases.oracle.tools forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Kevin Kirkpatrick
 
Posts: n/a

Default Oracle Bug? RPAD of Japanese (kanji) character in Oracle 10gR2 UTF8database - 01-16-2009 , 12:23 PM







I've searched metalink and not seen a mention, but want to make sure
I'm not missing anything obvious in calling this a bug. CHR(15121570)
is the UTF8 character point representing kanji character '$B4A(B'. It
seems like RPAD is not properly padding it out to a full 4 characters
in the example below:

SELECT RPAD(CHR(15121570),4,'*') FROM DUAL;
RESULT:
$B4A(B**

A LENGTH() function reveals that the RPAD is only creating a string
with 3 characters:

SELECT LENGTH(RPAD(CHR(15121570),4,'*')) FROM DUAL;
RESULT:
3

The same logic against a different multi-byte UTF8 character, the
Microsoft ellipse, shows the expected behavior:

SELECT RPAD(CHR(14844070),4,'*') FROM DUAL;
RESULT:
$B!D(B***

SELECT LENGTH(RPAD(CHR(14844070),4,'*')) FROM DUAL;
RESULT:
4

(note - i'm using '*' to make the RPAD functionality more visible; the
same behavior occurs with the default blank-space, eg RPAD(CHR
(15121570),4))

Verifying the UTF8 encoding:
select * from nls_database_parameters where parameter =
'NLS_CHARACTERSET';
RESULT:
NLS_CHARACTERSET UTF8


So, the question - does anyone see any obvious oversight on my part,
or should I consider this a "probable bug" in need of a TAR?


Reply With Quote
  #2  
Old   
Kevin Kirkpatrick
 
Posts: n/a

Default Re: Oracle Bug? RPAD of Japanese (kanji) character in Oracle 10gR2UTF8 database - 01-16-2009 , 12:33 PM






On Jan 16, 12:23 pm, Kevin Kirkpatrick <kvnkrkpt... (AT) gmail (DOT) com> wrote:
Quote:
I've searched metalink and not seen a mention, but want to make sure
I'm not missing anything obvious in calling this a bug. CHR(15121570)
is the UTF8 character point representing kanji character 'h'. It
seems like RPAD is not properly padding it out to a full 4 characters
in the example below:

SELECT RPAD(CHR(15121570),4,'*') FROM DUAL;
RESULT:
h**

A LENGTH() function reveals that the RPAD is only creating a string
with 3 characters:

SELECT LENGTH(RPAD(CHR(15121570),4,'*')) FROM DUAL;
RESULT:
3

The same logic against a different multi-byte UTF8 character, the
Microsoft ellipse, shows the expected behavior:

SELECT RPAD(CHR(14844070),4,'*') FROM DUAL;
RESULT:
...***

SELECT LENGTH(RPAD(CHR(14844070),4,'*')) FROM DUAL;
RESULT:
4

(note - i'm using '*' to make the RPAD functionality more visible; the
same behavior occurs with the default blank-space, eg RPAD(CHR
(15121570),4))

Verifying the UTF8 encoding:
select * from nls_database_parameters where parameter =
'NLS_CHARACTERSET';
RESULT:
NLS_CHARACTERSET UTF8

So, the question - does anyone see any obvious oversight on my part,
or should I consider this a "probable bug" in need of a TAR?
Nevermind, I think I've discovered the cause after using better search
terms in Metalink. From a metalink article:
LPad and Rpad count in "display units" wich means that a

select lengthb(rpad(bytestst ,10,'x')) lengthb,
lengthc(rpad(bytestst ,10,'x')) lengthc,
rpad(bytestst ,10,'x') from
(select '' bytestst from dual);

will result in 10 characters but 11 bytes ! (seen is in UTF8 2
bytes).

In complex scripts where one "character on the display" may be
composed by
several combined characters the difference (nr of chars vs bytes)
may be
even bigger.

With some scripts the number of characters returned may also varry:

For example most Asian characters are 2 "display units" wide, so the
Japanese
ס (wich is a FULL WIDTH (or zenkaku) character) will "count for 2"
in L/RPAD

select lengthc(rpad(japanesechar ,10,'x')) lengthc,
rpad(japanesechar ,10,'x') from (select UNISTR('\4F4F')
japanesechar from dual);

Lengthc will return 9 characters.

The thing is that the Japanese character ס takes 2 times the display
width
of a ascii character like x hence the RPAD will make the string 9
characters
in total (2 time display for one ס + 8 times display for 8 x = 10
display
units)

If you really want the number of characters then you can use
something like:

RPAD ( str , n - LENGTHC(str),'c')

Use LENGTHB, if the requested width is in bytes, LENGTHC, if in
characters

Or

SUBSTR( str || RPAD( 'c', n, 'c' ), 1, n )

Use SUBSTRB, if the requested width is in bytes, SUBSTR, if in
characters

In above the
* str is the string to be padded.
*'c' is the fill character (usually blank) -- we a assume single-
byte char
* n is the requested width in bytes or characters


Sorry for posting prematurely - hopefully the above info will be of
use to someone!


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.