dbTalk Databases Forums  

UTF-8 databases

comp.databases.ibm-db2 comp.databases.ibm-db2


Discuss UTF-8 databases in the comp.databases.ibm-db2 forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Troels Arvin
 
Posts: n/a

Default UTF-8 databases - 09-10-2010 , 03:17 PM






Hello,

I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5
server.

For the first time, I set set it up using UTF-8, thinking that the time
has come for this with DB2.

But now I'm seeing strange situations like:

CREATE TABLE foo (x VARCHAR(5));
INSERT INTO foo VALUES('XXåXX');
SQL0433N Value "XXåXX" is too long. SQLSTATE=22001
INSERT INTO foo VALUES('XXåX')

SELECT * FROM foo

X
-----
XXåX

db2 "SELECT LENGTH(x) FROM foo"

1
-----------
5

This is absurd.

I have tried various combinations of CREATE DATABASE specifications
without luck.

Does DB2 still not support UTF-8 properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?

--
Troels

Reply With Quote
  #2  
Old   
Luuk
 
Posts: n/a

Default Re: UTF-8 databases - 09-10-2010 , 03:48 PM






Op 10-09-10 22:17, Troels Arvin schreef:
Quote:
Hello,

I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5
server.

For the first time, I set set it up using UTF-8, thinking that the time
has come for this with DB2.

But now I'm seeing strange situations like:

CREATE TABLE foo (x VARCHAR(5));
INSERT INTO foo VALUES('XXåXX');
SQL0433N Value "XXåXX" is too long. SQLSTATE=22001
INSERT INTO foo VALUES('XXåX')

SELECT * FROM foo

X
-----
XXåX

db2 "SELECT LENGTH(x) FROM foo"

1
-----------
5

This is absurd.

I have tried various combinations of CREATE DATABASE specifications
without luck.

Does DB2 still not support UTF-8 properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?

i think your question was answered here:
http://bytes.com/topic/db2/answers/6...multibyte-data



--
Luuk

Reply With Quote
  #3  
Old   
Luuk
 
Posts: n/a

Default Re: UTF-8 databases - 09-10-2010 , 03:59 PM



Op 10-09-10 22:48, Luuk schreef:
Quote:
Op 10-09-10 22:17, Troels Arvin schreef:
Hello,

I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5
server.

For the first time, I set set it up using UTF-8, thinking that the time
has come for this with DB2.

But now I'm seeing strange situations like:

CREATE TABLE foo (x VARCHAR(5));
INSERT INTO foo VALUES('XXåXX');
SQL0433N Value "XXåXX" is too long. SQLSTATE=22001
INSERT INTO foo VALUES('XXåX')

SELECT * FROM foo

X
-----
XXåX

db2 "SELECT LENGTH(x) FROM foo"

1
-----------
5

This is absurd.

I have tried various combinations of CREATE DATABASE specifications
without luck.

Does DB2 still not support UTF-8 properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?


i think your question was answered here:
http://bytes.com/topic/db2/answers/6...multibyte-data



C:\Program Files>db2 SELECT LENGTH('XXåXX') from foo

1
-----------
6

1 record(s) selected.


C:\Program Files>db2 SELECT LENGTH('XX€XX') from foo

1
-----------
7

1 record(s) selected.


C:\Program Files>

--
Luuk

Reply With Quote
  #4  
Old   
Tonkuma
 
Posts: n/a

Default Re: UTF-8 databases - 09-10-2010 , 07:38 PM



------------------------------ Commands Entered
------------------------------
SELECT LENGTH(x) AS "LENGTH(x)"
, CHARACTER_LENGTH(x , CODEUNITS16) AS "CHARACTER_LENGTH(x ,
CODEUNITS16)"
FROM (VALUES 'XX€XX') foo(x);
------------------------------------------------------------------------------

LENGTH(x) CHARACTER_LENGTH(x , CODEUNITS16)
----------- ---------------------------------
7 5

1 record(s) selected.

Reply With Quote
  #5  
Old   
Helmut Tessarek
 
Posts: n/a

Default Re: UTF-8 databases - 09-10-2010 , 07:47 PM



Hi Troels,

Quote:
Does DB2 still not support UTF-8 properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?
In a unicode database the length of a (VAR)CHAR does not represent the number
of characters, but the number of bytes.

In UTF8 a character can have up to 4 bytes. How should the database know which
characters you will insert. If the database calculated 4 times the characters
automatically, you would lose a lot of space when just inserting standard
characters.

The problem that you are seeing is that umlauts or other accented characters
are 2 bytes in unicode compared to the latin1 character sets where those
characters are only 1 byte long.

--
Helmut K. C. Tessarek
DB2 Performance and Development

/*
Thou shalt not follow the NULL pointer for chaos and madness
await thee at its end.
*/

Reply With Quote
  #6  
Old   
Henrik Loeser
 
Posts: n/a

Default Re: UTF-8 databases - 09-13-2010 , 06:33 AM



On Sep 10, 10:17*pm, Troels Arvin <tro... (AT) arvin (DOT) dk> wrote:
Quote:
Hello,

I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5
server.

For the first time, I set set it up using UTF-8, thinking that the time
has come for this with DB2.

But now I'm seeing strange situations like:

CREATE TABLE foo (x VARCHAR(5));
INSERT INTO foo VALUES('XXåXX');
* SQL0433N *Value "XXåXX" is too long. *SQLSTATE=22001
INSERT INTO foo VALUES('XXåX')

SELECT * FROM foo

X * *
-----
XXåX

db2 "SELECT LENGTH(x) FROM foo"

1 * * * * *
-----------
* * * * * 5

This is absurd.

I have tried various combinations of CREATE DATABASE specifications
without luck.

Does DB2 still not support UTF-8 properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?

--
Troels
What should work is to make the VARCHAR columns larger, but to use
check constraints:
db2 => create table t(id int, s varchar(20)
check(char_length(s,codeunits32)<11))
DB20000I The SQL command completed successfully.
db2 => insert into t values(1,'öüäöüä')
DB20000I The SQL command completed successfully.
db2 => insert into t values(2,'testtest')
DB20000I The SQL command completed successfully.
db2 => insert into t values(3,'testtesttest')
DB21034E The command was processed as an SQL statement because it was
not a
valid Command Line Processor command. During SQL processing it
returned:
SQL0545N The requested operation is not allowed because a row does
not
satisfy the check constraint "HLOESER.T.SQL100913132928480".
SQLSTATE=23513
db2 => select length(s),id from t

1 ID
----------- -----------
12 1
8 2

2 record(s) selected.

You have to use CODEUNITS32 for UTF-8 data.

Henrik

Reply With Quote
  #7  
Old   
Troels Arvin
 
Posts: n/a

Default Re: UTF-8 databases - 09-13-2010 , 01:26 PM



Henrik Loeser wrote:
Quote:
What should work is to make the VARCHAR columns larger, but to use check
constraints:
[...]

Thanks. Yes, but this would entail a large amount of code inspection and
query/DDL rewrites - which is not an option.

--
Troels

Reply With Quote
  #8  
Old   
Henrik Loeser
 
Posts: n/a

Default Re: UTF-8 databases - 09-14-2010 , 12:46 AM



On Sep 13, 8:26*pm, Troels Arvin <tro... (AT) arvin (DOT) dk> wrote:
Quote:
Henrik Loeser wrote:
What should work is to make the VARCHAR columns larger, but to use check
constraints:

[...]

Thanks. Yes, but this would entail a large amount of code inspection and
query/DDL rewrites - which is not an option.

--
Troels
With UTF-8/Unicode databases it's a trade-off between flexibility to
have the option to store different characters vs. the usually more
efficient local, but restricted codepages. It is your call.

Henrik

Reply With Quote
  #9  
Old   
Lennart
 
Posts: n/a

Default Re: UTF-8 databases - 10-15-2010 , 01:41 AM



On Sep 11, 2:47*am, Helmut Tessarek <tessa... (AT) evermeet (DOT) cx> wrote:
Quote:
Hi Troels,

Does DB2 still not supportUTF-8properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?

In a unicode database the length of a (VAR)CHAR does not represent the number
of characters, but the number of bytes.

Is it generally defined this way or is it a db2 definition? Anyhow,
I'm wondering whether it has been discussed to introduce a string type
that "hides" the additional space requirements? I.e. if I declare a
column as new_varchar(3) it would reserve up to 12 bytes.

I can understand the rational by to days construction, but in many
situations one have a database where the size of the data is small,
but there are a lot of standard domains (say street address for
example). These domains will require tweaking (multiply size and add
check constraints as Henrik proposed).


/Lennart

Reply With Quote
  #10  
Old   
Frederik Engelen
 
Posts: n/a

Default Re: UTF-8 databases - 10-15-2010 , 02:55 AM



On Oct 15, 8:41*am, Lennart <erik.lennart.jons... (AT) gmail (DOT) com> wrote:
Quote:
On Sep 11, 2:47*am, Helmut Tessarek <tessa... (AT) evermeet (DOT) cx> wrote:

Hi Troels,

Does DB2 still not supportUTF-8properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?

In a unicode database the length of a (VAR)CHAR does not represent the number
of characters, but the number of bytes.

Is it generally defined this way or is it a db2 definition? Anyhow,
I'm wondering whether it has been discussed to introduce a string type
that "hides" the additional space requirements? I.e. if I declare a
column as new_varchar(3) it would reserve up to 12 bytes.

I can understand the rational by to days construction, but in many
situations one have a database where the size of the data is small,
but there are a lot of standard domains (say street address for
example). These domains will require tweaking (multiply size and add
check constraints as Henrik proposed).

/Lennart
I agree with the above statement. It gives DB2 a bad reputation having
to explain this to developers over and over again. One can try the
vargraphic data type, but that will hugely increase the size of your
database.

--
Frederik Engelen

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.