
UTF-8 databases

comp.databases.ibm-db2


#11 Helmut Tessarek
Re: UTF-8 databases - 10-15-2010, 04:37 AM

Quote:
I agree with the above statement. It gives DB2 a bad reputation having
to explain this to developers over and over again. One can try the
vargraphic data type, but that will hugely increase the size of your
database.

OK, now I'm a little bit confused. Using a new datatype which masks the byte
size (in favor of all possible characters) or using VARGRAPHIC - what is the
difference?

In any case, we have to abide by the rules of computer science. No database
will fit a 5 byte character string in a 4 byte field. Don't get me wrong,
there might be databases which handle this differently, but the rules are the
same and always will be.

Just because other databases won't tell you that space is lost by using a
generalized datatype does not change the fact that space (and
performance) is lost.

I'm not sure how much detail I should go into with regard to programming, but I
guess most of you have extensive knowledge of programming and of compiler
and processor internals.

So please, if you have a valid idea how to fit a 5 byte character string into
a 4 byte field (which is just an example - it could also be a 20 byte string
into a 5 byte field), humor me. Don't mention compression, because we
already have that. We are not talking about compression; we are talking about
the low-level representation of data.
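
To make the 5-bytes-into-4 example concrete, here is a minimal sketch in Python. Python only stands in for the database: the byte-limited column is modelled as a plain length check on the UTF-8 encoding, and the helper names `utf8_len` and `fits_byte_limit` are made up for this illustration.

```python
# Illustration only: a byte-limited column is modelled as a length check
# on the UTF-8 encoding. Helper names are hypothetical.

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_byte_limit(text: str, max_bytes: int) -> bool:
    """True if the UTF-8 encoding of text fits into max_bytes bytes."""
    return utf8_len(text) <= max_bytes


word = "caf\u00e9"  # "cafe" with a precomposed e-acute (U+00E9)

print(len(word))                 # 4 code points
print(utf8_len(word))            # 5 bytes, the e-acute needs two
print(fits_byte_limit(word, 4))  # False: 5 bytes do not fit in 4
print(fits_byte_limit(word, 5))  # True
```

A developer who counts characters sees a 4-character word; a field that counts bytes sees 5 of them, which is the mismatch the thread keeps coming back to.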

I always dreamt about Unicode-capable CPUs, but nobody is listening to me...

--
Helmut K. C. Tessarek
DB2 Performance and Development

/*
Thou shalt not follow the NULL pointer for chaos and madness
await thee at its end.
*/

#12 Frederik Engelen
Re: UTF-8 databases - 10-15-2010, 06:42 AM

Hello Helmut,

I really understand your argument, I know where the limit comes from.
Do you want me to make a feature request to the Power team? ;-)

But from a user's point of view, our problem is not solved. Using
VARGRAPHIC can be problematic because it will double the size of your
database (assuming you only store strings), even if you restrict
yourself to ASCII characters 99% of the time. Perhaps compression can
bring this doubling back down to a reasonable amount; I have never
tested this.
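
The doubling can be sketched with a few lines of Python. This assumes that VARGRAPHIC stores two-byte code units (modelled here with Python's `utf-16-be` codec) and that VARCHAR stores UTF-8; the function name `storage_bytes` and the sample strings are invented for the illustration, and row overhead and compression are ignored.

```python
# Rough size comparison, assuming VARCHAR holds UTF-8 and VARGRAPHIC holds
# two-byte code units (modelled with the utf-16-be codec, no byte order mark).

def storage_bytes(text: str) -> dict:
    """Payload size in bytes under each assumed encoding."""
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-be")),
    }


samples = {
    "ascii": "Hello, DB2",          # 10 ASCII characters
    "latin": "caf\u00e9",           # one accented character
    "cjk": "\u65e5\u672c\u8a9e",    # three CJK characters
}

for name, text in samples.items():
    sizes = storage_bytes(text)
    print(name, sizes["utf8"], sizes["utf16"])
```

Under these assumptions pure ASCII text doubles (10 bytes versus 20), while the CJK sample is actually smaller in two-byte units (9 bytes versus 6), so how much the database grows depends on the data.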

I would also be very happy with basic functionality like this:

- the new varchar type reserves 4 times the space specified
- the length is checked on the specified size (duh...)
- all scalar functions (LENGTH, etc.) automatically apply the
CODEUNITS32 keyword on this data type

I think that should be about it. Someone will probably come up with
some other requirements, but I have made my point.
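
The three requirements above can be sketched as a small Python model. Everything here is hypothetical: `CharVarchar` is not a DB2 type, the factor of 4 relies on UTF-8 needing at most four bytes per code point, and the `unit` names merely mirror the OCTETS, CODEUNITS16 and CODEUNITS32 string units.

```python
class CharVarchar:
    """Hypothetical VARCHAR(n CHAR): n counts characters, not bytes."""

    MAX_UTF8_BYTES_PER_CHAR = 4  # UTF-8 never needs more per code point

    def __init__(self, max_chars: int) -> None:
        self.max_chars = max_chars
        # Requirement 1: reserve 4 times the space specified.
        self.reserved_bytes = max_chars * self.MAX_UTF8_BYTES_PER_CHAR

    def length(self, text: str, unit: str = "CODEUNITS32") -> int:
        """Requirement 3: lengths default to code points (CODEUNITS32)."""
        if unit == "OCTETS":
            return len(text.encode("utf-8"))
        if unit == "CODEUNITS16":
            return len(text.encode("utf-16-be")) // 2
        if unit == "CODEUNITS32":
            return len(text)
        raise ValueError(f"unknown string unit: {unit}")

    def store(self, text: str) -> bytes:
        """Requirement 2: check the length against the declared size."""
        if self.length(text) > self.max_chars:
            raise ValueError(f"value exceeds {self.max_chars} characters")
        return text.encode("utf-8")  # always fits in reserved_bytes


col = CharVarchar(4)
print(col.reserved_bytes)             # 16
print(len(col.store("caf\u00e9")))    # 5 bytes stored for 4 characters

try:
    col.store("hello")                # 5 bytes would fit, 5 characters do not
except ValueError as err:
    print("rejected:", err)
```

The last call shows the point of the second requirement: the value is refused because of its character count, even though it is far smaller than the reserved space.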

The downside would be that we would have to move to a bigger page size
sooner. This comes from the fact that in DB2 a row cannot span
more than one page, in contrast to other database systems. Usually
this is not a big deal, but it might be the reason those systems could already
implement a "VARCHAR(20 CHAR)" approach. I personally wouldn't mind
this limitation. With the 255 rows/page limit expanded, the chance
that this would waste space is also greatly reduced.
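
The page size effect is simple arithmetic, sketched below in Python. The model is deliberately crude: it treats the whole page as usable and ignores row and page overhead, so real limits are somewhat lower. The function names and the example table (three string columns of 500, 300 and 300 characters) are invented for the illustration.

```python
# Crude model: a row must fit in one page, and the whole page is treated as
# usable. Real limits are lower because of page and row overhead.

PAGE_SIZES = (4096, 8192, 16384, 32768)  # 4K, 8K, 16K and 32K pages


def reserved_row_bytes(column_chars, bytes_per_char):
    """Worst-case row width when every character reserves bytes_per_char."""
    return sum(n * bytes_per_char for n in column_chars)


def smallest_page_size(row_bytes):
    """Smallest page size that holds the row, or None if none does."""
    for size in PAGE_SIZES:
        if row_bytes <= size:
            return size
    return None


columns = [500, 300, 300]  # hypothetical table with three string columns

for factor in (1, 4):
    width = reserved_row_bytes(columns, factor)
    print(factor, width, smallest_page_size(width))
```

With one byte per character the 1100-byte row fits a 4K page; with four bytes reserved per character the same definition needs 4400 bytes and moves to an 8K page.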

What do you think?

Kind regards,

Frederik Engelen

#13 Lennart
Re: UTF-8 databases - 10-15-2010, 08:35 AM

On Oct 15, 11:37 am, Helmut Tessarek <tessa... (AT) evermeet (DOT) cx> wrote:
Quote:
I agree with the above statement. It gives DB2 a bad reputation having
to explain this to developers over and over again. One can try the
vargraphic data type, but that will hugely increase the size of your
database.

Ok, now I'm a little bit confused. Using a new datatype which masks the byte
size (in favor of all possible characters) or using VARGRAPHIC - what is the
difference?

Admittedly, I did not even consider using VARGRAPHIC. I assume
there are some things one should think about before choosing it. Does
anyone have a link to the docs where varchar and vargraphic are
compared, including which restrictions apply in each case, possible
gotchas, etc.?

/Lennart

[...]

#14 Helmut Tessarek
Re: UTF-8 databases - 10-16-2010, 12:00 AM

Hi Frederik,

On 15.10.10 1:42, Frederik Engelen wrote:
Quote:
I would also be very happy with basic functionality like this:

- the new varchar type reserves 4 times the space specified
- the length is checked on the specified size (duh...)
- all scalar functions (LENGTH, etc.) automatically apply the
CODEUNITS32 keyword on this data type

I'm on vacation right now, but when I'm back at work, I'll start a discussion
with the different component owners (although I can't promise that there'll be
a positive outcome :-))

What do you mean by 'the length is checked on the specified size'? Can you
give me some examples?

Cheers,
Helmut


#15 Frederik Engelen
Re: UTF-8 databases - 10-16-2010, 04:43 AM

Helmut,

First of all, enjoy your holiday.

About that sentence, it would probably have been clearer if I had just
left it out. I'm pretty sure you already got that part.

All I meant to say is that the content of the cell shouldn't be
allowed to be longer than whatever you specified at creation time,
even though there is extra space reserved.

Thanks for taking this seriously. I really appreciate that such a
channel to people from IBM exists here. Perhaps we can
organise a popularity poll? Other people might think this is a bad
idea.

--
Frederik Engelen
