dbTalk Databases Forums  

Two great Unicode resources

comp.databases.pick


#1
Kevin Powick
Two great Unicode resources - 11-22-2011, 04:55 PM

Two great on-line utilities from Richard Ishida (he works for the W3C)
for working with Unicode.


http://rishida.net/tools/conversion/
http://rishida.net/scripts/uniview/uniview.php

--
Kevin Powick

#2
Tony Gravagno
Re: Two great Unicode resources - 11-22-2011, 07:17 PM

Funny you should post this just now. I've been working on a Facebook
app to be used by people around the world, back-ended by D3. I need
to store a lot of data in hex notation rather than as native text
simply because D3 doesn't natively support Unicode. We've discussed
this in CDP before (google the group for "big5" and "japanese").
There are better solutions, and for my purposes now this is OK, but in
general it seems rather inelegant.

T


#3
pschellenbach
Re: Two great Unicode resources - 11-23-2011, 12:06 PM

Tony - why are you using hex instead of UTF-8 for storing Unicode on
D3? One of the "magic" things about UTF-8 and MultiValue is that UTF-8
never uses the byte values 0xF5 - 0xFF, so you can continue to use the
normal byte values for system delimiters. Just curious.

Thanks,

Peter Schellenbach
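
To illustrate the point about UTF-8 and the system delimiters, here is a minimal Python sketch (not code from anyone in this thread). It assumes the conventional MultiValue mark values, 0xFE for the attribute mark and 0xFD for the value mark, and the names pack_record, unpack_record, and has_reserved_bytes are made up for the example. Since well-formed UTF-8 never contains bytes 0xF5 through 0xFF, encoded field data should never be mistaken for a mark.

```python
# Conventional MultiValue system delimiters (assumed values; verify on your platform).
AM = b"\xfe"  # attribute mark
VM = b"\xfd"  # value mark


def pack_record(attributes):
    """Join a list of attributes (each a list of str values) into one byte string.

    Each value is encoded as UTF-8 before the marks are inserted.
    """
    return AM.join(
        VM.join(value.encode("utf-8") for value in values)
        for values in attributes
    )


def unpack_record(record):
    """Split on the marks first, then decode each piece as UTF-8."""
    return [
        [value.decode("utf-8") for value in attribute.split(VM)]
        for attribute in record.split(AM)
    ]


def has_reserved_bytes(data):
    """Return True if any byte is in 0xF5-0xFF, which valid UTF-8 never uses."""
    return any(b >= 0xF5 for b in data)
```

Splitting happens on raw bytes before decoding, so Thai, Vietnamese, or Chinese text in a value passes through untouched as long as nothing in the path re-encodes the bytes.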

#4
Ross Ferris
Re: Two great Unicode resources - 11-23-2011, 02:46 PM

Agree ... that is how we did it when we had to tackle Thai,
Vietnamese & Chinese for Visage. It was actually "easier" in some
respects than UV/NLS.

#5
Tony Gravagno
Re: Two great Unicode resources - 11-25-2011, 05:08 PM

Thanks for the feedback.

While I have experience speaking other languages, character handling
like this is admittedly not my area of expertise. I welcome input
from those with more experience in this area, and will be happy to
consider other mechanisms that work across databases, OSes, and
communications tools. As I encourage others here to say from time to
time, in this case I simply don't know enough on this topic to make
better decisions.

The specific hex mechanism was just a spur-of-the-moment choice to get the
data in and out, and it's easily changed. I'm only storing data in
the DBMS, not searching, sorting, or doing any other string
manipulation. Data exchanged with Facebook is URL-encoded UTF-8, and I
could as easily have stored it in that format, but I wanted to strip the
% signs. At some point I might even remove that "optimization" and
store exactly what's exchanged with FB.
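
For readers following along, here is a rough Python sketch of how URL-encoded UTF-8 relates to stored hex pairs. It is an illustration only, not the code described in this post, and the function names are hypothetical. One assumption is worth stating: simply deleting the % signs from a string like Jos%C3%A9 would leave JosC3A9, which cannot be decoded unambiguously, so the sketch converts every byte to a hex pair instead.

```python
from urllib.parse import quote, unquote_to_bytes


def wire_to_hex(url_encoded):
    """Convert a percent-encoded UTF-8 string into uppercase hex pairs."""
    # unquote_to_bytes resolves %XX escapes and leaves other characters as bytes.
    return unquote_to_bytes(url_encoded).hex().upper()


def hex_to_wire(hex_pairs):
    """Rebuild a percent-encoded string from stored hex pairs."""
    # safe="" asks quote() to escape everything except unreserved characters.
    return quote(bytes.fromhex(hex_pairs), safe="")


def hex_to_text(hex_pairs):
    """Decode stored hex pairs back into a Python str."""
    return bytes.fromhex(hex_pairs).decode("utf-8")
```

Under these assumptions the stored form costs two ASCII characters per byte, and the original wire format can be regenerated on the way out.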

My code has transports for QMClient, mv.NET, UniObjects, MVSP, and
SSH, and any one of these can be used to any DBMS simply by changing a
config file. That buys me a ton of versatility for business and
technical reasons.

Here is the key factor in my thinking. When moving data with
different communications tools, we occasionally see low-end or
high-end bytes getting lost because they are used internally as delimiters.
For example, if product X happens to use 0x02 and 0x03 as start/end
data markers, the data flow will be interrupted when a Chinese or
Arabic name includes 0A02 or 031C. mv.NET (maybe other products?) has
configuration settings that specifically remap 8-bit characters to 7-bit
to accommodate differences in how different OSes and databases handle
these things, but that doesn't help if we get low-end characters. I
don't want to code for underlying transports and OS differences, so
putting "pure ASCII" on the wire as hex pairs was a decision of
self-defense rather than technical elegance. If I start working with
some SOA that uses UTF-16, this code will all still work as-is. I
don't know if one could say the same for any other mechanism that puts
actual UTF characters on the wire. Again, it's just defensive coding.
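
The byte-collision concern can be checked with a small Python sketch. This is an illustration under stated assumptions, not code from this project: the 0x02/0x03 markers belong to the hypothetical "product X" above, and the function names are invented. It suggests the low bytes show up when the code points U+0A02 and U+031C travel as UTF-16, while UTF-8 encodes non-ASCII characters using only bytes 0x80 and above, and hex pairs use only the ASCII characters 0-9 and A-F.

```python
# Start/end markers of the hypothetical "product X" transport.
MARKERS = (0x02, 0x03)


def contains_marker_bytes(data, markers=MARKERS):
    """Return True if any marker byte value occurs in the byte string."""
    return any(b in markers for b in data)


def to_hex_pairs(text):
    """Encode text as UTF-8, then as uppercase ASCII hex pairs."""
    return text.encode("utf-8").hex().upper()


def from_hex_pairs(hex_pairs):
    """Reverse to_hex_pairs."""
    return bytes.fromhex(hex_pairs).decode("utf-8")


def collision_report(text):
    """Show which wire forms of the text would contain a marker byte."""
    return {
        "utf-16-be": contains_marker_bytes(text.encode("utf-16-be")),
        "utf-8": contains_marker_bytes(text.encode("utf-8")),
        "hex pairs": contains_marker_bytes(to_hex_pairs(text).encode("ascii")),
    }
```

Calling collision_report("\u0a02\u031c") flags only the UTF-16 form. Raw UTF-8 can still be damaged by a transport that strips or remaps 8-bit bytes, which is the case the hex pairs guard against.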

BTW, over a decade ago a serious engineering effort was undertaken to
make D3 natively support both Russian and Thai. I don't know the
status of that feature but just FYI, it's in there.

At some point I might move any project like the one I'm working on now
to a DBMS that has full support for NLS, UTF-16, or whatever seems to
be the best lowest common denominator for storage and communications.
An environment that natively supports Unicode may have advantages over
those that do not.

T
