Ross Ferris wrote:
Quote:
On Nov 24, 5:06 am, pschellenbach wrote:
Tony - why are you using hex instead of UTF-8 for storing Unicode on
D3? One of the "magic" things about UTF-8 and MultiValue is that UTF-8
never has 0xF5 - 0xFF data bytes, so you can continue to use normal
code points for system delimiters. Just curious.
Thanks,
Peter Schellenbach
Agree .... that is how we did it when we had to tackle Thai,
Vietnamese & Chinese for Visage .... was actually "easier" in some
respects than UV/NLS
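To illustrate the point about UTF-8 and the system delimiters, here is a minimal Python sketch. It assumes the conventional MultiValue mark values (0xFF down to 0xFB), and the helper names pack_record and unpack_record are made up for illustration; they are not part of any D3 or Visage API.

```python
# Conventional MultiValue system delimiter bytes (assumed here for illustration).
DELIMITERS = {
    0xFF: "segment mark",
    0xFE: "attribute mark",
    0xFD: "value mark",
    0xFC: "subvalue mark",
    0xFB: "text mark",
}
AM = b"\xfe"  # attribute mark


def delimiter_bytes_in(text):
    """Return the set of delimiter byte values found in the UTF-8 form of text."""
    return {b for b in text.encode("utf-8") if b in DELIMITERS}


def pack_record(fields):
    """Hypothetical helper: join UTF-8 encoded fields with attribute marks."""
    return AM.join(f.encode("utf-8") for f in fields)


def unpack_record(raw):
    """Hypothetical helper: split on attribute marks, then decode each field."""
    return [part.decode("utf-8") for part in raw.split(AM)]


samples = ["สวัสดี", "Tiếng Việt", "中文", "Nguyễn"]
for s in samples:
    # No UTF-8 byte of these strings collides with a system delimiter.
    assert not delimiter_bytes_in(s)

# The highest lead byte UTF-8 ever produces is 0xF4 (for U+10FFFF).
highest = max(chr(0x10FFFF).encode("utf-8"))

record = pack_record(samples)
restored = unpack_record(record)
```

Because the delimiter bytes never occur inside well-formed UTF-8, splitting the raw bytes on the marks before decoding is safe in this sketch.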
Thanks for the feedback.
While I have experience speaking other languages, character handling
like this is admittedly not my area of expertise. I welcome input
from those with more experience in this area, and will be happy to
consider other mechanisms that work across databases, OS's, and
communications tools. As I encourage others here to say from time to
time, in this case I simply don't know enough on this topic to make
better decisions.
The specific hex mechanism was just a choice of the minute to get the
data in and out, and it's easily changed. I'm only storing data in
the DBMS, not searching, sorting, or doing any other string
manipulation. Data exchanged with Facebook is URL-encoded UTF-8 and I
could as easily have stored it in that format, but wanted to strip the
% signs. At some point I might even remove that "optimization" and
store exactly what's transmitted with FB.
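For anyone following along, here is a rough Python sketch of that kind of hex storage. It is only a guess at the shape of the mechanism: it assumes every UTF-8 byte is written as a hex pair (simply deleting the % signs from URL-encoded text would be ambiguous, since plain ASCII letters are not escaped), and the function names are invented for the example.

```python
from urllib.parse import quote, unquote


def to_hex(text):
    """Encode text as UTF-8 and write every byte as an uppercase hex pair."""
    return text.encode("utf-8").hex().upper()


def from_hex(hex_text):
    """Reverse of to_hex: hex pairs back to bytes, then decode as UTF-8."""
    return bytes.fromhex(hex_text).decode("utf-8")


def url_to_hex(url_encoded):
    """Convert URL-encoded UTF-8 (as received) into the stored hex form."""
    return to_hex(unquote(url_encoded))


def hex_to_url(hex_text):
    """Convert the stored hex form back into URL-encoded UTF-8."""
    return quote(from_hex(hex_text), safe="")


stored = url_to_hex("J%C3%BCrgen")
sent_back = hex_to_url(stored)
```

The stored value contains only the characters 0-9 and A-F, at the cost of doubling the byte count compared with raw UTF-8.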
My code has transports for QMClient, mv.NET, UniObjects, MVSP, and
SSH, and any one of these can be used to any DBMS simply by changing a
config file. That buys me a ton of versatility for business and
technical reasons.
Here is the key factor in my thinking... When moving data with
different communications tools, we occasionally see low-end or
high-end bytes getting lost as they are internally used as delimiters.
So as an example, if product X happens to use x02 and x03 as start/end
data markers, the data flow will be interrupted when a Chinese or
Arabic name includes 0A02 or 031C. mv.NET (maybe other products?) has
configuration settings that specifically remap 8-bit characters to 7-bit
to accommodate differences in how different OS's and databases handle
these things, but that doesn't help if we get low-end characters. I
don't want to code for underlying transports and OS differences, so
putting "pure ASCII" on the wire as hex pairs was a decision of
self-defense rather than technical elegance. If I start working with
some SOA that uses UTF-16, this code will all still work as-is - I
don't know if one could say the same for any other mechanism that puts
actual UTF characters on the wire. Again, it's just defensive coding.
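The failure mode described above can be sketched in Python. "Product X" here is purely hypothetical: a transport that frames data with x02 and x03 and naively scans for the end marker. The code units 0A02 and 031C are taken as UTF-16 values, as in the example above.

```python
STX, ETX = 0x02, 0x03  # start/end markers of the hypothetical product X


def frame(payload):
    """Wrap payload bytes between the start and end markers."""
    return bytes([STX]) + payload + bytes([ETX])


def naive_unframe(framed):
    """Read from the first start marker up to the first end marker."""
    start = framed.index(STX) + 1
    end = framed.index(ETX, start)
    return framed[start:end]


# Two 16-bit code units whose bytes include 0x02 and 0x03.
raw = "\u0a02\u031c".encode("utf-16-be")  # bytes 0A 02 03 1C

# Raw bytes on the wire: the embedded 0x03 ends the frame early.
broken = naive_unframe(frame(raw))

# Hex pairs on the wire: only ASCII 0-9 and a-f, so no marker collision.
safe = raw.hex().encode("ascii")
received = bytes.fromhex(naive_unframe(frame(safe)).decode("ascii"))
```

In this sketch the raw payload comes back truncated while the hex-pair payload survives intact, which is the defensive trade-off being described: twice the bytes in exchange for not caring which control bytes a given transport reserves.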
BTW, over a decade ago a serious engineering effort was undertaken to
make D3 natively support both Russian and Thai. I don't know the
status of that feature but just FYI, it's in there.
At some point I might move any project like the one I'm working on now
to a DBMS that has full support for NLS, UTF-16, or whatever seems to
be the best lowest common denominator for storage and communications.
An environment that natively supports Unicode may have advantages over
those that do not.
T