dbTalk Databases Forums  

Strange Euro sign handling

comp.databases.oracle.misc comp.databases.oracle.misc


Discuss Strange Euro sign handling in the comp.databases.oracle.misc forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Franz Kruse
 
Posts: n/a

Default Strange Euro sign handling - 08-11-2012 , 09:49 AM






Dear Oracle experts,

maybe one of you can explain a phenomenon we observe:

We are using an Oracle 9i database on a Solaris server as part of a web
based system with both Perl and Java programs accessing the database.
The database uses the WEISO8859P1 (Latin-1) character set, in which the
Euro sign does not exist.

Nevertheless, the system does handle the Euro sign. We can insert Euro
signs through web forms and get them properly displayed in web pages
generated from database queries - as long as done through Perl programs.
When doing queries from Java, the situation is different:

- For VARCHAR fields, the ResultSet method getString() yields
0x0080 for a Euro sign (which is the Windows-1252 encoding).

- For CLOBs (CLOB object obtained by the OracleResultSet
method getCLOB()), the result depends on the kind of access:

- When using getAsciiStream() and creating an InputStreamReader
with encoding ISO-8859-1 from it, we get 0x00FD for a Euro sign.

- When using getCharacterStream(), we get 0xFFFD (which is the
Unicode representation of an invalid character).

What does the Oracle database actually do with the Euro signs? How can
the observed effects be explained?

Franz

Reply With Quote
  #2  
Old   
Peter J. Holzer
 
Posts: n/a

Default Re: Strange Euro sign handling - 08-12-2012 , 04:58 PM






On 2012-08-11 14:49, Franz Kruse <Franz.Kruse (AT) t-online (DOT) de> wrote:
Quote:
maybe one of you can explain a phenomenon we observe:

We are using an Oracle 9i database on a Solaris server as part of a web
based system with both Perl and Java programs accessing the database.
The database uses the WEISO8859P1 (Latin-1) character set, in which the
Euro sign does not exist.

Nevertheless, the system does handle the Euro sign. We can insert Euro
signs through web forms and get them properly displayed in web pages
generated from database queries - as long as done through Perl programs.
Let me guess:

1) The web forms have the content type "text/html; charset=iso-8859-1".

2) The Perl scripts have NLS_LANG set to WEISO8859P1 and just shuffle
byte strings between the database and the browser.

Quote:
When doing queries from Java, the situation is different:

- For VARCHAR fields, the ResultSet method getString() yields
0x0080 for a Euro sign (which is the Windows-1252 encoding).
For compatibility with Internet Explorer, most browsers assume
charset=windows-1252 when they encounter charset=iso-8859-1. Therefore,
when a user enters a Euro sign in a form, the browser will transmit it
as "%80" (instead of rejecting it). Similarly, when the browser reads
the byte 0x80 in a page, it will display a Euro sign (instead of an
"unknown character" symbol). So the Perl code will receive a character
0x80 from the form, store it in the database, retrieve it again, send
it to the browser and the browser will display a Euro sign. As long as
the Perl code doesn't try to do anything with that character it won't
notice that it isn't really a Euro sign.

That would probably work in Java, too, if you just passed on that 0x0080
character to the browser and claimed that the encoding is ISO-8859-1.
Since you noticed that it isn't working, you are probably doing
something different - maybe use utf8 encoding, maybe just some more
complex processing.



Quote:
- For CLOBs (CLOB object obtained by the OracleResultSet
method getCLOB()), the result depends on the kind of access:

- When using getAsciiStream() and creating an InputStreamReader
with encoding ISO-8859-1 from it, we get 0x00FD for a Euro sign.
I assume that 0x00FD is just 0xFFFD truncated to 8 bits.


Quote:
- When using getCharacterStream(), we get 0xFFFD (which is the
Unicode representation of an invalid character).
This is strange. It should also be 0x0080, it that is what is stored in
the database (did you check that? - use Perl or the SQL dump()
function). It is possible that getCharacterStream() decides that the
characters 0x80-0x9F in ISO-8859-1 are "invalid". It is also possible
that the database doesn't contain 0x80, because the Java code which
inserted the CLOB already decided that this isn't a valid character, but
that seems even more unlikely - for starters, what would it contain?
There is no way to convert 0xFFFD into ISO-8859-1.

hp


--
_ | Peter J. Holzer | Deprecating human carelessness and
Quote:
_|_) | Sysadmin WSR | ignorance has no successful track record.
| | hjp (AT) hjp (DOT) at |
__/ | http://www.hjp.at/ | -- Bill Code on asrg (AT) irtf (DOT) org

Reply With Quote
  #3  
Old   
Álvaro G. Vicario
 
Posts: n/a

Default Re: Strange Euro sign handling - 08-13-2012 , 02:24 AM



El 11/08/2012 16:49, Franz Kruse escribió/wrote:
Quote:
Dear Oracle experts,

maybe one of you can explain a phenomenon we observe:

We are using an Oracle 9i database on a Solaris server as part of a web
based system with both Perl and Java programs accessing the database.
The database uses the WEISO8859P1 (Latin-1) character set, in which the
Euro sign does not exist.

Nevertheless, the system does handle the Euro sign. We can insert Euro
signs through web forms and get them properly displayed in web pages
generated from database queries - as long as done through Perl programs.
When doing queries from Java, the situation is different:

- For VARCHAR fields, the ResultSet method getString() yields
0x0080 for a Euro sign (which is the Windows-1252 encoding).

- For CLOBs (CLOB object obtained by the OracleResultSet
method getCLOB()), the result depends on the kind of access:

- When using getAsciiStream() and creating an InputStreamReader
with encoding ISO-8859-1 from it, we get 0x00FD for a Euro sign.

- When using getCharacterStream(), we get 0xFFFD (which is the
Unicode representation of an invalid character).

What does the Oracle database actually do with the Euro signs? How can
the observed effects be explained?

Franz
It looks similar to a situation I've seen many times here in Spain.
Database is using Latin-1 but everyone assumes the default Windows
encoding which is Win-1252 in Western Europe. Application sends ¤ as
0x80 (Win-1252) and Oracle understands PAD (Latin-1). The exact opposite
happens when retrieving data and since ¤ gets displayed nobody notices
the error. This explains your VARCHAR behaviour.

Since you are also getting double byte characters there must be some
other encoding involved. No idea about that.

To inspect the exact column contents you can use DUMP():

SELECT DUMP('¤ÁÑ', 1016) FROM DUAL
-- Typ=96 Len=3 CharacterSet=WE8MSWIN1252: 80,c1,d1

SELECT DUMP(U'¤ÁÑ', 1016) FROM DUAL
-- Typ=96 Len=6 CharacterSet=AL16UTF16: 20,ac,0,c1,0,d1

http://download.oracle.com/docs/cd/B...htm#sthref1314


--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programación web: http://borrame.com
-- Mi web de humor satinado: http://www.demogracia.com
--

Reply With Quote
  #4  
Old   
Franz Kruse
 
Posts: n/a

Default Re: Strange Euro sign handling - 08-19-2012 , 11:06 AM



Peter J. Holzer:

Quote:
We are using an Oracle 9i database on a Solaris server as part of a web
based system with both Perl and Java programs accessing the database.
The database uses the WEISO8859P1 (Latin-1) character set, in which the
Euro sign does not exist.

Nevertheless, the system does handle the Euro sign. We can insert Euro
signs through web forms and get them properly displayed in web pages
generated from database queries - as long as done through Perl programs.

Let me guess:

1) The web forms have the content type "text/html; charset=iso-8859-1".

2) The Perl scripts have NLS_LANG set to WEISO8859P1 and just shuffle
byte strings between the database and the browser.

When doing queries from Java, the situation is different:

- For VARCHAR fields, the ResultSet method getString() yields
0x0080 for a Euro sign (which is the Windows-1252 encoding).

For compatibility with Internet Explorer, most browsers assume
charset=windows-1252 when they encounter charset=iso-8859-1. Therefore,
when a user enters a Euro sign in a form, the browser will transmit it
as "%80" (instead of rejecting it). Similarly, when the browser reads
the byte 0x80 in a page, it will display a Euro sign (instead of an
"unknown character" symbol). So the Perl code will receive a character
0x80 from the form, store it in the database, retrieve it again, send
it to the browser and the browser will display a Euro sign. As long as
the Perl code doesn't try to do anything with that character it won't
notice that it isn't really a Euro sign.
Good explanation.

Quote:
That would probably work in Java, too, if you just passed on that 0x0080
character to the browser and claimed that the encoding is ISO-8859-1.
Indeed, this simple case works in Java, too, as a test shows.

Quote:
Since you noticed that it isn't working, you are probably doing
something different - maybe use utf8 encoding, maybe just some more
complex processing.
The different behaviour shows only when CLOBs are involved, and we
retrieve the data only indirectly from the CLOBs through input streams
which we make into BufferedReaders on which we call readLine() to
actually get the data. Probably the Stream and Reader classes don't like
the 0x80 and declare it an invalid character:

Quote:
- For CLOBs (CLOB object obtained by the OracleResultSet
method getCLOB()), the result depends on the kind of access:

- When using getAsciiStream() and creating an InputStreamReader
with encoding ISO-8859-1 from it, we get 0x00FD for a Euro sign.

I assume that 0x00FD is just 0xFFFD truncated to 8 bits.
I had this idea, too, although I don't understand what that truncation
might come from.

Quote:
- When using getCharacterStream(), we get 0xFFFD (which is the
Unicode representation of an invalid character).

This is strange. It should also be 0x0080, it that is what is stored in
the database (did you check that? - use Perl or the SQL dump()
function).
The database contains 0x80 (entered directly through a Perl function).
So this behaviour seems logical, if the Reader returned by
getCharacterStream() rejects 0x80 as invalid.

Thanks for your help.

Franz

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.