
Re: [BUGS] Invalid EUC_JP char seq bug?

mailing.database.pgsql-bugs


Jean-Christian Imbeault
 

Re: [BUGS] Invalid EUC_JP char seq bug? - 07-01-2003, 09:44 PM






Tatsuo Ishii wrote:
Quote:
Since you did not show us exact query you send to PostgreSQL

I can't show the exact query because it is generated by PHP. I can
however show you the code that generates the query:


$words = $_GET["words"];
$sql = "select id from products where name like '$words'";
$conn = pg_connect("host=$DB_IP port=5432 dbname=$DB_NAME user=postgres");
$res = pg_query($conn, $sql);

The GET query string was:

words=%8f%ac%90%ec%96%be%93%fa%8d%81

I think that PHP does some internal translation of this before passing
it on though.

Quote:
I assume the query passed to PostgreSQL is:

select id from products where name like 'string';

Yes.

Quote:
where string is "0x8fac90ec96be93fa8d81".

That I don't know.
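
The bytes behind the GET parameter can be checked without PHP in the loop. A minimal sketch in Python (not part of the original exchange), assuming nothing more than standard percent-decoding is applied to the query string quoted above:

```python
from urllib.parse import unquote_to_bytes

# Query-string value exactly as reported in the post.
raw = "%8f%ac%90%ec%96%be%93%fa%8d%81"

# Percent-decode to raw bytes without assuming any character encoding.
decoded = unquote_to_bytes(raw)
print("0x" + decoded.hex())
```

This only shows what the URL itself carries; whether PHP or mbstring transforms the value before it reaches pg_query is a separate question that this sketch does not answer.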

Quote:
If the string is supposed to be an EUC_JP, it would be parsed as follows:

8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character
[snip ...]

Ah ... so it is not an EUC-JP string but an SJIS string. Postgres was
right. That answers my question. Thanks!
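
A quick way to double-check that conclusion is to try decoding the same bytes under both encodings. This is a rough sketch using Python's bundled codecs, which are not PostgreSQL's own validation code and may differ from it in edge cases; the helper name decodes_as is made up for this example:

```python
# The ten bytes from the query string, written as hex.
data = bytes.fromhex("8fac90ec96be93fa8d81")

def decodes_as(payload: bytes, encoding: str) -> bool:
    """Return True if payload decodes cleanly in the given encoding."""
    try:
        payload.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Compare the two candidate encodings discussed in the thread.
for enc in ("euc_jp", "shift_jis"):
    print(enc, decodes_as(data, enc))
```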

Quote:
PS I have also had the error pop up with this string:

search_words=%B7%F6%BA%7E
select id from products where name like '??~'
Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)


This is definitely a bad EUC_JP.

According to a PHP developer in my bug report
(http://bugs.php.net/bug.php?id=24309&edit=2):

"URL decoded byte sequence of 'search_words=%B7%F6%BA%7E' is
B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
believe encoding detection of mbstring works fine in this case.
B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
correct EUC-JP byte sequence."

I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
bug report to PHP and pointed this out. Hopefully the developer will see
that this sequence is incorrect EUC-JP and that PHP failed to detect this.

I *knew* there was nothing wrong with Postgres.

Thanks!

Jean-Christian Imbeault

PS I posted to HACKERS a few weeks ago about another bug (a real one)
in the EUC-JP translation having to do with the WAVE DASH. I'll repost
here on the BUGS list; could you let me know the status of that bug? Thanks!


Tatsuo Ishii
 

Re: [BUGS] Invalid EUC_JP char seq bug? - 07-02-2003, 05:01 AM






Quote:
search_words=%B7%F6%BA%7E
select id from products where name like '??~'
Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)


This is definitely a bad EUC_JP.

According to a PHP developer in my bug report
(http://bugs.php.net/bug.php?id=24309&edit=2):

"URL decoded byte sequence of 'search_words=%B7%F6%BA%7E' is
B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
believe encoding detection of mbstring works fine in this case.
B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
correct EUC-JP byte sequence."

I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
bug report to PHP and pointed this out. Hopefully the developer will see
that this sequence is incorrect EUC-JP and that PHP failed to detect this.

In the EUC_JP encoding there are some rules:

1) if the first byte is 0x8e, then the second byte is a JIS 0201 character
and should be greater than 0x7f

2) else if the first byte is 0x8f, then the second and third bytes are a JIS
0212 character and both should be greater than 0x7f

3) else if the first byte is greater than 0x7f, then the first and second
bytes are a JIS 0208 character and the second byte should also be greater
than 0x7f

4) else the byte is ASCII and should be equal to or less than 0x7f

Apparently:

B7F6: this is OK; we can apply rule #3
BA7E: this is not good, since it satisfies none of rules #1 to #4
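
The four rules above can be turned into a small checker. This is only a sketch of the simplified rules as stated in this post, treating a JIS 0208 character as two bytes in total; it is not PostgreSQL's actual validation code, and a stricter validator would also limit trailing bytes to the range 0xa1 to 0xfe. The function name first_invalid_euc_jp is made up for this example:

```python
from urllib.parse import unquote_to_bytes

def first_invalid_euc_jp(data: bytes):
    """Return the offset of the first sequence that breaks the rules, or None."""
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x8E:      # rule 1: JIS 0201, one trailing byte
            length = 2
        elif b == 0x8F:    # rule 2: JIS 0212, two trailing bytes
            length = 3
        elif b > 0x7F:     # rule 3: JIS 0208, one trailing byte
            length = 2
        else:              # rule 4: plain ASCII, no trailing bytes
            length = 1
        seq = data[i:i + length]
        # Reject a truncated sequence or any trailing byte not above 0x7f.
        if len(seq) < length or any(t <= 0x7F for t in seq[1:]):
            return i
        i += length
    return None

sample = unquote_to_bytes("%B7%F6%BA%7E")
bad = first_invalid_euc_jp(sample)
print(None if bad is None else sample[bad:bad + 2].hex())
```

Run against the search_words value, the checker accepts B7F6 and stops at the pair starting with BA, which is the same pair named in the PostgreSQL error message.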

Quote:
Thanks!

Jean-Christian Imbeault

PS I posted to HACKERS a few weeks ago about another bug (a real one)
in the EUC-JP translation having to do with the WAVE DASH. I'll repost
here on the BUGS list; could you let me know the status of that bug? Thanks!

Sorry for the delay. In EUC-JP <--> Unicode translation, WAVE DASH is
always a problem since there are several different mappings among
different vendors/standards. I think I need more time to solve this.
--
Tatsuo Ishii
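
To illustrate the mapping disagreement, the sketch below decodes the WAVE DASH position (JIS X 0208 row 1, cell 33) with three of Python's bundled codecs. It says nothing about PostgreSQL's own conversion tables; it only shows that different conversion tables can disagree on this one character. As far as I know, the JIS-based codecs map it to U+301C (WAVE DASH) while the Microsoft-derived cp932 maps the same position to U+FF5E (FULLWIDTH TILDE).

```python
euc_wave_dash = b"\xa1\xc1"    # EUC-JP bytes for JIS X 0208 row 1, cell 33
sjis_wave_dash = b"\x81\x60"   # the same JIS character in Shift_JIS form

# Decode the same character position with each codec and show the code point.
for codec, raw in (("euc_jp", euc_wave_dash),
                   ("shift_jis", sjis_wave_dash),
                   ("cp932", sjis_wave_dash)):
    ch = raw.decode(codec)
    print(f"{codec}: U+{ord(ch):04X}")
```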
