BUG #1268: Two different Unicode chars are treated as equal in a query

comp.databases.postgresql.bugs


#1 PostgreSQL Bugs List - 09-23-2004, 09:48 PM

The following bug has been logged online:

Bug reference: 1268
Logged by: Kent Tong

Email address: kent (AT) cpttm (DOT) org.mo

PostgreSQL version: 7.4.5

Operating system: RedHat 9

Description: Two different Unicode chars are treated as equal in a
query

Details:

Steps:
1. Create a test database: "createdb -E Unicode -U postgres testdb".
2. Create a test table: "create table testtable (id varchar(100) primary
key);".
3. With JDBC, insert a record whose id contains a Unicode character: "insert into
testtable values(<a Unicode char whose code is 0x4e8c>);".
4. With JDBC, try to retrieve a record whose id contains a different Unicode
character: "select * from testtable where id=<a Unicode char whose code is 0x4e94>;".
It should not find any record, but it finds the record created in step 3.

Here is the JUnit test case:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;

import junit.framework.TestCase;

public class PgSQLTest extends TestCase {
    private Connection conn;

    protected void setUp() throws Exception {
        conn = makeConnection();
    }

    protected void tearDown() throws Exception {
        conn.close();
    }

    // Inserts U+4E8C, then queries for U+4E94; no row should match.
    public void testChinese() throws Exception {
        deleteAll();
        insertRow();
        PreparedStatement st =
            conn.prepareStatement("select * from testtable where id=?");
        try {
            st.setString(1, "\u4e94");
            ResultSet rs = st.executeQuery();
            assertFalse(rs.next());
        } finally {
            st.close();
        }
    }

    private void insertRow() throws SQLException {
        PreparedStatement st =
            conn.prepareStatement("insert into testtable values(?)");
        st.setString(1, "\u4e8c");
        st.executeUpdate();
        st.close();
    }

    private void deleteAll() throws SQLException {
        PreparedStatement st = conn.prepareStatement("delete from testtable");
        st.executeUpdate();
        st.close();
    }

    private Connection makeConnection()
            throws ClassNotFoundException, SQLException {
        Class.forName("org.postgresql.Driver");
        Properties properties = new Properties();
        properties.put("user", "postgres");
        properties.put("password", "");
        return DriverManager.getConnection(
            "jdbc:postgresql://localhost/testdb",
            properties);
    }
}



---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo (AT) postgresql (DOT) org)


#2 Tom Lane - 09-23-2004, 10:06 PM

"PostgreSQL Bugs List" <pgsql-bugs (AT) postgresql (DOT) org> writes:
> Description: Two different Unicode chars are treated as equal in a
> query

This would be a matter to take up with the maintainer of your locale
(which you didn't mention, but in any case it's a locale bug). We
just do what strcoll() tells us.

Note that it's possible this is a configuration error and not an
outright bug. Check to make sure that the locale you initdb'd
under is actually designed to work with UTF-8 data.

regards, tom lane




#3 Kent Tong - 09-23-2004, 10:51 PM

Tom Lane wrote:

> "PostgreSQL Bugs List" <pgsql-bugs (AT) postgresql (DOT) org> writes:
>> Description: Two different Unicode chars are treated as equal in a
>> query
>
> This would be a matter to take up with the maintainer of your locale
> (which you didn't mention, but in any case it's a locale bug). We
> just do what strcoll() tells us.

Thanks for the quick reply. The system locale is zh_TW.Big5. However,
I've tried setting it to "C" but the test case still fails.

In order to check if it's a locale bug, I've written a C program:

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main() {
    char *s1 = "\xe4\xba\x8c";
    char *s2 = "\xe4\xba\x94";
    setlocale(LC_ALL, "en.UTF-8");
    //setlocale(LC_ALL, "zh.Big5"); //doesn't make any difference
    printf("%d\n", strcoll(s1, s2));
    return 0;
}

I compiled it and ran it on that computer. It prints -1,
which suggests that strcoll is working.

> Note that it's possible this is a configuration error and not an
> outright bug. Check to make sure that the locale you initdb'd
> under is actually designed to work with UTF-8 data.

Does it matter? The encoding provided to initdb is just
a default for the databases to be created in the future.
When I used createdb, I did specify "-E unicode".

--
Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified
Manager of IT Dept, CPTTM
Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat




#4 Tom Lane - 09-23-2004, 11:33 PM

Kent Tong <kent (AT) cpttm (DOT) org.mo> writes:
> Does it matter? The encoding provided to initdb is just
> a default for the databases to be created in the future.

Yes it does, and you missed the point. I said *locale*, not *encoding*.
The LC_COLLATE and LC_CTYPE settings that prevail during initdb are
fixed and not alterable without re-initdb. (I agree that this sucks,
but that's how it is for now...)

Your test program doesn't prove a lot unless you are sure it's executing
under the same locale settings as the postmaster is running in.

regards, tom lane




#5 Kent Tong - 09-24-2004, 04:30 AM

Tom Lane wrote:
> Yes it does, and you missed the point. I said *locale*, not *encoding*.
> The LC_COLLATE and LC_CTYPE settings that prevail during initdb are
> fixed and not alterable without re-initdb. (I agree that this sucks,
> but that's how it is for now...)

You're right. After using:

initdb --locale zh_TW.utf8 /var/lib/pgsql/data

then it works fine!

Thanks again and sorry about any inconvenience.

--
Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified
Manager of IT Dept, CPTTM
Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat



