[BUGS] Bug concerning regular expressions and UTF-8

Helmar Spangenberg
 
01-21-2006, 07:56 PM






Hello folks,

my system is a SuSE 10.0 Linux and a plain PostgreSQL 8.1.2 (compiled by
myself, NLS enabled). LOCALE is set to de_DE.UTF-8.

The bug shows up when using the operator '~*' with umlauts. An easy way to
produce a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*';

The result should be "TRUE"; however, Postgres returns "FALSE" (see also the
discussion on www.pg-forum.de, subject "Konfiguration", thread "Umlaute bei
Regular Expressions"). The problem does not seem to exist in Windows-based
installations.

It seems to me that this bug originates in the file
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and
pg_wc_toupper(pg_wchar) rely on the C functions toupper(unsigned char) and
tolower(unsigned char), which are definitely the wrong choice for UTF-8
characters beyond the ASCII range.

To check my assumption, I replaced the bodies of pg_wc_tolower and
pg_wc_toupper simply with "return towlower(c);" and "return towupper(c);",
which led to the correct result for
select 'XXXMÜLLERYyyy' ~* '.*müller.*';
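
The difference between the two approaches can be sketched outside of
PostgreSQL. This is only an illustration, not the code from
regc_locale.c: the names fold_narrow and fold_wide are made up for this
example, and the UCHAR_MAX guard in fold_narrow is an assumption about
how a byte-oriented implementation would typically protect its call to
tolower().

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical byte-oriented folding (illustration only): the code point
 * is handed to tolower(), which only knows single-byte characters, so
 * anything above UCHAR_MAX is returned unchanged.
 */
wint_t fold_narrow(wint_t c)
{
    assert(c != WEOF);              /* WEOF is not a character */
    if (c <= (wint_t) UCHAR_MAX)
        return (wint_t) tolower((unsigned char) c);
    return c;
}

/*
 * Wide-character folding as tried in the test above: towlower() works
 * on wide characters and consults the current LC_CTYPE locale.
 */
wint_t fold_wide(wint_t c)
{
    return towlower(c);
}
```

To try it, call setlocale(LC_CTYPE, "") first and then compare
fold_narrow(0x00DC) with fold_wide(0x00DC), 0x00DC being the code point
of 'Ü'. Under a UTF-8 locale such as de_DE.UTF-8 the wide variant is
expected to return 0x00FC ('ü'), while the narrow variant most likely
returns its input unchanged; the exact behaviour depends on the C
library and the locale in use.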

Since I have no idea what side effects this change might have, please
let me know as soon as an "official" patch is available - I definitely
need regular expressions that handle UTF-8 correctly...

Thanks,
Helmar Spangenberg
e-mail: hspangenberg (AT) frey (DOT) de

