UNICODE and regex character classes

comp.databases.postgresql.novice


#1
David Norris
UNICODE and regex character classes - 08-04-2004, 05:10 PM

All,

I'm trying to create a regular expression that plays nice with UNICODE
strings. I'd like to allow any alphabetic or digit character (as
defined by its UNICODE category) in a username. I've already set the
database's encoding to UNICODE and it's working properly as far as
being able to store and retrieve proper multibyte strings.

So I tried this regular expression:
"^[[:alpha:]][[:alpha:][:digit:]_]{2,}$". But [:alpha:] only matches
"pure" ASCII alphabetics, and [:digit:] only matches ASCII '0' through
'9'. Is there another named class I can use for this, like
[:unicodealpha:]? If not, what's the best way to achieve this?

I wasn't able to find anything on Google, so I would really be
grateful for suggestions or links to websites that talk about this.

Thanks so much!
--
David Norris
danorris (AT) gmail (DOT) com

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings


#2
David Norris
Re: UNICODE and regex character classes - 08-04-2004, 08:24 PM

I should add that I know about locales and saw in the docs how the
character classes are affected by CTYPE. What I'm confused about is
what to do when you want *no* locale. In my regular expressions I'd
like to allow any character that someone might consider a letter in
their native language. Accented Latin characters in Western languages,
Cyrillic, Arabic script, etc. But not non-alphabetic characters like
punctuation, decorations and so on. My goal is for the database to be
as language-neutral as possible.

Thanks again,
--
David Norris
danorris (AT) gmail (DOT) com
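
For readers who want to see the rule described above in concrete form, here is a minimal application-side sketch in Python. It is not a PostgreSQL feature and not something proposed in this thread. It mirrors the pattern "^[[:alpha:]][[:alpha:][:digit:]_]{2,}$" using Python's built-in `str.isalpha()` (Unicode general categories Lu, Ll, Lt, Lm, Lo) and `str.isdecimal()` (category Nd), both of which are independent of the process locale. The function name `is_valid_username` is hypothetical.

```python
def is_valid_username(name: str) -> bool:
    """Locale-neutral version of ^[[:alpha:]][[:alpha:][:digit:]_]{2,}$.

    Assumption: "alphabetic" means any Unicode Letter (str.isalpha) and
    "digit" means any Unicode decimal digit (str.isdecimal).
    """
    # One leading letter plus at least two more characters.
    if len(name) < 3:
        return False
    # The first character must be a letter in some script.
    if not name[0].isalpha():
        return False
    # The rest may be letters, decimal digits, or an underscore.
    return all(ch.isalpha() or ch.isdecimal() or ch == "_" for ch in name[1:])


if __name__ == "__main__":
    for candidate in ["David", "Jos\u00e9", "\u0414\u043c\u0438\u0442\u0440\u0438\u0439", "9lives", "joe!"]:
        print(candidate, is_valid_username(candidate))
```

Accented Latin, Cyrillic, and Arabic letters pass this check, while punctuation and a leading digit are rejected. A check like this could run in the application before the value reaches the database.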

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html


#3
Tom Lane
Re: UNICODE and regex character classes - 08-04-2004, 09:39 PM

David Norris <danorris (AT) gmail (DOT) com> writes:
> So I tried this regular expression:
> "^[[:alpha:]][[:alpha:][:digit:]_]{2,}$". But [:alpha:] only matches
> "pure" ASCII alphabetics, and [:digit:] only matches ASCII '0' through
> '9'. Is there another named class I can use for this, like
> [:unicodealpha:]? If not, what's the best way to achieve this?

The regex character classes really ought to be encoding- and
locale-aware. Right now they are not, but possibly something
similar to what I recently did to the upper/lower/initcap functions
would work --- that is, rely on the <wctype.h> C library instead of
<ctype.h>. If you feel like working on this, the regex stubs are
in src/backend/regex/regc_locale.c, and the upper/lower change
is in src/backend/utils/adt/oracle_compat.c:
http://developer.postgresql.org/cvsw...ext& tr2=1.53

regards, tom lane
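
The difference between byte-oriented and wide-character classification can be illustrated by analogy in Python. This is only a sketch of the general idea, not PostgreSQL's code, and both function names are hypothetical. The `<ctype.h>` functions classify one `unsigned char`-sized value at a time, so a character that UTF-8 encodes as several bytes cannot be recognized as a letter. The `<wctype.h>` functions classify a whole wide character. Python's `bytes.isalpha()` (ASCII letters only) and `str.isalpha()` (Unicode letters) show the same contrast.

```python
def bytewise_is_alpha(text: str) -> bool:
    """ctype-style analogy: look at the UTF-8 bytes, ASCII letters only."""
    return text.encode("utf-8").isalpha()


def widechar_is_alpha(text: str) -> bool:
    """wctype-style analogy: classify whole code points."""
    return text.isalpha()


if __name__ == "__main__":
    e_acute = "\u00e9"  # encoded in UTF-8 as the two bytes 0xC3 0xA9
    print(bytewise_is_alpha("abc"), widechar_is_alpha("abc"))
    print(bytewise_is_alpha(e_acute), widechar_is_alpha(e_acute))
```

Neither byte of the accented character is an ASCII letter, so the byte-wise check rejects it, while the code-point check accepts it.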

#4
David Norris
Re: UNICODE and regex character classes - 08-05-2004, 11:49 AM

> The regex character classes really ought to be encoding- and
> locale-aware. Right now they are not, but possibly something
> similar to what I recently did to the upper/lower/initcap functions
> would work --- that is, rely on the <wctype.h> C library instead of
> <ctype.h>. If you feel like working on this, the regex stubs are
> in src/backend/regex/regc_locale.c, and the upper/lower change
> is in src/backend/utils/adt/oracle_compat.c:

I'd love to contribute if I can. But I'm afraid I'm no i18n (or pg
codebase) expert and would need some guidance. If you can answer a
couple questions to get me started I'll gladly see if I can get some
working code.

How does pg_wchar (= unsigned int) work... is it always going to be a
straight Unicode character? Can I safely cast it to a wchar_t, is
there a suitable conversion function in wchar.h, or will I need to
write one that looks at some encoding variables?

Suppose I successfully update the regex char classification functions.
How could I set up an encoding/locale so that the character classes in
regular expressions would be locale-neutral? Remember my end goal is
to have some character class that matches *any* character which
Unicode calls a Letter. I don't want locale-awareness so much as I
want locale-neutrality.

Thanks,
--
David Norris
danorris (AT) gmail (DOT) com
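
To make "any character which Unicode calls a Letter" concrete, the sketch below looks up the Unicode general category with Python's standard `unicodedata` module. That module reads the Unicode Character Database directly, so the result does not depend on any locale setting. It is an illustration only, and the helper name `is_unicode_letter` is hypothetical. It also shows one wrinkle worth knowing: a combining accent is a Mark, not a Letter, so decomposed text may need NFC normalization before it is checked.

```python
import unicodedata


def is_unicode_letter(ch: str) -> bool:
    """True if ch is in a Letter category (Lu, Ll, Lt, Lm, or Lo)."""
    return unicodedata.category(ch).startswith("L")


samples = {
    "\u00e9": "Latin small letter e with acute",
    "\u0416": "Cyrillic capital letter zhe",
    "\u0639": "Arabic letter ain",
    "\u0663": "Arabic-Indic digit three",
    "!": "exclamation mark",
    "\u0301": "combining acute accent",
}

if __name__ == "__main__":
    for ch, label in samples.items():
        print(f"U+{ord(ch):04X} {label}: "
              f"{unicodedata.category(ch)} letter={is_unicode_letter(ch)}")

# "e" followed by a combining accent is two code points;
# NFC normalization composes them into a single letter.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
```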
