Matching certain unicode characters with REGEXP

comp.databases.mysql
#1
Kai Schaetzl
02-12-2011, 01:41 PM

I'm trying to match certain Unicode code points in a SELECT query, but all
my attempts, and a lot of searching on the net, have failed. (Actually, I want
to replace something, and I want to use this SELECT to check whether the
replace operation worked correctly.)

I tried for instance

WHERE col REGEXP '\u004c'

which should find any occurrences of 'L', but it doesn't.

'\x4c' fails as well.

What's the correct syntax for MySQL?

The column is of type TEXT with collation utf8_roman_ci (or similar).

(Of course, I do not want to find 'L' this way. It's just a simplified
query to get a working syntax.)

Kai

#2
Peter H. Coffin
02-14-2011, 04:31 PM

Where on

http://dev.mysql.com/doc/refman/5.1/en/regexp.html

or

http://dev.mysql.com/doc/refman/5.1/...-matching.html

or

http://dev.mysql.com/doc/refman/5.1/...ng-syntax.html

is \u or \x discussed? Outside of the case on the last one where it says
explicitly 'For example, "\x" is just "x".'
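
To illustrate that last point, here is a small sketch of what the string parser does with these escapes. It assumes the default sql_mode (NO_BACKSLASH_ESCAPES not enabled); the column aliases are made up for the example.

```sql
-- MySQL's string-literal parser drops the backslash from escape
-- sequences it does not recognize, so REGEXP never sees '\x' or '\u'.
SELECT '\x4c' AS x_literal, '\u004c' AS u_literal;
-- x_literal = 'x4c', u_literal = 'u004c'

-- The pattern therefore searches for the literal text "x4c", not for 'L'.
SELECT 'L'   REGEXP '\x4c' AS matches_L,      -- 0
       'x4c' REGEXP '\x4c' AS matches_x4c;    -- 1
```

So the original query was effectively asking for rows that contain the text "u004c".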


--
Revenge is an integral part of forgiving and forgetting.
-- The BOFH

#3
Kai Schaetzl
02-15-2011, 04:58 AM

Peter H. Coffin wrote on Mon, 14 Feb 2011 16:31:54 -0600:

> Where on
>
> ...
>
> is \u or \x discussed?

\u and \x are standard syntax in various regular expression
implementations. I gave them as examples of the obvious things I tried.
That they are not supported is *exactly* the problem! How do I specify a
certain Unicode code point, either for matching or for insertion, if not
this way?

Kai
--
Conactive Internet Services, Berlin, Germany

#4
Peter H. Coffin
02-15-2011, 07:14 PM

On Tue, 15 Feb 2011 11:58:01 +0100, Kai Schaetzl wrote:
> Peter H. Coffin wrote on Mon, 14 Feb 2011 16:31:54 -0600:
>
>> Where on
>>
>> ..
>>
>> is \u or \x discussed?
>
> \u and \x are standard syntax in various regular expression
> implementations. I gave them as examples of the obvious things I tried.
> That they are not supported is *exactly* the problem!

It's an inconvenience. But at least the manual spells out exactly which
method of parsing is behind REGEXP, and details what is supported, so
there's not much point in being unhappy that something the manual never
suggested would work does not, in fact, work.

> How do I specify a
> certain Unicode code point either for matching or for insertion if not
> this way?

There are a couple of ways that would likely work. The most
straightforward way is to simply set the connection character set
correctly, and construct the pattern using the explicit characters that
you're looking for. If you're looking for 'L', send an L. If you're
looking for a 'þ', send a þ.
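
A minimal sketch of that first suggestion, assuming a table called my_table (a placeholder name) with the column col from the original question, and a client that really does send UTF-8. One caveat: the 5.1 manual warns that REGEXP works byte-wise and is not multi-byte safe, so a plain literal like this is the safer use; multi-byte characters inside bracket expressions may not behave as expected.

```sql
-- Tell the server that this connection sends and receives UTF-8.
SET NAMES utf8;

-- The pattern is the literal character itself; no escape syntax is needed.
SELECT col
FROM my_table          -- placeholder table name
WHERE col REGEXP 'þ';
```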

Another method would be to REGEXP on a HEX(my_col), which might be
entirely reasonable for small sets of data. It's a little ... suboptimal
on large datasets because it'll force a full table scan.
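
A sketch of the HEX() approach, again with the placeholder names my_table and col, searching for the UTF-8 bytes C2 A0 (the no-break space, U+00A0). HEX() emits two uppercase hex digits per byte, so the pattern below is anchored to whole bytes. A bare 'C2A0' could also match across a byte boundary: the bytes 0C 2A 05, for instance, produce the hex string '0C2A05', which contains 'C2A0'.

```sql
-- Match the byte pair C2 A0 only where it starts on a byte boundary:
-- any number of complete bytes (two hex digits each), then C2A0.
SELECT col
FROM my_table          -- placeholder table name
WHERE HEX(col) REGEXP '^([0-9A-F][0-9A-F])*C2A0';
```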

And while you're playing with those, remember that MySQL uses 0x
notation to express hex constants, and you might find the
CHAR(N,... [USING charset_name]) function pretty handy....
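
To make that hint concrete, a short sketch. The values passed to CHAR() are the bytes of the UTF-8 encoding, not the code point number: U+00A0 (no-break space) is encoded as C2 A0, while U+004C ('L') is the single byte 4C. USING utf8 is an assumption chosen to match the column's character set; without it CHAR() returns a binary string. The aliases are made up for the example.

```sql
-- Build characters from their UTF-8 bytes.
SELECT CHAR(0x4C USING utf8)        AS letter_l,    -- 'L'
       HEX(CHAR(0xC2A0 USING utf8)) AS nbsp_bytes;  -- 'C2A0'

-- Use the constructed character as (part of) a REGEXP pattern.
SELECT 'L' REGEXP CONCAT('^', CHAR(0x4C USING utf8), '$') AS matches_L;  -- 1
```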


#5
Kai Schaetzl
02-21-2011, 07:31 AM

Peter H. Coffin wrote on Tue, 15 Feb 2011 19:14:25 -0600:

> And while you're playing with those, remember that MySQL uses 0x
> notation to express hex constants, and you might find the
> CHAR(N,... [USING charset_name]) function pretty handy....

Thanks for this. The solution was to use something like

UPDATE table SET fr = REPLACE(fr, ' ?', CONCAT(CHAR(0xc2a0), '?'))
WHERE fr LIKE '%?%'

and

SELECT fr FROM table WHERE fr LIKE CONCAT('%',CHAR(0xc2a0),'?','%')

I didn't test REGEXP.
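
For completeness, an untested sketch of what a REGEXP version of the check might look like, assuming the same column fr and a placeholder table name my_table. Since '?' is a regex metacharacter, it is wrapped in a bracket expression here; and because REGEXP in MySQL 5.1 compares bytes, the two bytes C2 A0 are simply matched in sequence.

```sql
-- Find rows where a no-break space (UTF-8 bytes C2 A0) precedes a '?'.
SELECT fr
FROM my_table          -- placeholder table name
WHERE fr REGEXP CONCAT(CHAR(0xC2A0 USING utf8), '[?]');
```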

Kai
