dbTalk Databases Forums  

using COLLATE in WHERE clause

comp.databases.mysql comp.databases.mysql


Discuss using COLLATE in WHERE clause in the comp.databases.mysql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
MacMax
 
Posts: n/a

Default using COLLATE in WHERE clause - 11-11-2010 , 07:03 AM






Hi to everybody,
i would like to find every record where the inserted text is not UTF-8,
it's possible?

The official MySQL manual doen'st report anything like this.
Thanks in advance.

Max

--
http://www.maxori.it
L'Arte è la capacità di fare cose di nessun valore
e di riuscire a venderle
F. Zappa

Reply With Quote
  #2  
Old   
Jerry Stuckle
 
Posts: n/a

Default Re: using COLLATE in WHERE clause - 11-11-2010 , 07:11 AM






On 11/11/2010 8:03 AM, MacMax wrote:
Quote:
Hi to everybody,
i would like to find every record where the inserted text is not UTF-8,
it's possible?

The official MySQL manual doen'st report anything like this.
Thanks in advance.

Max

How can you tell if the text is utf-8 or not?

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex (AT) attglobal (DOT) net
==================

Reply With Quote
  #3  
Old   
MacMax
 
Posts: n/a

Default Re: using COLLATE in WHERE clause - 11-11-2010 , 07:24 AM



Jerry Stuckle <jstucklex (AT) attglobal (DOT) net> wrote:

Quote:
How can you tell if the text is utf-8 or not?
Ok, my problem is this:

this is an old DB, and is full of a lot of text (UTF8, ASCII, cp1252,
WIN1252, etc, etc), i would like to extract only the records NOT UTF8.
Now i use a PHP function to filter the results with mb_detect_encode()
but is too slow, because there are thousands of records.
I don't know if is possible with SQL.

Thanks

Max

P.S. sorry for my english
--
http://www.maxori.it
L'Arte è la capacità di fare cose di nessun valore
e di riuscire a venderle
F. Zappa

Reply With Quote
  #4  
Old   
Axel Schwenke
 
Posts: n/a

Default Re: using COLLATE in WHERE clause - 11-11-2010 , 08:33 AM



macmax (AT) nospam (DOT) it (MacMax) wrote:
Quote:
Ok, my problem is this:

this is an old DB, and is full of a lot of text (UTF8, ASCII, cp1252,
WIN1252, etc, etc), i would like to extract only the records NOT UTF8.
OK.

Quote:
Now i use a PHP function to filter the results with mb_detect_encode()
but is too slow, because there are thousands of records.
I don't know if is possible with SQL.
There is no direct equivalent of mb_detect_encoding() in SQL.
I don't even believe something like that can be implemented in
a sensible way at all. The best you can get is a probability
that a string uses a certain encoding.

But if you know what *exactly* this function does, you might be
able to express it in SQL. There are some alternatives shown in
the comments here:

http://php.net/manual/en/function.mb...t-encoding.php

Still I doubt it would be any faster. The mb-* functions in PHP
are compiled C code. If those are slow, then a pure SQL
implementation will be even slower.


XL

Reply With Quote
  #5  
Old   
Jerry Stuckle
 
Posts: n/a

Default Re: using COLLATE in WHERE clause - 11-11-2010 , 10:14 AM



On 11/11/2010 8:24 AM, MacMax wrote:
Quote:
Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote:

How can you tell if the text is utf-8 or not?

Ok, my problem is this:

this is an old DB, and is full of a lot of text (UTF8, ASCII, cp1252,
WIN1252, etc, etc), i would like to extract only the records NOT UTF8.
Now i use a PHP function to filter the results with mb_detect_encode()
but is too slow, because there are thousands of records.
I don't know if is possible with SQL.

Thanks

Max

P.S. sorry for my english
If you check, mb_detect_encode() is only a guess - an educated guess,
but still only a guess. It cannot and does not pretend to be an
absolute determination of the charset. Such code would be impossible.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex (AT) attglobal (DOT) net
==================

Reply With Quote
  #6  
Old   
Peter H. Coffin
 
Posts: n/a

Default Re: using COLLATE in WHERE clause - 11-11-2010 , 10:55 AM



On Thu, 11 Nov 2010 14:24:41 +0100, MacMax wrote:
Quote:
Jerry Stuckle <jstucklex (AT) attglobal (DOT) net> wrote:

How can you tell if the text is utf-8 or not?

Ok, my problem is this:

this is an old DB, and is full of a lot of text (UTF8, ASCII, cp1252,
WIN1252, etc, etc), i would like to extract only the records NOT UTF8.
Now i use a PHP function to filter the results with mb_detect_encode()
but is too slow, because there are thousands of records.
I don't know if is possible with SQL.
That's the fast way do it, unfortuantely. Your other alternative is
essentially extract each value as hex/binary, try to iconv it into UTF-8
and if iconv complains that it can't, then it's not UTF-8 and you can
flag it for review. The problem is that (looking at the actual bits)
ASCII is cp1252 with one bit always 0, UTF-8 is cp1252 with some
*combinations* of character sequences being disallowed, and it's very
very difficult for a program that is not written to the specific context
of your data to tell the difference.

--
I hate mornings. I know they hate me back, too.
-- Joel Gluth

Reply With Quote
  #7  
Old   
MacMax
 
Posts: n/a

Default Re: using COLLATE in WHERE clause - 11-11-2010 , 11:18 AM



Peter H. Coffin <hellsop (AT) ninehells (DOT) com> wrote:

Quote:
On Thu, 11 Nov 2010 14:24:41 +0100, MacMax wrote:
Jerry Stuckle <jstucklex (AT) attglobal (DOT) net> wrote:

How can you tell if the text is utf-8 or not?

Ok, my problem is this:

this is an old DB, and is full of a lot of text (UTF8, ASCII, cp1252,
WIN1252, etc, etc), i would like to extract only the records NOT UTF8.
Now i use a PHP function to filter the results with mb_detect_encode()
but is too slow, because there are thousands of records.
I don't know if is possible with SQL.

That's the fast way do it, unfortuantely. Your other alternative is
essentially extract each value as hex/binary, try to iconv it into UTF-8
and if iconv complains that it can't, then it's not UTF-8 and you can
flag it for review. The problem is that (looking at the actual bits)
ASCII is cp1252 with one bit always 0, UTF-8 is cp1252 with some
*combinations* of character sequences being disallowed, and it's very
very difficult for a program that is not written to the specific context
of your data to tell the difference.
Thanks to everybody.
My final solution is a self-made php function to convert any record in
html entities.
Any record have a check button to add himself into an array with ID and
charset, in accordance with the charset the function apply the right
conversion.
I've taked this decision afetr you advices.
Long, very long job, but now everything it's ok.

Max
--
http://www.maxori.it
L'Arte è la capacità di fare cose di nessun valore
e di riuscire a venderle
F. Zappa

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.