#2
Hi everybody, I would like to find every record where the inserted text is not UTF-8. Is that possible? The official MySQL manual doesn't mention anything like this. Thanks in advance. Max |
#3
How can you tell if the text is utf-8 or not? |
#4
OK, my problem is this: it is an old DB, full of text in many encodings (UTF8, ASCII, cp1252, WIN1252, etc.), and I would like to extract only the records that are NOT UTF8. |
|
Right now I use a PHP function to filter the results with mb_detect_encoding(), but it is too slow because there are thousands of records. I don't know if this is possible with SQL. |
#5
Jerry Stuckle <jstucklex (AT) attglobal (DOT) net> wrote: How can you tell if the text is utf-8 or not? |
|
OK, my problem is this: it is an old DB, full of text in many encodings (UTF8, ASCII, cp1252, WIN1252, etc.), and I would like to extract only the records that are NOT UTF8. |
|
Right now I use a PHP function to filter the results with mb_detect_encoding(), but it is too slow because there are thousands of records. I don't know if this is possible with SQL. |
|
Thanks, Max |
|
P.S. Sorry for my English. |
#7
On Thu, 11 Nov 2010 14:24:41 +0100, MacMax wrote: Jerry Stuckle <jstucklex (AT) attglobal (DOT) net> wrote: How can you tell if the text is utf-8 or not? OK, my problem is this: it is an old DB, full of text in many encodings (UTF8, ASCII, cp1252, WIN1252, etc.), and I would like to extract only the records that are NOT UTF8. Right now I use a PHP function to filter the results with mb_detect_encoding(), but it is too slow because there are thousands of records. I don't know if this is possible with SQL. |
|
That's the fast way to do it, unfortunately. Your other alternative is essentially to extract each value as hex/binary, try to iconv it into UTF-8, and if iconv complains that it can't, then it's not UTF-8 and you can flag it for review. |
|
The problem is that (looking at the actual bits) ASCII is cp1252 with one bit always 0, UTF-8 is cp1252 with some *combinations* of character sequences being disallowed, and it's very, very difficult for a program that is not written for the specific context of your data to tell the difference. |
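For what it's worth, the "try to convert it and flag whatever fails" idea from the reply above can be sketched roughly like this. This is only a minimal sketch in Python rather than PHP or iconv, and it assumes you have already pulled each value out of the DB as raw bytes (for example by selecting the column through a binary cast). The names `is_valid_utf8`, `flag_non_utf8` and the `(row_id, raw_bytes)` row shape are made up for illustration, not anything from MySQL or from this thread.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def flag_non_utf8(rows):
    """Yield the ids of rows whose raw bytes are not valid UTF-8.

    `rows` is assumed to be an iterable of (row_id, raw_bytes) pairs;
    that shape is hypothetical and depends on how you fetch the data.
    """
    for row_id, raw in rows:
        if not is_valid_utf8(raw):
            yield row_id


# Hypothetical sample data standing in for rows fetched from the old DB.
sample = [
    (1, "caf\u00e9".encode("utf-8")),   # valid UTF-8
    (2, "caf\u00e9".encode("cp1252")),  # b'caf\xe9', not valid UTF-8
    (3, b"plain ascii"),                # pure ASCII is also valid UTF-8
]

print(list(flag_non_utf8(sample)))  # prints [2]
```

Keep in mind the caveat from the reply: this only tells you that the bytes are *not* valid UTF-8. Pure ASCII always passes, and some cp1252 byte sequences happen to be valid UTF-8 as well, so rows that pass are not guaranteed to have been written as UTF-8. Flagged rows still need a review by someone who knows the data.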