![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
I have an HTML document whose charset is utf8 I am attempting to strip all the HTML tags to create a text-only version which I am then inserting into a column whose collation is utf8_unicode_ci. This works for most documents, but I have one which consists mainly of chinese characters which causes the following error: Fatal Error: MySQL: Incorrect string value: '\xE6\xA8 \xE5\xA5\xBD...' If the input string is encoded in utf8 then why can't I write it to a utf8 column? How can the string contain invalid characters if they display correctly in the HTML document? |
#3
| |||
| |||
|
|
Hi Tony, "Tony Marston" <tony (AT) NOSPAM (DOT) demon.co.uk> wrote: I have an HTML document whose charset is utf8 I am attempting to strip all the HTML tags to create a text-only version which I am then inserting into a column whose collation is utf8_unicode_ci. This works for most documents, but I have one which consists mainly of chinese characters which causes the following error: Fatal Error: MySQL: Incorrect string value: '\xE6\xA8 \xE5\xA5\xBD...' If the input string is encoded in utf8 then why can't I write it to a utf8 column? How can the string contain invalid characters if they display correctly in the HTML document? MySQL does not support full Unicode, but only the 16-bit subset aka code points U+0000 to U+FFFF aka the Basic Multilingual Plane. Consequently MySQL rejects any UTF8 sequence representing a code point outside the BMP. It seems this is happening here. A significant number of chinese ideographs is not on the BMP but on plane 2. See http://en.wikipedia.org/wiki/Unicode_plane If you want to store Chinese in MySQL, one of those might be a better choice: http://dev.mysql.com/doc/refman/5.1/...sian-sets.html |
#4
| |||
| |||
|
|
"Axel Schwenke" <axel.schwenke (AT) gmx (DOT) de> wrote in message MySQL does not support full Unicode, but only the 16-bit subset aka code points U+0000 to U+FFFF aka the Basic Multilingual Plane. Consequently MySQL rejects any UTF8 sequence representing a code point outside the BMP. It seems this is happening here. How do I detect that the string contains invalid characters BEFORE I attempt to store it? The document charset is given as 'utf-8', so it is reasonable to assume that it contains valid utf-8 characters. |
|
BTW, my programing language is PHP. |
#5
| |||
| |||
|
|
"Tony Marston" <tony (AT) NOSPAM (DOT) demon.co.uk> wrote: "Axel Schwenke" <axel.schwenke (AT) gmx (DOT) de> wrote in message MySQL does not support full Unicode, but only the 16-bit subset aka code points U+0000 to U+FFFF aka the Basic Multilingual Plane. Consequently MySQL rejects any UTF8 sequence representing a code point outside the BMP. It seems this is happening here. How do I detect that the string contains invalid characters BEFORE I attempt to store it? The document charset is given as 'utf-8', so it is reasonable to assume that it contains valid utf-8 characters. This is not about the validity of utf-8 sequences (though those can be a problem too) but about the fact that utf-8 can encode code points from outside the BMP. Those are not supported by MySQL. see: http://en.wikipedia.org/wiki/UTF-8 Luckily (for you) any code point outside the BMP requires a 4-byte utf-8 sequence. Hence all shorter sequences are safe for MySQL. BTW, my programing language is PHP. I'm not too familiar with recent PHP development, but http://www.php.net/manual/en/functio...k-encoding.php looks promising. Also the first contributor note shows an alternative implementation to check for valid utf-8 sequences. This can easily be expanded to flag 4-byte sequences as error. XL |
#6
| |||
| |||
|
#7
| |||
| |||
|
|
This is very strange. When I try this update in my PHP program: UPDATE email_msg SET size=771, message_text='? ?:\n\n ????2010?11?25-26? ? ? ? 12?09-10? ? ? ?12?16-17? ?? ????:\n\n? ????2010???????2011????????????????????????? ????!\n\n????????????????????????????????????????? --? ? ?? ??!\n\n\n????????????:baoming_gz (AT) 126 (DOT) com??! ( ?????????!)??! \n\n\n\n?-?-?-?:0-2-0--3-9-9-2-6-2-7-5?(0)1-3-7-1-9-0-2-5-4-7-6 ??? ????\n\n?-?-?-?:baoming_gz (AT) 126 (DOT) com\n\n\n????????,?????"??"?"delete"?:tuidin01 (AT) 163 (DOT) com ??!', revised_date='2010-11-18 15:48:09', revised_user='AJM' WHERE email_id='159' it fails with the following error: "Incorrect string value: '\xA0\xA0 \xE6\x88\x91...' for column 'message_text' at row 1" When I try the SAME update through SQL-Front or phpMyAdmin it works! Why is this? |
#8
| |||
| |||
|
|
["Followup-To:" header set to comp.databases.mysql.] On Thu, 18 Nov 2010 16:54:47 -0000, Tony Marston wrote: This is very strange. When I try this update in my PHP program: UPDATE email_msg SET size=771, message_text='? ?:\n\n ????2010?11?25-26? ? ? ? 12?09-10? ? ? ?12?16-17? ?? ????:\n\n? ????2010???????2011????????????????????????? ????!\n\n????????????????????????????????????????? --? ? ?? ??!\n\n\n????????????:baoming_gz (AT) 126 (DOT) com??! ( ?????????!)??! \n\n\n\n?-?-?-?:0-2-0--3-9-9-2-6-2-7-5?(0)1-3-7-1-9-0-2-5-4-7-6 ??? ????\n\n?-?-?-?:baoming_gz (AT) 126 (DOT) com\n\n\n????????,?????"??"?"delete"?:tuidin01 (AT) 163 (DOT) com ??!', revised_date='2010-11-18 15:48:09', revised_user='AJM' WHERE email_id='159' it fails with the following error: "Incorrect string value: '\xA0\xA0 \xE6\x88\x91...' for column 'message_text' at row 1" When I try the SAME update through SQL-Front or phpMyAdmin it works! Why is this? If it were the SAME, it would work the same. MySQL gives this error when there's invalid UTF-8 byte sequences, like a continuation byte without starting byte, or a starting byte that is not followed by a continuation byte... Find out what the hex for what you're trying to stick in message_text and I bet it won't be what it should be. |
#9
| |||
| |||
|
|
"Peter H. Coffin" <hellsop (AT) ninehells (DOT) com> wrote in message news:slrniebqgp.1g0.hellsop (AT) abyss (DOT) ninehells.com... If it were the SAME, it would work the same. MySQL gives this error when there's invalid UTF-8 byte sequences, like a continuation byte without starting byte, or a starting byte that is not followed by a continuation byte... Find out what the hex for what you're trying to stick in message_text and I bet it won't be what it should be. The error message was reporting a problem with the hex value \xA0 (decimal 160) which represents ' ' or the non-breaking space. |
|
I discovered that instead of replacing ' ' with chr(160) that I needed to replace it with chr(194).chr(160). I don't now why the chr(194) is necessary, but it solves my problem. |
#10
| |||
| |||
|
|
I have tried that function, but it never finds any utf-8 character in that string which is greater than 3 bytes. Is it possible for any of these shorter strings to contain values that MySQL doesn't like? |
![]() |
| Thread Tools | |
| Display Modes | |
| |