dbTalk Databases Forums  

Cannot write utf8 data into a utf8 column

comp.databases.mysql comp.databases.mysql


Discuss Cannot write utf8 data into a utf8 column in the comp.databases.mysql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Tony Marston
 
Posts: n/a

Default Cannot write utf8 data into a utf8 column - 11-17-2010 , 11:34 AM






I have an HTML document whose charset is utf8 I am attempting
to strip all the HTML tags to create a text-only version which I am then
inserting into a column whose collation is utf8_unicode_ci. This works for
most documents, but I have one which consists mainly of chinese characters
which causes the following error:

Fatal Error: MySQL: Incorrect string value: '\xE6\xA8 \xE5\xA5\xBD...' for
column 'message_text' at row 1 (# 1366).
SQL query: UPDATE email_msg SET size=466,message_text='æ¨ å¥½ï¼\n\n
æå¸å°äº2010å¹´11æå¨ä¸ æµ·ãæ·± å³ãå 京 ç*å°ä¸¾åï¼\n\n\nãHRè§ç«
å¶åº¦ãåå·¥æåæ°åä¸è£åæå·§åæ
¸å¿å²ä½ç¹æ®æ¡æ¬¾çº¦å®ãå¬å¼è¯¾ï¼\n\nç±èµæ·±å³å¨æ³ä¸ åå·¥å³ç³»ç®¡çä¸å®¶ï¼é«çº§è®²å¸--éæ°¸æ££ä¸»è®²ï¼\n\n详ç»è¯¾ä»¶èµæè¯·åå¤é®ä»¶è³ï¼ba oming_gz (AT) 126 (DOT) com
ç´¢åï¼ ï¼ æ¥ä¿¡è¯·æ³¨æè¯¾ç¨åç§°ï¼ï¼è°¢è°¢ï¼
\n\n\nå¨-询-çµ-è¯ï¼0-2-0--3-9-9-2-6-2-7-5ã3-5-6-2-9-4-8-9 éåç
ãæ¾å°å§\n\næ¥-å-é®-ç®±ï¼baoming_gz (AT) 126 (DOT) com\n\n\nå¦ä¸éè¦æ*¤ç±»ä¿¡æ¯,请åé主é¢âç§»é¤âæâdel eteâå°ï¼tuidin01 (AT) 163 (DOT) com
谢谢!', revised_date='2010-11-16 10:34:24', revised_user='AJM' WHERE
email_id='156'
Error in line 1233 of file
'F:\Apache2\HTDOCS\radicore\includes\dml.mysqli.cl ass.inc'.
Host Info: localhost via TCP/IPServer Version: 5.1.52-communityClient
Version: 5.0.37, Character sets - client: utf8, - connection: utf8, -
database: utf8, - server: utf8

If the input string is encoded in utf8 then why can't I write it to a utf8
column? How can the string contain invalid characters if they display
correctly in the HTML document? Does anyone have any ideas?

--
Tony Marston

http://www.tonymarston.net
http://www.radicore.org

Reply With Quote
  #2  
Old   
Axel Schwenke
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - 11-17-2010 , 12:37 PM






Hi Tony,

"Tony Marston" <tony (AT) NOSPAM (DOT) demon.co.uk> wrote:

Quote:
I have an HTML document whose charset is utf8 I am attempting
to strip all the HTML tags to create a text-only version which I am then
inserting into a column whose collation is utf8_unicode_ci. This works for
most documents, but I have one which consists mainly of chinese characters
which causes the following error:

Fatal Error: MySQL: Incorrect string value: '\xE6\xA8 \xE5\xA5\xBD...'

If the input string is encoded in utf8 then why can't I write it to a utf8
column? How can the string contain invalid characters if they display
correctly in the HTML document?
MySQL does not support full Unicode, but only the 16-bit subset aka
code points U+0000 to U+FFFF aka the Basic Multilingual Plane.
Consequently MySQL rejects any UTF8 sequence representing a code point
outside the BMP. It seems this is happening here.

A significant number of chinese ideographs is not on the BMP but on
plane 2. See http://en.wikipedia.org/wiki/Unicode_plane

If you want to store Chinese in MySQL, one of those might be a better
choice: http://dev.mysql.com/doc/refman/5.1/...sian-sets.html


HTH, XL

Reply With Quote
  #3  
Old   
Tony Marston
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - 11-18-2010 , 03:47 AM



"Axel Schwenke" <axel.schwenke (AT) gmx (DOT) de> wrote

Quote:
Hi Tony,

"Tony Marston" <tony (AT) NOSPAM (DOT) demon.co.uk> wrote:

I have an HTML document whose charset is utf8 I am attempting
to strip all the HTML tags to create a text-only version which I am then
inserting into a column whose collation is utf8_unicode_ci. This works
for
most documents, but I have one which consists mainly of chinese
characters
which causes the following error:

Fatal Error: MySQL: Incorrect string value: '\xE6\xA8 \xE5\xA5\xBD...'

If the input string is encoded in utf8 then why can't I write it to a
utf8
column? How can the string contain invalid characters if they display
correctly in the HTML document?

MySQL does not support full Unicode, but only the 16-bit subset aka
code points U+0000 to U+FFFF aka the Basic Multilingual Plane.
Consequently MySQL rejects any UTF8 sequence representing a code point
outside the BMP. It seems this is happening here.

A significant number of chinese ideographs is not on the BMP but on
plane 2. See http://en.wikipedia.org/wiki/Unicode_plane

If you want to store Chinese in MySQL, one of those might be a better
choice: http://dev.mysql.com/doc/refman/5.1/...sian-sets.html

How do I detect that the string contains invalid characters BEFORE I attempt
to store it? The document charset is given as 'utf-8', so it is reasonable
to assume that it contains valid utf-8 characters.

BTW, my programing language is PHP.

--
Tony Marston
http://www.tonymarston.net
http://www.radicore.org

Reply With Quote
  #4  
Old   
Axel Schwenke
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - 11-18-2010 , 05:55 AM



"Tony Marston" <tony (AT) NOSPAM (DOT) demon.co.uk> wrote:
Quote:
"Axel Schwenke" <axel.schwenke (AT) gmx (DOT) de> wrote in message

MySQL does not support full Unicode, but only the 16-bit subset aka
code points U+0000 to U+FFFF aka the Basic Multilingual Plane.
Consequently MySQL rejects any UTF8 sequence representing a code point
outside the BMP. It seems this is happening here.

How do I detect that the string contains invalid characters BEFORE I attempt
to store it? The document charset is given as 'utf-8', so it is reasonable
to assume that it contains valid utf-8 characters.
This is not about the validity of utf-8 sequences (though those can be
a problem too) but about the fact that utf-8 can encode code points
from outside the BMP. Those are not supported by MySQL.

see: http://en.wikipedia.org/wiki/UTF-8

Luckily (for you) any code point outside the BMP requires a 4-byte
utf-8 sequence. Hence all shorter sequences are safe for MySQL.

Quote:
BTW, my programing language is PHP.
I'm not too familiar with recent PHP development, but

http://www.php.net/manual/en/functio...k-encoding.php

looks promising. Also the first contributor note shows an alternative
implementation to check for valid utf-8 sequences. This can easily be
expanded to flag 4-byte sequences as error.


XL

Reply With Quote
  #5  
Old   
Tony Marston
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - 11-18-2010 , 09:43 AM



I have tried that function, but it never finds any utf-8 character in that
string which is greater than 3 bytes. Is it possible for any of these
shorter strings to contain values that MySQL doesn't like? How would I
detect these 'invalid' values?

--
Tony Marston
http://www.tonymarston.net
http://www.radicore.org

"Axel Schwenke" <axel.schwenke (AT) gmx (DOT) de> wrote

Quote:
"Tony Marston" <tony (AT) NOSPAM (DOT) demon.co.uk> wrote:
"Axel Schwenke" <axel.schwenke (AT) gmx (DOT) de> wrote in message

MySQL does not support full Unicode, but only the 16-bit subset aka
code points U+0000 to U+FFFF aka the Basic Multilingual Plane.
Consequently MySQL rejects any UTF8 sequence representing a code point
outside the BMP. It seems this is happening here.

How do I detect that the string contains invalid characters BEFORE I
attempt
to store it? The document charset is given as 'utf-8', so it is
reasonable
to assume that it contains valid utf-8 characters.

This is not about the validity of utf-8 sequences (though those can be
a problem too) but about the fact that utf-8 can encode code points
from outside the BMP. Those are not supported by MySQL.

see: http://en.wikipedia.org/wiki/UTF-8

Luckily (for you) any code point outside the BMP requires a 4-byte
utf-8 sequence. Hence all shorter sequences are safe for MySQL.

BTW, my programing language is PHP.

I'm not too familiar with recent PHP development, but

http://www.php.net/manual/en/functio...k-encoding.php

looks promising. Also the first contributor note shows an alternative
implementation to check for valid utf-8 sequences. This can easily be
expanded to flag 4-byte sequences as error.


XL

Reply With Quote
  #6  
Old   
Tony Marston
 
Posts: n/a

Default Cannot write utf8 data into a utf8 column - 11-18-2010 , 10:54 AM



This is very strange. When I try this update in my PHP program:

UPDATE email_msg
SET size=771,
message_text='? ?:\n\n ????2010?11?25-26? ? ? ? 12?09-10? ? ? ?12?16-17?
?? ????:\n\n?
????2010???????2011????????????????????????? ????!\n\n????????????????????????????????????????? --?
? ?? ??!\n\n\n????????????:baoming_gz (AT) 126 (DOT) com??! ( ?????????!)??!
\n\n\n\n?-?-?-?:0-2-0--3-9-9-2-6-2-7-5?(0)1-3-7-1-9-0-2-5-4-7-6
??? ????\n\n?-?-?-?:baoming_gz (AT) 126 (DOT) com\n\n\n????????,?????"??"?"delete"?:tuidin01 (AT) 163 (DOT) com
??!',
revised_date='2010-11-18 15:48:09',
revised_user='AJM'
WHERE email_id='159'

it fails with the following error:

"Incorrect string value: '\xA0\xA0 \xE6\x88\x91...' for column
'message_text' at row 1"

When I try the SAME update through SQL-Front or phpMyAdmin it works! Why is
this?

--
Tony Marston
http://www.tonymarston.net
http://www.radicore.org

Reply With Quote
  #7  
Old   
Peter H. Coffin
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - 11-18-2010 , 09:25 PM



["Followup-To:" header set to comp.databases.mysql.]
On Thu, 18 Nov 2010 16:54:47 -0000, Tony Marston wrote:
Quote:
This is very strange. When I try this update in my PHP program:

UPDATE email_msg
SET size=771,
message_text='? ?:\n\n ????2010?11?25-26? ? ? ? 12?09-10? ? ? ?12?16-17?
?? ????:\n\n?
????2010???????2011????????????????????????? ????!\n\n????????????????????????????????????????? --?
? ?? ??!\n\n\n????????????:baoming_gz (AT) 126 (DOT) com??! ( ?????????!)??!
\n\n\n\n?-?-?-?:0-2-0--3-9-9-2-6-2-7-5?(0)1-3-7-1-9-0-2-5-4-7-6
??? ????\n\n?-?-?-?:baoming_gz (AT) 126 (DOT) com\n\n\n????????,?????"??"?"delete"?:tuidin01 (AT) 163 (DOT) com
??!',
revised_date='2010-11-18 15:48:09',
revised_user='AJM'
WHERE email_id='159'

it fails with the following error:

"Incorrect string value: '\xA0\xA0 \xE6\x88\x91...' for column
'message_text' at row 1"

When I try the SAME update through SQL-Front or phpMyAdmin it works! Why is
this?
If it were the SAME, it would work the same. MySQL gives this error when
there's invalid UTF-8 byte sequences, like a continuation byte without
starting byte, or a starting byte that is not followed by a continuation
byte... Find out what the hex for what you're trying to stick in
message_text and I bet it won't be what it should be.

--
69. All midwives will be banned from the realm. All babies will be
delivered at state-approved hospitals. Orphans will be placed in
foster-homes, not abandoned in the woods to be raised by creatures
of the wild. --Peter Anspach's Evil Overlord list

Reply With Quote
  #8  
Old   
Tony Marston
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - SOLVED - 11-19-2010 , 10:25 AM



"Peter H. Coffin" <hellsop (AT) ninehells (DOT) com> wrote

Quote:
["Followup-To:" header set to comp.databases.mysql.]
On Thu, 18 Nov 2010 16:54:47 -0000, Tony Marston wrote:
This is very strange. When I try this update in my PHP program:

UPDATE email_msg
SET size=771,
message_text='? ?:\n\n ????2010?11?25-26? ? ? ? 12?09-10? ? ?
?12?16-17?
?? ????:\n\n?
????2010???????2011?????????????????????????
????!\n\n????????????????????????????????????????? --?
? ?? ??!\n\n\n????????????:baoming_gz (AT) 126 (DOT) com??! ( ?????????!)??!
\n\n\n\n?-?-?-?:0-2-0--3-9-9-2-6-2-7-5?(0)1-3-7-1-9-0-2-5-4-7-6
???
????\n\n?-?-?-?:baoming_gz (AT) 126 (DOT) com\n\n\n????????,?????"??"?"delete"?:tuidin01 (AT) 163 (DOT) com
??!',
revised_date='2010-11-18 15:48:09',
revised_user='AJM'
WHERE email_id='159'

it fails with the following error:

"Incorrect string value: '\xA0\xA0 \xE6\x88\x91...' for column
'message_text' at row 1"

When I try the SAME update through SQL-Front or phpMyAdmin it works! Why
is
this?

If it were the SAME, it would work the same. MySQL gives this error when
there's invalid UTF-8 byte sequences, like a continuation byte without
starting byte, or a starting byte that is not followed by a continuation
byte... Find out what the hex for what you're trying to stick in
message_text and I bet it won't be what it should be.
The error message was reporting a problem with the hex value \xA0 (decimal
160) which represents '&nbsp;' or the non-breaking space. I discovered that
instead of replacing '&nbsp;' with chr(160) that I needed to replace it with
chr(194).chr(160). I don't now why the chr(194) is necessary, but it solves
my problem.

--
Tony Marston
http://www.tonymarston.net
http://www.radicore.org

Reply With Quote
  #9  
Old   
Michael Fesser
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - SOLVED - 11-19-2010 , 11:03 AM



..oO(Tony Marston)

Quote:
"Peter H. Coffin" <hellsop (AT) ninehells (DOT) com> wrote in message
news:slrniebqgp.1g0.hellsop (AT) abyss (DOT) ninehells.com...

If it were the SAME, it would work the same. MySQL gives this error when
there's invalid UTF-8 byte sequences, like a continuation byte without
starting byte, or a starting byte that is not followed by a continuation
byte... Find out what the hex for what you're trying to stick in
message_text and I bet it won't be what it should be.

The error message was reporting a problem with the hex value \xA0 (decimal
160) which represents '&nbsp;' or the non-breaking space.
Yes, but 0xA0 is no correct UTF-8 sequence. Only the 128 ASCII chars
(0x00-0x7F) can be written as single-byte sequences. All other chars
require 2 or more bytes.

Quote:
I discovered that
instead of replacing '&nbsp;' with chr(160) that I needed to replace it with
chr(194).chr(160). I don't now why the chr(194) is necessary, but it solves
my problem.
0xC2 0xA0 is the correct encoding of the non-breaking space as UTF-8:

http://www.fileformat.info/info/unic...r/a0/index.htm

If you do such replacements in your scripts, it might make sense to
ensure that all your scripts are stored in UTF-8 as well. Then you don't
have to manually create the correct byte sequences.

Micha

Reply With Quote
  #10  
Old   
Axel Schwenke
 
Posts: n/a

Default Re: Cannot write utf8 data into a utf8 column - 11-19-2010 , 12:03 PM



"Tony Marston" <tony (AT) NOSPAM (DOT) demon.co.uk> wrote:
Quote:
I have tried that function, but it never finds any utf-8 character in that
string which is greater than 3 bytes. Is it possible for any of these
shorter strings to contain values that MySQL doesn't like?
No. All 1, 2 or 3 byte utf-8 sequences map to characters in the BMP.

But according to your other post it seems the problem was a malformed
utf-8 sequence. 0xA0 cannot stand as first (or only) byte in such a
sequence. It can be second or third byte in a 2 or 3 byte sequence.
The Wipipedia page explains that quite nicely.

Also I wonder how such a string could pass the PHP function we
discussed. AFAIU it is the sole purpose of that function to check
all multibyte sequences for correctness.


XL

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.