dbTalk Databases Forums  

invalid byte sequence for encoding "UTF8"

comp.databases.postgresql comp.databases.postgresql


Discuss invalid byte sequence for encoding "UTF8" in the comp.databases.postgresql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Fuzzydave
 
Posts: n/a

Default invalid byte sequence for encoding "UTF8" - 03-21-2007 , 02:12 AM






Hi,

Please before I start remember that I am not a DBA (much as I
enjoyworking with PostgreSQL and learnign to admin it) I am a
developer who has had the job dropped on his desk as the person
with the most similar skills (and the only person willing to try and
learn it).

--The Background--
My company have bought a new Server which our Linux admin
has set up with Red Hat Enterprise Linux 4 and PostgreSQL
8.1.8(UTF8).
Our old server was the same OS but PostgreSQL 8.1.3(UTF8). I wrote
a backup script for old DB Server running PG 8.1.3(UTF8) which
preformed a pg_dumpall on the entire schema (to include logins,
functions etc) and an individual backup (pg_dump) of all the seperate
schemas within that database as a series of SQL statements, the
theory being we could just point the .sql files as psql command line
and let psql run the .sql files so that not only do we have a back up
but could point it at other db servers we have for a easy restore of
information to a new db (like our recently build development server
also running PostgreSQL 8.1.3(UTF8). When we tried the restore
from the current like server to the dev server both on 8.1.3(UTF8)
the restore worked flawlessly, however when we ran the restore to
populate our new server (current live 8.1.3(UTF8) to prospective new
live db server 8.1.8(UTF8)) it started throwing back the errors
listed
at the bottom of this email, I am aware however that before I joined
the company the live DB previous to our 8.1.3(UTF8) was 7.?.?(ASCII)
and while the databases encoding format was changed there was
nothing done to the data, though I beleived this wouldn't be an
issue as I though the base UTF8 encoded characters where the
same as the ASCII collection.

After a bit of googling and searching postgres.org I found that there
seemed to be a problem with escape characters which seemed to
return very similar errors to the ones I have been receiving. I
suggested using the iconv command to clean the files of the bad
data, and I did attempt this but with the smallest of the two
backup files being 22GB of data the command returned a file is to
big error. So I am left with 2 backup files one at 22GB and one
around 66GB (raw data from a flat file) that I need to restore
somehow.

I thought about dumping the DB in binary but wasn't sure if the
problem would persist in the format as it did in .sql format, I am
looking for any help or pointers in the right direction.

Sorry if i have went on a bit but I thought it best to give all the
information I can on it.

*confused*
David P

Cromwell Tools Uk

--- Companys Main Application ---

--- Restore Started At Tue Mar 20 16:11:40 GMT 2007 : ---

SET
SET
SET
SET
ALTER TABLE

psql:/mnt/tmp/app/application_data.sql:17: ERROR: duplicate key
violates unique constraint "a_table_key"
CONTEXT: COPY a_table, line 1: "2006-06-01 00:54:49.470988+01"
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
psql:/mnt/tmp/app/application_data.sql:97425: ERROR: invalid byte
sequence for encoding "UTF8": 0xff
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY another_table, line 459
ALTER TABLE
ALTER TABLE
psql:/mnt/tmp/app/application_data.sql:145410: ERROR: invalid byte
sequence for encoding "UTF8": 0xa1
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY more_tables, line 5111
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
psql:/mnt/tmp/app/application_data.sql:419469: ERROR: invalid byte
sequence for encoding "UTF8": 0xac
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY still_more_tables, line 74334
ALTER TABLE
ALTER TABLE


Reply With Quote
  #2  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: invalid byte sequence for encoding "UTF8" - 03-21-2007 , 05:11 AM






Fuzzydave <dav.phillips (AT) ntlworld (DOT) com> wrote:
Quote:
My company have bought a new Server which our Linux admin
has set up with Red Hat Enterprise Linux 4 and PostgreSQL
8.1.8(UTF8).
Our old server was the same OS but PostgreSQL 8.1.3(UTF8). I wrote
a backup script for old DB Server running PG 8.1.3(UTF8) which
preformed a pg_dumpall on the entire schema (to include logins,
functions etc) and an individual backup (pg_dump) of all the seperate
schemas within that database as a series of SQL statements
[...]
When we tried the restore
from the current like server to the dev server both on 8.1.3(UTF8)
the restore worked flawlessly, however when we ran the restore to
populate our new server (current live 8.1.3(UTF8) to prospective new
live db server 8.1.8(UTF8)) it started throwing back the errors
listed at the bottom of this email
[...]

There are incorrect bytes in your database.

I am not sure why the restore works in 8.1.3, but perhaps it has to do
with the following remark in the release nots of 8.1.4:

# Change the server to reject invalidly-encoded multibyte characters
in all cases (Tatsuo, Tom)

While PostgreSQL has been moving in this direction for some time,
the checks are now applied uniformly to all encodings and all textual
input, and are now always errors not merely warnings. This change defends
against SQL-injection attacks of the type described in CVE-2006-2313.

Quote:
I suggested using the iconv command to clean the files of the bad
data, and I did attempt this but [...]
[...]

I think that iconv is the wrong tool anyway, because it converts
(correct) data from one encoding to another and will fail if it encounters
incorrect data.

Quote:
I thought about dumping the DB in binary but wasn't sure if the
problem would persist in the format as it did in .sql format, I am
looking for any help or pointers in the right direction.
[...]

Using the binary format will probably not help.

I think that, rather than attempting to somehow get the dumps into the
new database, you should try to fix the bad data.

Do you get many such error messages or only a handful?
In the latter case, maybe the easiest solution is to identify the
rows at fault and manually change the data in the database before
dumping it.

If there are too many corrupt rows, maybe it would be better to
find a way to automatically fix the dump.

Do you know which characters should be there in place of the invalid
bytes 0xFF, 0xA1, and 0xAC?

If yes, you could use a command like

tr '\377\241\254' 'xyz' <baddump >gooddump

to translate them into x, y, and z respectively (provided that x, y,
and z are 7 bit ASCII characters).

Quote:
psql:/mnt/tmp/app/application_data.sql:17: ERROR: duplicate key
violates unique constraint "a_table_key"
CONTEXT: COPY a_table, line 1: "2006-06-01 00:54:49.470988+01"
But this error is mysterious. Are the keys identical? Why does 8.1.3
not fail on this?

Yours,
Laurenz Albe


Reply With Quote
  #3  
Old   
Fuzzydave
 
Posts: n/a

Default Re: invalid byte sequence for encoding "UTF8" - 03-22-2007 , 03:25 AM



On Mar 21, 11:11 am, Laurenz Albe <inv... (AT) spam (DOT) to.invalid> wrote:
Quote:
There are incorrect bytes in your database.
I thought that might be the case, when I reilised that we where
running
8.1.3 and the security patch for 8.1.4 came out in march last year
after
our last DBA left it hadn't had the security patch applied (due to the
fact
none of us left knew enough about postgres).

Quote:
I am not sure why the restore works in 8.1.3, but perhaps it has to do
with the following remark in the release nots of 8.1.4:
# Change the server to reject invalidly-encoded multibyte characters
in all cases (Tatsuo, Tom)
While PostgreSQL has been moving in this direction for some time,
the checks are now applied uniformly to all encodings and all textual
input, and are now always errors not merely warnings. This change defends
against SQL-injection attacks of the type described in CVE-2006-2313.
Yeah, I found that but wasn't sure how to deal with the data that
already
had improperly encoded bytes in it, I read through it and am still a
little
uncertian of how the second byte in the string gets improperly
encoded
though it seems to be something to do with the backslash being
delimiting
quotes as it then means you can trail the quote with instructions to
allow
SQL Injection to occur, so I assume that when this is done but not
with
any injection it enters a blank space or a character which gives a
bad
byte (I might be way off as I don't know much about unicode and am
waiting for a book to come through).

Quote:
I think that iconv is the wrong tool anyway, because it converts
(correct) data from one encoding to another and will fail if it encounters
incorrect data.
It wouldn't operate on the file anyway which is just as well if its
the wrong
tool I guess

Quote:
I think that, rather than attempting to somehow get the dumps into the
new database, you should try to fix the bad data.
Well we need import the data from flat files written in cobol (a
legacy
system) and we need to set this up anyway and test it for the new
server
the problem being I can identify what the characters are that are
throwing
off the db I know where the souce of the corrupt data is and where it
ends
up but due to the way the businuss system work I can indentify the
place
the error occurs. I guess if i can't identify it I'll just need to try
trimming the
bad bytes off the start of the string and see if something breaks

Quote:
Do you get many such error messages or only a handful?
In the latter case, maybe the easiest solution is to identify the
rows at fault and manually change the data in the database before
dumping it.
there about 2 dozen rows which isn't many but I can't identify what
character unicode is trying to add in as it returns <?> the question
mark in a black triangle and depending what interface I am using
(pgadmin/phppgadmin) I can see the black triangle with question mark
or nothing at all.

Quote:
Do you know which characters should be there in place of the invalid
bytes 0xFF, 0xA1, and 0xAC?
I haven't been able to identify what these characters should be in
the
businuss system but I am beinging to think it might just be junk, as
I said our postgres data is transformed from raw data in a flat
squencial
file and it might just be caused by bad interpritation. I guess it
makes
sense to take it from the source and try to cope in the initial data
transfer
rather than deal with it in the database.

Quote:
But this error is mysterious. Are the keys identical? Why does 8.1.3
not fail on this?
I assume its because the security patch for CVE-2006-2313 and the
proceeding similar error was patch in 8.1.4 and we were running
unpatched 8.1.3.

Thanks for you help and comments

David P



Reply With Quote
  #4  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: invalid byte sequence for encoding "UTF8" - 03-22-2007 , 04:33 AM



Fuzzydave <dav.phillips (AT) ntlworld (DOT) com> wrote:
Quote:
There are incorrect bytes in your database.

[...] though it seems to be something to do with the backslash being
delimiting quotes as it then means you can trail the quote with
instructions to allow SQL Injection to occur [...]
I must admit that I did not totally understand what you are saying here,
and in general a good deal of your reply is hard to understand for me,
but I don't believe that there is any connection to SQL injection
(which is a way of attacking a database program).

To get incorrect data into a UTF8 PostgreSQL database, all you have to
do is to

INSERT ... VALUES (E'This text has \377 in it')

Quote:
Do you get many such error messages or only a handful?

there about 2 dozen rows which isn't many but I can't identify what
character unicode is trying to add in as it returns <?> the question
mark in a black triangle and depending what interface I am using
(pgadmin/phppgadmin) I can see the black triangle with question mark
or nothing at all.

Do you know which characters should be there in place of the invalid
bytes 0xFF, 0xA1, and 0xAC?

I haven't been able to identify what these characters should be in the
businuss system but I am beinging to think it might just be junk [...]
In that case my suggestion is to replace the offending bytes in the dump
with a random character, perhaps with the 'tr' command as I have suggested.

2 dozen rows is not a lot, and once you have successfully loaded the fixed
dump into your new database, you can identify the rows and correct them
manually. It will make your work much easier that psql gives you the
line numbers of the lines with the incorrect bytes in them in the error
message!

[one error message is about a violated constraint]

Quote:
But this error is mysterious. Are the keys identical? Why does 8.1.3
not fail on this?

I assume its because the security patch for CVE-2006-2313 and the
proceeding similar error was patch in 8.1.4 and we were running
unpatched 8.1.3.
I don't believe that.

I'd be seriously surprised if you got no error message with 8.1.3
provided that the structure and contents of the database are identical.

I'd examine the record PostgreSQL complains about and look for duplicate
keys in the database.

Yours,
Laurenz Albe


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.