![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
I'm trying to use tsearch2 with database which is in 'UNICODE' encoding. It works fine for English text, but as I intend to search Polish texts I did: insert into pg_ts_cfg('default_polish', 'default', 'pl_PL.UTF-8'); (and I updated other pg_ts_* tables as written in manual). However, Polish-specific chars are being eaten alive, it seems. I.e. doing select to_tsvector('default_polish', body) from messages; results in list of words but with national chars stripped... I wonder, am I doing something wrong, or just tsearch2 doesn't grok Unicode, despite the locales setting? This also is a good question regarding ispell_dict and its feelings regarding Unicode, but that's another story. Assuming Unicode unsupported means I should perhaps... oh, convert the data to iso8859 prior feeding it to_tsvector()... interesting idea, but so far I have failed to actually do it. Maybe store the data as 'bytea' and add a column with encoding information (assuming I don't want to recreate whole database with new encoding, and that I want to use unicode for some columns (so I don't have to keep encoding with every text everywhere...). And while we are at it, how do you feel -- an extra column with tsvector and its index -- would it be OK to keep it away from my data (so I can safely get rid of them if need be)? [ I intend to keep index of around 2 000 000 records, few KBs of text each ]... Regards, Dawid Kuroczko ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html |
#3
| |||
| |||
|
|
-----Ursprüngliche Nachricht----- Von: pgsql-general-owner (AT) postgresql (DOT) org [mailto gsql-general-owner (AT) postgresql (DOT) org] Im Auftrag vonOleg Bartunov Gesendet: Mittwoch, 17. November 2004 17:32 An: Dawid Kuroczko Cc: Pgsql General Betreff: Re: [GENERAL] Tsearch2 and Unicode? Dawid, unfortunately, tsearch2 doesn't support unicode yet. If you keep tsvector separately from data than you'll need one more join. Oleg |
#4
| |||
| |||
|
|
Oleg, what exactly do you mean by "tsearch2 doesn't support unicode yet"? It does seem to work fine in my database, it seems: ./pg_controldata [mycluster] gives me pg_control version number: 72 [...] LC_COLLATE: de_DE.UTF-8 LC_CTYPE: de_DE.UTF-8 |
#5
| |||
| |||
|
|
Correct me if I am wrong, but I think that UTF-8 is almost identical to ISO-8859-1 in binary form to ISO-8859-1. I mean, UTF-8 is ISO-8859-1 plus multibyte characters from other charsets. |
#6
| |||
| |||
|
|
-----Ursprüngliche Nachricht----- Von: pgsql-general-owner (AT) postgresql (DOT) org [mailto gsql-general-owner (AT) postgresql (DOT) org] Im Auftrag vonDawid Kuroczko Gesendet: Mittwoch, 17. November 2004 17:17 An: Pgsql General Betreff: [GENERAL] Tsearch2 and Unicode? I'm trying to use tsearch2 with database which is in 'UNICODE' encoding. It works fine for English text, but as I intend to search Polish texts I did: insert into pg_ts_cfg('default_polish', 'default', 'pl_PL.UTF-8'); (and I updated other pg_ts_* tables as written in manual). However, Polish-specific chars are being eaten alive, it seems. I.e. doing select to_tsvector('default_polish', body) from messages; results in list of words but with national chars stripped... I wonder, am I doing something wrong, or just tsearch2 doesn't grok Unicode, despite the locales setting? This also is a good question regarding ispell_dict and its feelings regarding Unicode, but that's another story. Assuming Unicode unsupported means I should perhaps... oh, convert the data to iso8859 prior feeding it to_tsvector()... interesting idea, but so far I have failed to actually do it. Maybe store the data as 'bytea' and add a column with encoding information (assuming I don't want to recreate whole database with new encoding, and that I want to use unicode for some columns (so I don't have to keep encoding with every text everywhere...). And while we are at it, how do you feel -- an extra column with tsvector and its index -- would it be OK to keep it away from my data (so I can safely get rid of them if need be)? [ I intend to keep index of around 2 000 000 records, few KBs of text each ]... Regards, Dawid Kuroczko ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html |
#7
| |||
| |||
|
|
Hi! I dug through my list-archives - I actually used to have the very same problem that you described: special chars being swallowed by tsearch2-functions. The source of the problem was that I had INITDB'ed my cluster with DE@euro as locale, whereas my databases used Unicode encoding. This does not work correctly. I had to dump, initdb to the correct UTF-8-locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly. You may find the original discussion here: http://archives.postgresql.org/pgsql...7/msg00620.php If you wish to find out which locale was used during INITDB for your cluster, you may use the pg_controldata program that's supplied with PostgreSQL. Kind regards Markus -----Ursprüngliche Nachricht----- Von: pgsql-general-owner (AT) postgresql (DOT) org [mailto gsql-general-owner (AT) postgresql (DOT) org] Im Auftrag vonDawid Kuroczko Gesendet: Mittwoch, 17. November 2004 17:17 An: Pgsql General Betreff: [GENERAL] Tsearch2 and Unicode? I'm trying to use tsearch2 with database which is in 'UNICODE' encoding. It works fine for English text, but as I intend to search Polish texts I did: insert into pg_ts_cfg('default_polish', 'default', 'pl_PL.UTF-8'); (and I updated other pg_ts_* tables as written in manual). However, Polish-specific chars are being eaten alive, it seems. I.e. doing select to_tsvector('default_polish', body) from messages; results in list of words but with national chars stripped... I wonder, am I doing something wrong, or just tsearch2 doesn't grok Unicode, despite the locales setting? This also is a good question regarding ispell_dict and its feelings regarding Unicode, but that's another story. Assuming Unicode unsupported means I should perhaps... oh, convert the data to iso8859 prior feeding it to_tsvector()... interesting idea, but so far I have failed to actually do it. Maybe store the data as 'bytea' and add a column with encoding information (assuming I don't want to recreate whole database with new encoding, and that I want to use unicode for some columns (so I don't have to keep encoding with every text everywhere...). And while we are at it, how do you feel -- an extra column with tsvector and its index -- would it be OK to keep it away from my data (so I can safely get rid of them if need be)? [ I intend to keep index of around 2 000 000 records, few KBs of text each ]... Regards, Dawid Kuroczko ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo (AT) postgresql (DOT) org so that your message can get through to the mailing list cleanly |
![]() |
| Thread Tools | |
| Display Modes | |
| |