On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog (AT) gmail (DOT) com wrote:
Quote:
The following bug has been logged on the website:
Bug reference: 6375
Logged by: Valentine Gogichashvili
Email address: valgog (AT) gmail (DOT) com
PostgreSQL version: 9.1.1
Operating system: Debian 4.4.5-8
Description:
Hello,
The default tsearch parser does not recognize all valid email addresses and
tokenizes them as text, splitting them into tokens.
For example:
postgres=# select to_tsquery('simple', 'normal@email.com');
to_tsquery
--------------------
'normal@email.com'
(1 row)
here it behaves ok;
postgres=# select to_tsquery('simple', '-still-normal@email.com');
to_tsquery
--------------------------
'still-normal@email.com'
(1 row)
here it trims '-' from the beginning of an email. This is not correct, but
will at least find that email.
postgres=# select to_tsquery('simple', '-not-normal-with-dash-@email.com');
to_tsquery
---------------------------------------------------------------------------
'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
(1 row)
and this is now a real problem as it leads to finding emails that are not
the same, but are "super-sets" of that one.
Other valid email characters that are not handled correctly include at
least '+' and '.'.
Yep. :-(
You can see the oddness here:
test=> SELECT alias, description, token FROM ts_debug('-myname@gmail.com');
alias | description | token
-------+---------------+------------------
blank | Space symbols | -
email | Email address | myname@gmail.com
(2 rows)
test=> SELECT alias, description, token FROM ts_debug('-myna-me@gmail.com');
alias | description | token
-------+---------------+-------------------
blank | Space symbols | -
email | Email address | myna-me@gmail.com
(2 rows)
test=> SELECT alias, description, token FROM ts_debug('-myna-me-@gmail.com');
alias | description | token
-----------------+---------------------------------+-----------
blank | Space symbols | -
asciihword | Hyphenated word, all ASCII | myna-me
hword_asciipart | Hyphenated word part, all ASCII | myna
blank | Space symbols | -
hword_asciipart | Hyphenated word part, all ASCII | me
blank | Space symbols | -@
host | Host | gmail.com
(7 rows)
The first and second examples show that the leading dash is split off.
The third one shows that a trailing dash causes the middle dash to be
split off as well.
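The same check can be repeated for the other characters the report mentions ('+' and '.'). The following is only a sketch: the addresses are made-up examples, and no output is shown because the result depends on the parser version in use.

```sql
-- Hypothetical addresses; inspect how the default parser tokenizes
-- a plus sign and a dot in the user-name part.
SELECT alias, description, token FROM ts_debug('user+tag@example.com');
SELECT alias, description, token FROM ts_debug('first.last@example.com');
```

If the whole address comes back as a single token with alias "email", the parser accepted it; several rows of word and blank tokens indicate it was split as in the third example above.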
This email thread from 2010 has a similar problem:
http://archives.postgresql.org/pgsql...0/msg00772.php
What limits a fix for this is that it would change existing behavior
and would break indexes that are carried over by pg_upgrade.
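To get a rough idea of which indexes would be exposed to such a parser change, one could list the indexes whose definition calls to_tsvector. This is a minimal sketch, not a complete audit: it relies on a text match against the standard pg_indexes view and will miss indexes built on a stored tsvector column.

```sql
-- Expression indexes that call to_tsvector directly; these store
-- tokens produced by the parser at index build time.
SELECT schemaname, tablename, indexname, indexdef
FROM pg_indexes
WHERE indexdef ILIKE '%to_tsvector%';
```

Any index found this way would presumably need a REINDEX after a change in tokenization, since its stored tokens would no longer agree with what queries produce.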
I have added your email to the existing TODO item:
http://wiki.postgresql.org/wiki/Todo#Text_Search
Improve handling of dash and plus signs in email address user names, and
perhaps improve URL parsing
http://archives.postgresql.org/pgsql...0/msg00772.php
tsearch does not recognize all valid emails
--
Bruce Momjian <bruce (AT) momjian (DOT) us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +