[BUGS] BUG #6375: tsearch does not recognize all valid emails

valgog@gmail.com - 01-03-2012, 12:04 PM

The following bug has been logged on the website:

Bug reference: 6375
Logged by: Valentine Gogichashvili
Email address: valgog (AT) gmail (DOT) com
PostgreSQL version: 9.1.1
Operating system: Debian 4.4.5-8
Description:

Hello,

The default tsearch parser does not recognize all valid email addresses.
It treats some of them as plain text and splits them into several tokens.

For example:

postgres=# select to_tsquery('simple', 'normal@email.com');
     to_tsquery
--------------------
 'normal@email.com'
(1 row)

here it behaves ok;

postgres=# select to_tsquery('simple', '-still-normal@email.com');
        to_tsquery
--------------------------
 'still-normal@email.com'
(1 row)

here it trims '-' from the beginning of an email. This is not correct, but
will at least find that email.

postgres=# select to_tsquery('simple', '-not-normal-with-dash-@email.com');
                                to_tsquery
---------------------------------------------------------------------------
 'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
(1 row)

and this is now a real problem as it leads to finding emails that are not
the same, but are "super-sets" of that one.
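
A toy model may make the "super-set" problem concrete. The sketch below is not PostgreSQL code: it assumes a query is just a set of lexemes joined by AND, uses the six lexemes from the output above, and tests them against a hypothetical document whose lexeme set is invented for illustration.

```python
# Toy model of an AND-only tsquery: a document matches when it contains
# every lexeme of the query. This only illustrates the set logic and is
# NOT how PostgreSQL implements text search.

def matches(query_lexemes, document_lexemes):
    """Return True when every query lexeme occurs in the document."""
    return set(query_lexemes) <= set(document_lexemes)

# Lexemes reported for '-not-normal-with-dash-@email.com' in the output above.
split_query = ["not-normal-with-dash", "not", "normal", "with", "dash", "email.com"]

# Hypothetical document that never contains the address itself, only the
# hyphenated word and the host name. This lexeme set is an assumption.
unrelated_doc = {"not-normal-with-dash", "not", "normal", "with", "dash",
                 "see", "email.com", "for", "details"}

# Had the address been kept as one email token, the query would be a
# single lexeme, and the unrelated document would not contain it.
whole_query = ["-not-normal-with-dash-@email.com"]

print(matches(split_query, unrelated_doc))  # True: a false positive
print(matches(whole_query, unrelated_doc))  # False
```

In this model the split query matches any document that happens to contain all six pieces, which is the kind of over-matching the report describes.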

Other valid email characters that are not handled correctly include at
least '+' and '.'.

With my best regards,

-- Valentine Gogichashvili


--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Bruce Momjian - 02-07-2012, 11:41 AM

On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog (AT) gmail (DOT) com wrote:
> [...]
> Valid email characters, that are not correctly treated also are at least '+'
> and '.'

Yep. :-(

You can see the oddness here:

test=> SELECT alias, description, token FROM ts_debug('-myname@gmail.com');
 alias |  description  |      token
-------+---------------+------------------
 blank | Space symbols | -
 email | Email address | myname@gmail.com
(2 rows)

test=> SELECT alias, description, token FROM ts_debug('-myna-me@gmail.com');
 alias |  description  |       token
-------+---------------+-------------------
 blank | Space symbols | -
 email | Email address | myna-me@gmail.com
(2 rows)

test=> SELECT alias, description, token FROM ts_debug('-myna-me-@gmail.com');
      alias      |           description           |   token
-----------------+---------------------------------+-----------
 blank           | Space symbols                   | -
 asciihword      | Hyphenated word, all ASCII      | myna-me
 hword_asciipart | Hyphenated word part, all ASCII | myna
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | me
 blank           | Space symbols                   | -@
 host            | Host                            | gmail.com
(7 rows)

The first and second show that the leading dash is separated. The third
one shows that a trailing dash causes the middle dash to be separated as
well.

This email thread from 2010 has a similar problem:

http://archives.postgresql.org/pgsql...0/msg00772.php

What is limiting a fix for this is the breaking of existing behavior,
and the breaking of indexes used during pg_upgrade.
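
Until the parser changes, one possible workaround is to pull addresses out of the text in application code and compare them exactly, rather than relying on the parser's email token. The following is a minimal sketch under that assumption; `EMAIL_RE` and `extract_emails` are hypothetical names, and the pattern is deliberately simplified rather than a full RFC 5322 matcher.

```python
import re

# Simplified address pattern (an assumption for this sketch, not RFC 5322):
# a local part of letters, digits, '.', '+', '_' and '-', then '@', then a
# dotted host name. Leading and trailing dashes in the local part are kept.
EMAIL_RE = re.compile(r"[A-Za-z0-9.+_-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_emails(text):
    """Return every address-like substring of text, in order of appearance."""
    return EMAIL_RE.findall(text)

sample = "Mail -myna-me-@gmail.com or my+name@example.org."
print(extract_emails(sample))
# ['-myna-me-@gmail.com', 'my+name@example.org']
```

The extracted strings could then be stored and compared with plain equality, which avoids the tokenizer entirely for address lookups.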

I have added your email to the existing TODO item:

http://wiki.postgresql.org/wiki/Todo#Text_Search

Improve handling of dash and plus signs in email address user names, and
perhaps improve URL parsing

http://archives.postgresql.org/pgsql...0/msg00772.php
tsearch does not recognize all valid emails

--
Bruce Momjian <bruce (AT) momjian (DOT) us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
