Full text search

comp.databases.postgresql

#1
Mladen Gogala
Full text search - 02-10-2010, 01:04 PM

Where can I find the list of separator characters for the configuration
named "english"? In other words, I need the list of characters which
delimit words.



--
http://mgogala.freehostia.com

#2
Mladen Gogala
Re: Full text search - 02-10-2010, 04:12 PM

On Wed, 10 Feb 2010 18:04:08 +0000, Mladen Gogala wrote:

> Where can I find the list of separator characters for the configuration
> named "english"? In other words, I need the list of characters which
> delimit words.

Also, is there a way to use text search to search for phrases? For example,
a search for "chicken salad" should not return "fried chicken with potato
salad".



--
http://mgogala.freehostia.com

#3
Laurenz Albe
Re: Full text search - 02-11-2010, 09:30 AM

Mladen Gogala wrote:
> On Wed, 10 Feb 2010 18:04:08 +0000, Mladen Gogala wrote:
>
>> Where can I find the list of separator characters for the configuration
>> named "english"? In other words, I need the list of characters which
>> delimit words.
>
> Also, is there a way to use text search and search for phrases? Something
> like "chicken salad" should not return "fried chicken with potato salad"?

About separator characters: it is more complicated than a single list of
characters.

First the parser is invoked to create tokens, which are then run
through the dictionary.

Consider this example:

test=> SELECT alias, '>' || token || '<' AS token, lexemes
test->   FROM ts_debug('english', 'Examples: for "various" nilly-willy tökens');

      alias      |     token     |    lexemes
-----------------+---------------+---------------
 asciiword       | >Examples<    | {exampl}
 blank           | >: <          |
 asciiword       | >for<         | {}
 blank           | > "<          |
 asciiword       | >various<     | {various}
 blank           | >" <          |
 asciihword      | >nilly-willy< | {nilly-willi}
 hword_asciipart | >nilly<       | {nilli}
 blank           | >-<           |
 hword_asciipart | >willy<       | {willi}
 blank           | > <           |
 word            | >tökens<      | {töken}
(12 rows)

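"token" is the output of the parser; "lexemes" is what the dictionary
makes of each token.
Would you say that "-" delimits words or not? It shows up both inside the
asciihword token "nilly-willy" and as a blank token between its two parts.
Maybe it's better to say that certain characters delimit certain token types.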
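
To look at the two stages separately, something like the following should
work. It is only a sketch: it assumes the built-in parser named "default"
and the "english_stem" dictionary, which the stock "english" configuration
uses for word tokens. Your configuration may map token types differently.

```sql
-- Stage 1: the parser alone. ts_parse() splits the text into tokens;
-- ts_token_type() translates each numeric token id into its alias.
SELECT t.alias, '>' || p.token || '<' AS token
FROM ts_parse('default', 'Examples: for "various" nilly-willy') AS p
JOIN ts_token_type('default') AS t USING (tokid);

-- Stage 2: one dictionary alone. ts_lexize() shows what a single
-- dictionary makes of a single token.
SELECT ts_lexize('english_stem', 'Examples');
```

ts_debug() is essentially a convenience wrapper that shows both stages side
by side for a whole configuration.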

I guess you want to know which characters make "blank" tokens.

Unless you want to dig into the code, I'd say, experiment with
queries like

SELECT alias FROM ts_debug('@');
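
A rough way to run that experiment over all printable ASCII characters at
once is sketched below. It assumes the "english" configuration and tests
each character on its own, so it will not show context effects such as the
hyphen in "nilly-willy" above.

```sql
-- For every printable ASCII character, list the token types the parser
-- assigns when that character is the entire document.
SELECT c AS code,
       chr(c) AS ch,
       ARRAY(SELECT alias FROM ts_debug('english', chr(c))) AS aliases
FROM generate_series(33, 126) AS c
ORDER BY c;
```

Rows whose only alias is "blank" are candidates for what you would call
separators. To check a character in context, replace chr(c) with something
like 'a' || chr(c) || 'b'.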

About searching for a phrase, you can use the ranking function "ts_rank_cd"
with normalization 4, which divides the rank by the "mean harmonic distance"
between the matches.

The higher the rank, the closer together the matched words are. Word order
is ignored, so it is not exactly what you want, but it should help.

Compare

test=> WITH vals AS (
test->    SELECT to_tsvector('english', 'fried chicken with potato salad') AS searchvector,
test->           to_tsquery('english', 'chicken & salad') AS query
test-> ) SELECT searchvector @@ query,
test->          ts_rank_cd(searchvector, query, 4)
test->   FROM vals;

 ?column? | ts_rank_cd
----------+------------
 t        |  0.0333333
(1 row)

and

test=> WITH vals AS (
test->    SELECT to_tsvector('english', 'fried chicken with potato salad') AS searchvector,
test->           to_tsquery('english', 'chicken & fried') AS query
test-> ) SELECT searchvector @@ query,
test->          ts_rank_cd(searchvector, query, 4)
test->   FROM vals;

 ?column? | ts_rank_cd
----------+------------
 t        |        0.1
(1 row)
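
If an exact phrase is required, one possible workaround is to let the text
search find candidate rows and then recheck the original text with a plain
pattern match. This is only a sketch: the table "recipes" and its column
"body" are made up for the example, and the recheck is a literal,
case-insensitive substring test, so it does no stemming.

```sql
-- Hypothetical table, for illustration only.
CREATE TEMP TABLE recipes (body text);
INSERT INTO recipes VALUES
    ('fried chicken with potato salad'),
    ('chicken salad with walnuts');

SELECT body,
       ts_rank_cd(to_tsvector('english', body), query, 4) AS rank
FROM recipes,
     to_tsquery('english', 'chicken & salad') AS query
WHERE to_tsvector('english', body) @@ query   -- candidates via text search
  AND body ILIKE '%chicken salad%'            -- recheck for the literal phrase
ORDER BY rank DESC;
```

Both rows match the text search condition, but only the second one should
survive the ILIKE recheck.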

Yours,
Laurenz Albe
