dbTalk Databases Forums  

"Fuzzy" text search using n-grams (bigrams) -- speed?

comp.databases.theory comp.databases.theory


Discuss "Fuzzy" text search using n-grams (bigrams) -- speed? in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
crazyhorse
 
Posts: n/a

Default "Fuzzy" text search using n-grams (bigrams) -- speed? - 10-24-2007 , 06:03 PM






Hi all,

Let's suppose I'm writing a website where users search for movie
titles, and suppose there are 200,000 movies. The site is in PHP/
MySQL.

I'd like to implement a "fuzzy" text search so that similar movie
titles come up in a list no matter if you include all the words or
even misspell them.

MySQL's fulltext-search is a first step, but it doesn't cover mis-
spellings.

My question is this: suppose I broke down every movie title into
bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
relationship between every bigram and every movie title that includes
it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
"movie_id").

I break down every search string into bigrams. Then, for every search,
I retrieve all the rows from BIGRAMS that include any of the bigrams
in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
and take the top matches.

Comments? Anybody have experience with this? Does it work? Or would it
be far too slow? Is there another better solution I'm missing? We're
talking 200,000 movies, which means BIGRAMS might have 4 million rows
or so.

Thanks
Michael


Reply With Quote
  #2  
Old   
Bob Badour
 
Posts: n/a

Default Re: "Fuzzy" text search using n-grams (bigrams) -- speed? - 10-24-2007 , 06:48 PM






crazyhorse wrote:
Quote:
Hi all,

Let's suppose I'm writing a website where users search for movie
titles, and suppose there are 200,000 movies. The site is in PHP/
MySQL.

I'd like to implement a "fuzzy" text search so that similar movie
titles come up in a list no matter if you include all the words or
even misspell them.

MySQL's fulltext-search is a first step, but it doesn't cover mis-
spellings.

My question is this: suppose I broke down every movie title into
bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
relationship between every bigram and every movie title that includes
it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
"movie_id").

I break down every search string into bigrams. Then, for every search,
I retrieve all the rows from BIGRAMS that include any of the bigrams
in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
and take the top matches.

Comments? Anybody have experience with this? Does it work? Or would it
be far too slow? Is there another better solution I'm missing? We're
talking 200,000 movies, which means BIGRAMS might have 4 million rows
or so.

Thanks
Michael
http://en.wikipedia.org/wiki/Stemming


Reply With Quote
  #3  
Old   
crazyhorse
 
Posts: n/a

Default Re: "Fuzzy" text search using n-grams (bigrams) -- speed? - 10-24-2007 , 11:53 PM



On Oct 24, 9:48 pm, Bob Badour <bbad... (AT) pei (DOT) sympatico.ca> wrote:
Quote:
crazyhorse wrote:
Hi all,

Let's suppose I'm writing a website where users search for movie
titles, and suppose there are 200,000 movies. The site is in PHP/
MySQL.

I'd like to implement a "fuzzy" text search so that similar movie
titles come up in a list no matter if you include all the words or
even misspell them.

MySQL's fulltext-search is a first step, but it doesn't cover mis-
spellings.

My question is this: suppose I broke down every movie title into
bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
relationship between every bigram and every movie title that includes
it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
"movie_id").

I break down every search string into bigrams. Then, for every search,
I retrieve all the rows from BIGRAMS that include any of the bigrams
in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
and take the top matches.

Comments? Anybody have experience with this? Does it work? Or would it
be far too slow? Is there another better solution I'm missing? We're
talking 200,000 movies, which means BIGRAMS might have 4 million rows
or so.

Thanks
Michael

http://en.wikipedia.org/wiki/Stemming
That doesn't really help... I'm trying to compensate for spelling
errors, not different grammatical forms of words...



Reply With Quote
  #4  
Old   
David Cressey
 
Posts: n/a

Default Re: "Fuzzy" text search using n-grams (bigrams) -- speed? - 10-25-2007 , 03:12 AM




"crazyhorse" <mjbaldwin (AT) yahoo (DOT) com> wrote

Quote:
Hi all,

Let's suppose I'm writing a website where users search for movie
titles, and suppose there are 200,000 movies. The site is in PHP/
MySQL.

I'd like to implement a "fuzzy" text search so that similar movie
titles come up in a list no matter if you include all the words or
even misspell them.

MySQL's fulltext-search is a first step, but it doesn't cover mis-
spellings.

My question is this: suppose I broke down every movie title into
bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
relationship between every bigram and every movie title that includes
it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
"movie_id").

I break down every search string into bigrams. Then, for every search,
I retrieve all the rows from BIGRAMS that include any of the bigrams
in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
and take the top matches.

Comments? Anybody have experience with this? Does it work? Or would it
be far too slow? Is there another better solution I'm missing? We're
talking 200,000 movies, which means BIGRAMS might have 4 million rows
or so.

Thanks
Michael

If you do a search on "Soundex" you will come up with some tools for
detecting near misses that sound right.

The most common typo is swapping two letters, and this kind of error would
probably defeat a pure soundex implementation. But I wouldn't be surprised
if sites that offer a soundex tool might offer what I'll call a "typex
tool", for lack of a better name.




Reply With Quote
  #5  
Old   
diogomartins
 
Posts: n/a

Default Re: "Fuzzy" text search using n-grams (bigrams) -- speed? - 10-25-2007 , 07:05 AM



Hi,

One common approach to deal with query terms mispellings in
Information Retrieval systems is to employ some sort of string
distance technique, whose most common approach is described in [http://
en.wikipedia.org/wiki/Levenshtein_distance]. Techniques like that can
identify, by syntax, the nearest terms, similarly to suggestions given
by spell checkers of word processors. I think that using stemming in
conjunction with string distance can satisfy your requirements. The
really hard part would be to put that functionalities inside MySQL....

[]'s

Diogo


On Oct 24, 8:03 pm, crazyhorse <mjbald... (AT) yahoo (DOT) com> wrote:
Quote:
Hi all,

Let's suppose I'm writing a website where users search for movie
titles, and suppose there are 200,000 movies. The site is in PHP/
MySQL.

I'd like to implement a "fuzzy" text search so that similar movie
titles come up in a list no matter if you include all the words or
even misspell them.

MySQL's fulltext-search is a first step, but it doesn't cover mis-
spellings.

My question is this: suppose I broke down every movie title into
bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
relationship between every bigram and every movie title that includes
it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
"movie_id").

I break down every search string into bigrams. Then, for every search,
I retrieve all the rows from BIGRAMS that include any of the bigrams
in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
and take the top matches.

Comments? Anybody have experience with this? Does it work? Or would it
be far too slow? Is there another better solution I'm missing? We're
talking 200,000 movies, which means BIGRAMS might have 4 million rows
or so.

Thanks
Michael



Reply With Quote
  #6  
Old   
crazyhorse
 
Posts: n/a

Default Re: "Fuzzy" text search using n-grams (bigrams) -- speed? - 10-25-2007 , 09:53 AM



On Oct 25, 10:05 am, diogomartins <diogomart... (AT) gmail (DOT) com> wrote:
Quote:
Hi,

One common approach to deal with query terms mispellings in
Information Retrieval systems is to employ some sort of string
distance technique, whose most common approach is described in [http://
en.wikipedia.org/wiki/Levenshtein_distance]. Techniques like that can
identify, by syntax, the nearest terms, similarly to suggestions given
by spell checkers of word processors. I think that using stemming in
conjunction with string distance can satisfy your requirements. The
really hard part would be to put that functionalities inside MySQL....

[]'s

Diogo

On Oct 24, 8:03 pm, crazyhorse <mjbald... (AT) yahoo (DOT) com> wrote:

Hi all,

Let's suppose I'm writing a website where users search for movie
titles, and suppose there are 200,000 movies. The site is in PHP/
MySQL.

I'd like to implement a "fuzzy" text search so that similar movie
titles come up in a list no matter if you include all the words or
even misspell them.

MySQL's fulltext-search is a first step, but it doesn't cover mis-
spellings.

My question is this: suppose I broke down every movie title into
bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
relationship between every bigram and every movie title that includes
it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
"movie_id").

I break down every search string into bigrams. Then, for every search,
I retrieve all the rows from BIGRAMS that include any of the bigrams
in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
and take the top matches.

Comments? Anybody have experience with this? Does it work? Or would it
be far too slow? Is there another better solution I'm missing? We're
talking 200,000 movies, which means BIGRAMS might have 4 million rows
or so.

Thanks
Michael

I think the string distance is too expensive to compute in a database,
and again, stemming is not really what I need.

For this application, it's also not just misspelled words -- it's
skipping a word in the movie title, or using an alternate form (i.e.
"Stephen" or "Steven"), or specifying a longer version when we only
have a shorter title in our database. I'm looking for a fuzzy,
flexible search in general that can be implemented in a database.

No real strong ideas for this, huh?



Reply With Quote
  #7  
Old   
Patricia Shanahan
 
Posts: n/a

Default Re: "Fuzzy" text search using n-grams (bigrams) -- speed? - 10-25-2007 , 10:06 AM



crazyhorse wrote:
....
Quote:
I think the string distance is too expensive to compute in a database,
and again, stemming is not really what I need.

For this application, it's also not just misspelled words -- it's
skipping a word in the movie title, or using an alternate form (i.e.
"Stephen" or "Steven"), or specifying a longer version when we only
have a shorter title in our database. I'm looking for a fuzzy,
flexible search in general that can be implemented in a database.

No real strong ideas for this, huh?

How about a multi-layer approach?

1. Look up the actual words against the actual movie titles. That will
pick up deliberately and distinctively unusual names and spellings. For
example, it will match "shrek".

2. Apply one or more layers of normalization, and check against a
normalized version of the titles.

For example, reduce all forms of names with multiple common forms to a
single form. Do spelling correction with a US English dictionary. Delete
articles ("the","a" ...).

Patricia


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.