![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
Hi all, Let's suppose I'm writing a website where users search for movie titles, and suppose there are 200,000 movies. The site is in PHP/ MySQL. I'd like to implement a "fuzzy" text search so that similar movie titles come up in a list no matter if you include all the words or even misspell them. MySQL's fulltext-search is a first step, but it doesn't cover mis- spellings. My question is this: suppose I broke down every movie title into bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a relationship between every bigram and every movie title that includes it in a separate table, BIGRAMS. (Will have two fields, "bigram" and "movie_id"). I break down every search string into bigrams. Then, for every search, I retrieve all the rows from BIGRAMS that include any of the bigrams in the search string. I then calculate COUNT(movie_id) ORDER BY DESC and take the top matches. Comments? Anybody have experience with this? Does it work? Or would it be far too slow? Is there another better solution I'm missing? We're talking 200,000 movies, which means BIGRAMS might have 4 million rows or so. Thanks Michael |
#3
| |||
| |||
|
|
crazyhorse wrote: Hi all, Let's suppose I'm writing a website where users search for movie titles, and suppose there are 200,000 movies. The site is in PHP/ MySQL. I'd like to implement a "fuzzy" text search so that similar movie titles come up in a list no matter if you include all the words or even misspell them. MySQL's fulltext-search is a first step, but it doesn't cover mis- spellings. My question is this: suppose I broke down every movie title into bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a relationship between every bigram and every movie title that includes it in a separate table, BIGRAMS. (Will have two fields, "bigram" and "movie_id"). I break down every search string into bigrams. Then, for every search, I retrieve all the rows from BIGRAMS that include any of the bigrams in the search string. I then calculate COUNT(movie_id) ORDER BY DESC and take the top matches. Comments? Anybody have experience with this? Does it work? Or would it be far too slow? Is there another better solution I'm missing? We're talking 200,000 movies, which means BIGRAMS might have 4 million rows or so. Thanks Michael http://en.wikipedia.org/wiki/Stemming |
#4
| |||
| |||
|
|
Hi all, Let's suppose I'm writing a website where users search for movie titles, and suppose there are 200,000 movies. The site is in PHP/ MySQL. I'd like to implement a "fuzzy" text search so that similar movie titles come up in a list no matter if you include all the words or even misspell them. MySQL's fulltext-search is a first step, but it doesn't cover mis- spellings. My question is this: suppose I broke down every movie title into bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a relationship between every bigram and every movie title that includes it in a separate table, BIGRAMS. (Will have two fields, "bigram" and "movie_id"). I break down every search string into bigrams. Then, for every search, I retrieve all the rows from BIGRAMS that include any of the bigrams in the search string. I then calculate COUNT(movie_id) ORDER BY DESC and take the top matches. Comments? Anybody have experience with this? Does it work? Or would it be far too slow? Is there another better solution I'm missing? We're talking 200,000 movies, which means BIGRAMS might have 4 million rows or so. Thanks Michael |
#5
| |||
| |||
|
|
Hi all, Let's suppose I'm writing a website where users search for movie titles, and suppose there are 200,000 movies. The site is in PHP/ MySQL. I'd like to implement a "fuzzy" text search so that similar movie titles come up in a list no matter if you include all the words or even misspell them. MySQL's fulltext-search is a first step, but it doesn't cover mis- spellings. My question is this: suppose I broke down every movie title into bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a relationship between every bigram and every movie title that includes it in a separate table, BIGRAMS. (Will have two fields, "bigram" and "movie_id"). I break down every search string into bigrams. Then, for every search, I retrieve all the rows from BIGRAMS that include any of the bigrams in the search string. I then calculate COUNT(movie_id) ORDER BY DESC and take the top matches. Comments? Anybody have experience with this? Does it work? Or would it be far too slow? Is there another better solution I'm missing? We're talking 200,000 movies, which means BIGRAMS might have 4 million rows or so. Thanks Michael |
#6
| |||
| |||
|
|
Hi, One common approach to deal with query terms mispellings in Information Retrieval systems is to employ some sort of string distance technique, whose most common approach is described in [http:// en.wikipedia.org/wiki/Levenshtein_distance]. Techniques like that can identify, by syntax, the nearest terms, similarly to suggestions given by spell checkers of word processors. I think that using stemming in conjunction with string distance can satisfy your requirements. The really hard part would be to put that functionalities inside MySQL.... []'s Diogo On Oct 24, 8:03 pm, crazyhorse <mjbald... (AT) yahoo (DOT) com> wrote: Hi all, Let's suppose I'm writing a website where users search for movie titles, and suppose there are 200,000 movies. The site is in PHP/ MySQL. I'd like to implement a "fuzzy" text search so that similar movie titles come up in a list no matter if you include all the words or even misspell them. MySQL's fulltext-search is a first step, but it doesn't cover mis- spellings. My question is this: suppose I broke down every movie title into bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a relationship between every bigram and every movie title that includes it in a separate table, BIGRAMS. (Will have two fields, "bigram" and "movie_id"). I break down every search string into bigrams. Then, for every search, I retrieve all the rows from BIGRAMS that include any of the bigrams in the search string. I then calculate COUNT(movie_id) ORDER BY DESC and take the top matches. Comments? Anybody have experience with this? Does it work? Or would it be far too slow? Is there another better solution I'm missing? We're talking 200,000 movies, which means BIGRAMS might have 4 million rows or so. Thanks Michael |
#7
| |||
| |||
|
|
I think the string distance is too expensive to compute in a database, and again, stemming is not really what I need. For this application, it's also not just misspelled words -- it's skipping a word in the movie title, or using an alternate form (i.e. "Stephen" or "Steven"), or specifying a longer version when we only have a shorter title in our database. I'm looking for a fuzzy, flexible search in general that can be implemented in a database. No real strong ideas for this, huh? |
![]() |
| Thread Tools | |
| Display Modes | |
| |