dbTalk Databases Forums  

Approximate String Matching in RDBMS

comp.databases.theory comp.databases.theory


Discuss Approximate String Matching in RDBMS in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
PRAGMATECH
 
Posts: n/a

Default Approximate String Matching in RDBMS - 01-19-2011 , 09:07 AM






Dear All,

Does someone know what is the best theoretical complexity in the
literature, as O(f(N)), where N is the number of rows in a table t
and
where approximate string matching on some field in t is to be done.

I am not asking about the complexity of a single approximate match
between 2 strings with lengths n and m, but on the cost of the lookup
over the entire table.
In other words, what is the cost of the following, in terms of N, the
number of rows in t:

SELECT * FROM t WHERE APPROXIMATELY_EQUAL(t.a, threshold)

I thank you in advance for any feedback.

Best Regards,
Walid Saba

Reply With Quote
  #2  
Old   
Nilone
 
Posts: n/a

Default Re: Approximate String Matching in RDBMS - 01-20-2011 , 10:57 AM






On Jan 19, 5:07*pm, PRAGMATECH <walid.s... (AT) gmail (DOT) com> wrote:
Quote:
Dear All,

Does someone know what is the best theoretical complexity in the
literature, as O(f(N)), where N is the number of rows in a table t
and
where approximate string matching on some field in t is to be done.

I am not asking about the complexity of a single approximate match
between 2 strings with lengths n and m, but on the cost of the lookup
over the entire table.
In other words, what is the cost of the following, in terms of N, the
number of rows in t:

SELECT * FROM t WHERE APPROXIMATELY_EQUAL(t.a, threshold)

I thank you in advance for any feedback.

Best Regards,
Walid Saba
While looking into your question, I came across this:
http://blog.notdot.net/2010/07/Damn-...htein-Automata

I didn't see the answer to your question, but I hope this helps.

Reply With Quote
  #3  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate String Matching in RDBMS - 02-01-2011 , 01:07 PM



On Jan 19, 7:07*am, PRAGMATECH <walid.s... (AT) gmail (DOT) com> wrote:
Quote:
Dear All,

Does someone know what is the best theoretical complexity in the
literature, as O(f(N)), where N is the number of rows in a table t
and
where approximate string matching on some field in t is to be done.

I am not asking about the complexity of a single approximate match
between 2 strings with lengths n and m, but on the cost of the lookup
over the entire table.
In other words, what is the cost of the following, in terms of N, the
number of rows in t:

SELECT * FROM t WHERE APPROXIMATELY_EQUAL(t.a, threshold)

I thank you in advance for any feedback.

Best Regards,
Walid Saba
Hi, Walid.

I may be misunderstanding your question, so apologies if that's the
case. The naive but arguably most straightforward method of
approximate matching is pair-wise, i.e., a self join and test of the
two comparands against a threshold, e.g.:

SELECT t1.key1, t2.key1
FROM t t1, t t2
WHERE APPROXIMATELY_EQUAL(t1.a, t2.a) >= threshold

APPROXIMATELY_EQUAL could be Levenshtein, Jaro-Winkler, Jaccard,
Cosine, or The New Walid Algorithm, among others.

Such a query realizes a Cartesian product, and the cost of comparing
each row to every other row is a quadratic time process, i.e., N^2.

You can reduce the potential size of this result set by adding an
additional predicate to filter out rows joined to themselves
(reflexive rows), as well as reciprocals (symmetric rows), such as
"t1.key1 > t2.key1." Thus, the number of theoretically matched and
*returned* rows is reduced from N^2 to N(N-1)/2.

There are oodles of algorithms out there that describe methods to
reduce the complexity of the pair-wise comparison process. One of the
most venerable is the sorted neighborhood method. Its cost is Nlog2N,
with N being the size of the sliding window (the "neighborhood").

--Jeff

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.