dbTalk Databases Forums  

Approximate/Fuzzy Search for strings keys

comp.databases.theory comp.databases.theory


Discuss Approximate/Fuzzy Search for strings keys in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #21  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM






On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
  #22  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM






On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
  #23  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM



On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
  #24  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM



On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
  #25  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM



On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
  #26  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM



On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
  #27  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM



On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
  #28  
Old   
jefftyzzer
 
Posts: n/a

Default Re: Approximate/Fuzzy Search for strings keys - 06-20-2008 , 11:47 AM



On Jun 19, 7:47 pm, jefftyzzer <jefftyz... (AT) sbcglobal (DOT) net> wrote:
Quote:
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote:



Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here:http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here:http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML...pers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/

Regards,

--Jeff
Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/image...comparison.jpg

--Jeff


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.