![]() | |
![]() |
| | Thread Tools | Display Modes |
#21
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
#22
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
#23
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
#24
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
#25
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
#26
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
#27
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
#28
| |||
| |||
|
|
On Jun 19, 9:36 am, walid.s... (AT) gmail (DOT) com wrote: Dear all, I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index). I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear? I thank you in advance... Walid, I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter. I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising. There are scores of excellent papers out there, but two of my current favorites are: "A Comparison of String Distance Metrics for Name-Matching Tasks" available here:http://tinyurl.com/4b8jao --and-- "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:http://tinyurl.com/3nt3vj Here are links to some researchers that are leaders in this area: http://www.cs.umd.edu/~indrajit/HTML...pers/index.php Also, these books may help: _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson Finally, some software: http://www.dcs.shef.ac.uk/~sam/strin...urceforge.net/ Regards, --Jeff |
![]() |
| Thread Tools | |
| Display Modes | |
| |