![]() | |
![]() |
| | Thread Tools | Display Modes |
#41
| |||
| |||
|
|
"DA Morgan" <damorgan (AT) psoug (DOT) org> schreef in bericht news:1219086595.356745 (AT) bubbleator (DOT) drizzle.com... merkury wrote: Hi Roelof Schierbeek, Thank you for your reply. I am sorry but your solution will not work on our datasets. The first 15 characters may occur in each string even if they are not "near". What I need is, that many words must coincide and many 2word or 3word combinations (maybe even more) should be the same. I guess this is more likely to fit our needs. I suppose de-duplication is only relevant for duplicates not for "near" duplicates. So our problem is to find for each dataset all nearest neighbours. We (will) have 20 million datasets. This means a comparison of 20million times 20million comparisons, if there is no better approach. Thanks merkury R. Schierbeek schrieb: Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury Look at the UTL_MATCH built in package. It contains an API for both the JARO WANKLER and LEVENSHTEIN algorithms. -- Daniel A. Morgan Oracle Ace Director & Instructor University of Washington damorgan@x.washington.edu (replace x with u to respond) Puget Sound Oracle Users Group www.psoug.org I hope you mean Jaro Winkler... Shakespeare |
#42
| |||
| |||
|
|
"DA Morgan" <damorgan (AT) psoug (DOT) org> schreef in bericht news:1219086595.356745 (AT) bubbleator (DOT) drizzle.com... merkury wrote: Hi Roelof Schierbeek, Thank you for your reply. I am sorry but your solution will not work on our datasets. The first 15 characters may occur in each string even if they are not "near". What I need is, that many words must coincide and many 2word or 3word combinations (maybe even more) should be the same. I guess this is more likely to fit our needs. I suppose de-duplication is only relevant for duplicates not for "near" duplicates. So our problem is to find for each dataset all nearest neighbours. We (will) have 20 million datasets. This means a comparison of 20million times 20million comparisons, if there is no better approach. Thanks merkury R. Schierbeek schrieb: Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury Look at the UTL_MATCH built in package. It contains an API for both the JARO WANKLER and LEVENSHTEIN algorithms. -- Daniel A. Morgan Oracle Ace Director & Instructor University of Washington damorgan@x.washington.edu (replace x with u to respond) Puget Sound Oracle Users Group www.psoug.org I hope you mean Jaro Winkler... Shakespeare |
#43
| |||
| |||
|
|
"DA Morgan" <damorgan (AT) psoug (DOT) org> schreef in bericht news:1219086595.356745 (AT) bubbleator (DOT) drizzle.com... merkury wrote: Hi Roelof Schierbeek, Thank you for your reply. I am sorry but your solution will not work on our datasets. The first 15 characters may occur in each string even if they are not "near". What I need is, that many words must coincide and many 2word or 3word combinations (maybe even more) should be the same. I guess this is more likely to fit our needs. I suppose de-duplication is only relevant for duplicates not for "near" duplicates. So our problem is to find for each dataset all nearest neighbours. We (will) have 20 million datasets. This means a comparison of 20million times 20million comparisons, if there is no better approach. Thanks merkury R. Schierbeek schrieb: Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury Look at the UTL_MATCH built in package. It contains an API for both the JARO WANKLER and LEVENSHTEIN algorithms. -- Daniel A. Morgan Oracle Ace Director & Instructor University of Washington damorgan@x.washington.edu (replace x with u to respond) Puget Sound Oracle Users Group www.psoug.org I hope you mean Jaro Winkler... Shakespeare |
![]() |
| Thread Tools | |
| Display Modes | |
| |