![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury |
#3
| |||
| |||
|
|
Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury |
#4
| |||
| |||
|
|
Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury |
#5
| |||
| |||
|
|
Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury |
#6
| |||
| |||
|
|
Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury |
#7
| |||
| |||
|
|
Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury |
#8
| |||
| |||
|
|
Hello merkury, You might try the instr function: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.key != E.key OR: select s.naam ,s.key from temp S , temp E where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 ) and s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text)) ); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury |
#9
| |||
| |||
|
|
Hi Roelof Schierbeek, Thank you for your reply. I am sorry but your solution will not work on our datasets. The first 15 characters may occur in each string even if they are not "near". What I need is, that many words must coincide and many 2word or 3word combinations (maybe even more) should be the same. I guess this is more likely to fit our needs. I suppose de-duplication is only relevant for duplicates not for "near" duplicates. So our problem is to find for each dataset all nearest neighbours. We (will) have 20 million datasets. This means a comparison of 20million times 20million comparisons, if there is no better approach. Thanks merkury R. Schierbeek schrieb: Hello merkury, You might try the instr function: * select s.naam ,s.key * from * temp *S * , * * *temp *E * where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 * * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) >0 * * * * ) * and * s.key *!= E.key OR: * * select s.naam ,s.key * * from * temp *S * * , * * *temp *E * * where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) >0 * * * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15)) > 0 * * * * * ) * * and * s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( * *translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') *,1, nvl(p_length, LENGTH (p_text)) *); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten * *Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.oberm... (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: * 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) * 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury- Hide quoted text - - Show quoted text - |
#10
| |||
| |||
|
|
Hi Roelof Schierbeek, Thank you for your reply. I am sorry but your solution will not work on our datasets. The first 15 characters may occur in each string even if they are not "near". What I need is, that many words must coincide and many 2word or 3word combinations (maybe even more) should be the same. I guess this is more likely to fit our needs. I suppose de-duplication is only relevant for duplicates not for "near" duplicates. So our problem is to find for each dataset all nearest neighbours. We (will) have 20 million datasets. This means a comparison of 20million times 20million comparisons, if there is no better approach. Thanks merkury R. Schierbeek schrieb: Hello merkury, You might try the instr function: * select s.naam ,s.key * from * temp *S * , * * *temp *E * where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 * * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) >0 * * * * ) * and * s.key *!= E.key OR: * * select s.naam ,s.key * * from * temp *S * * , * * *temp *E * * where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) >0 * * * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15)) > 0 * * * * * ) * * and * s.rowid != E.rowid Also you can remove "symbols" or numbers from a string like this : substr ( * *translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') *,1, nvl(p_length, LENGTH (p_text)) *); But there are many tools on the market; google for Duplicating or De-duplicating tool. Met vriendelijke groeten * *Roelof Schierbeek , NL ----- Original Message ----- From: "merkury" <david.oberm... (AT) idealo (DOT) de Newsgroups: comp.databases.oracle.tools Sent: Friday, August 15, 2008 8:08 PM Subject: near duplicates in short text fields Hi, can anybody tell me how to find near duplicates in a large amount (20 million) short text labels? Is there any database tool which does just this? I give you some examples: not near: Rugby Polo - black/white - S; (Angebot von Kabelmeister) Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) near: Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) Shirt Striped - aqua/white - S; (Angebot von) near: 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74) 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) near: * 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74) * 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74) Thanks merkury- Hide quoted text - - Show quoted text - |
![]() |
| Thread Tools | |
| Display Modes | |
| |