dbTalk Databases Forums  

near duplicates in short text fields

comp.databases.oracle.tools comp.databases.oracle.tools


Discuss near duplicates in short text fields in the comp.databases.oracle.tools forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
merkury
 
Posts: n/a

Default near duplicates in short text fields - 08-15-2008 , 01:08 PM






Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury

Reply With Quote
  #2  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 07:59 AM






Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Quote:
Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))

);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL


----- Original Message -----
From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury


Reply With Quote
  #3  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 07:59 AM



Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Quote:
Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))

);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL


----- Original Message -----
From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury


Reply With Quote
  #4  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 07:59 AM



Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Quote:
Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))

);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL


----- Original Message -----
From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury


Reply With Quote
  #5  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 07:59 AM



Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Quote:
Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))

);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL


----- Original Message -----
From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury


Reply With Quote
  #6  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 07:59 AM



Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Quote:
Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))

);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL


----- Original Message -----
From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury


Reply With Quote
  #7  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 07:59 AM



Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Quote:
Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))

);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL


----- Original Message -----
From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury


Reply With Quote
  #8  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 07:59 AM



Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Quote:
Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))

);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL


----- Original Message -----
From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury


Reply With Quote
  #9  
Old   
rogergorden...@gmail.com
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 10:45 AM



On Aug 18, 8:59*am, merkury <david.oberm... (AT) idealo (DOT) de> wrote:
Quote:
Hi Roelof Schierbeek,

Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.

Thanks

merkury

R. Schierbeek schrieb:



Hello merkury,

You might try the instr function:

* select s.naam ,s.key
* from * temp *S
* , * * *temp *E
* where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
* * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) >0
* * * * )
* and * s.key *!= E.key

OR:
* * select s.naam ,s.key
* * from * temp *S
* * , * * *temp *E
* * where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) >0
* * * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15)) > 0
* * * * * )
* * and * s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

* *translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') *,1, nvl(p_length, LENGTH (p_text))

*);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

* *Roelof Schierbeek , NL

----- Original Message -----
From: "merkury" <david.oberm... (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields

Hi,

can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)

near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
* 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
* 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

Thanks

merkury- Hide quoted text -

- Show quoted text -
You'll most likely have to write some sort of algorythm yourself to
come up with matches and near matches.
I had to do something similar while merging a user list that was store
in 2 separate database tables, (different spellings of last names,
addresses not fully filled in etc.

Good luck.

Roger Gorden
Senior Database Programmer/Analyst
John Wiley and Sons, Inc.


Reply With Quote
  #10  
Old   
rogergorden...@gmail.com
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 10:45 AM



On Aug 18, 8:59*am, merkury <david.oberm... (AT) idealo (DOT) de> wrote:
Quote:
Hi Roelof Schierbeek,

Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.

Thanks

merkury

R. Schierbeek schrieb:



Hello merkury,

You might try the instr function:

* select s.naam ,s.key
* from * temp *S
* , * * *temp *E
* where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
* * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) >0
* * * * )
* and * s.key *!= E.key

OR:
* * select s.naam ,s.key
* * from * temp *S
* * , * * *temp *E
* * where ( *instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) >0
* * * * * or instr( upper(E.naam) ,substr(upper(s.naam),1,15)) > 0
* * * * * )
* * and * s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

* *translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') *,1, nvl(p_length, LENGTH (p_text))

*);

But there are many tools on the market; google for Duplicating or De-duplicating tool.

Met vriendelijke groeten

* *Roelof Schierbeek , NL

----- Original Message -----
From: "merkury" <david.oberm... (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields

Hi,

can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)

near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
* 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
* 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

Thanks

merkury- Hide quoted text -

- Show quoted text -
You'll most likely have to write some sort of algorythm yourself to
come up with matches and near matches.
I had to do something similar while merging a user list that was store
in 2 separate database tables, (different spellings of last names,
addresses not fully filled in etc.

Good luck.

Roger Gorden
Senior Database Programmer/Analyst
John Wiley and Sons, Inc.


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.