near duplicates in short text fields

comp.databases


toby
 
Re: near duplicates in short text fields - 08-19-2008, 08:23 AM

On Aug 15, 3:05 pm, merkury <david.oberm... (AT) idealo (DOT) de> wrote:
Quote:
Hi,

can anybody tell me how to find near duplicates among a large number (20
million) of short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)

near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

Thanks

merkury

I think a more fruitful direction is to exploit your domain knowledge
rather than to run a generic comparison algorithm on these strings. For
example, remove the words that you know are likely to refer to
non-distinguishing attributes (all colour names in whatever languages
you are using) - *then* test string equality; this handles the 2nd
example.
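
A rough sketch of the colour-stripping idea in Python, in case it helps. The colour list is made up for illustration and would really come from your own data. Note one extra assumption: the two labels in the 2nd example also differ in the leading number (301 vs 482), so this sketch treats purely numeric tokens as non-distinguishing too. The name `colour_free_key` is just a placeholder.

```python
import re

# Hypothetical colour vocabulary (English and German); a real list would be
# built from your own catalogue data.
COLOUR_WORDS = {"black", "white", "aqua", "blue", "blau", "schwarz", "weiss"}

def colour_free_key(label: str) -> str:
    """Lower-case the label, split it into word tokens, and drop tokens that
    are colour names or (assumption) purely numeric item numbers."""
    tokens = re.findall(r"\w+", label.lower())
    kept = [t for t in tokens if t not in COLOUR_WORDS and not t.isdigit()]
    return " ".join(kept)

# The 2nd "near" pair from the question.
a = "301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74)"
b = "482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)"

# The "not near" pair from the question.
c = "Rugby Polo - black/white - S; (Angebot von Kabelmeister)"
d = "Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)"

print(colour_free_key(a) == colour_free_key(b))  # near pair: keys match
print(colour_free_key(c) == colour_free_key(d))  # not-near pair: keys differ
```

With 20 million rows you would probably compute such a key once per row, store it in an indexed column, and let the database group on it instead of comparing labels pairwise.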

Your first example could be handled by 'regularising' synonymous terms
or removing redundant terms ('rugby shirt' == 'shirt' for your
purposes). Again, this should be based on domain knowledge.
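
Something along these lines could implement the 'regularising' step; treat it as a sketch under assumptions, not a finished tool. The word lists are hypothetical, and the code additionally assumes that the parenthesised seller note at the end of a label is not distinguishing, since the two labels in the first example differ there as well ("Angebot von Kabelmeister" vs "Angebot von").

```python
import re
from collections import defaultdict

# Hypothetical domain rules; real ones would come from knowing the catalogue.
REDUNDANT_WORDS = {"rugby"}                       # 'rugby shirt' == 'shirt'
SYNONYMS = {"tee": "shirt", "jersey": "shirt"}    # illustrative only

def normalise(label: str) -> str:
    """Build a comparison key: drop parenthesised seller notes (assumption),
    drop redundant words, and map synonyms onto one canonical term."""
    without_notes = re.sub(r"\([^)]*\)", " ", label.lower())
    tokens = re.findall(r"\w+", without_notes)
    kept = [SYNONYMS.get(t, t) for t in tokens if t not in REDUNDANT_WORDS]
    return " ".join(kept)

def group_near_duplicates(labels):
    """Group labels by normalised key; return only groups with 2+ members."""
    groups = defaultdict(list)
    for label in labels:
        groups[normalise(label)].append(label)
    return [members for members in groups.values() if len(members) > 1]

labels = [
    "Rugby Polo - black/white - S; (Angebot von Kabelmeister)",
    "Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)",
    "Shirt Striped - aqua/white - S; (Angebot von)",
]

for group in group_near_duplicates(labels):
    print(group)
```

Grouping by key is a single pass over the data, so it avoids comparing every label against every other one, which matters at 20 million rows.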

Hope this helps.
--Toby

