dbTalk Databases Forums  

near duplicates in short text fields

comp.databases comp.databases


Discuss near duplicates in short text fields in the comp.databases forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
merkury
 
Posts: n/a

Default near duplicates in short text fields - 08-15-2008 , 01:05 PM






Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury

Reply With Quote
  #2  
Old   
Tonkuma
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-16-2008 , 07:28 AM






1) Could you explain the rules to judge "near duplicates".
I couldn't understand the rules from your examples.
2) You showed...
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

In the two example, following text is common.
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

Is it mean follwing two texts are near?
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

3) Anyhow, SOUNDEX came in my mind first.
But, SOUNDEX may not be a solution. Because it returns different
results for your second case.
Like this:
------------------------------ Commands Entered
------------------------------
SELECT k
, CAST( CASE FLOOR(k)
WHEN k THEN
text
ELSE SOUNDEX(text)
END AS VARCHAR(10)
) "Judgement"
FROM (VALUES
(1.0, 'not near:' )
,(1.1, 'Rugby Polo - black/white - S; (Angebot von
Kabelmeister)' )
,(1.2, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.0, 'near:' )
,(2.1, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.2, 'Shirt Striped - aqua/white - S; (Angebot von)' )
,(3.0, 'near:' )
,(3.1, '301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L
(eBay Shop jeanspoint74)' )
,(3.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.0, 'near:' )
,(4.1, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M
(eBay Shop eanspoint74)' )
) T(k, text)
;
------------------------------------------------------------------------------

K Judgement
---- ----------
1.0 not near:
1.1 R211
1.2 R212
2.0 near:
2.1 R212
2.2 S632
3.0 near:
3.1 L621
3.2 L621
4.0 near:
4.1 L621
4.2 L621

12 record(s) selected.

Reply With Quote
  #3  
Old   
Tonkuma
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-16-2008 , 07:28 AM



1) Could you explain the rules to judge "near duplicates".
I couldn't understand the rules from your examples.
2) You showed...
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

In the two example, following text is common.
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

Is it mean follwing two texts are near?
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

3) Anyhow, SOUNDEX came in my mind first.
But, SOUNDEX may not be a solution. Because it returns different
results for your second case.
Like this:
------------------------------ Commands Entered
------------------------------
SELECT k
, CAST( CASE FLOOR(k)
WHEN k THEN
text
ELSE SOUNDEX(text)
END AS VARCHAR(10)
) "Judgement"
FROM (VALUES
(1.0, 'not near:' )
,(1.1, 'Rugby Polo - black/white - S; (Angebot von
Kabelmeister)' )
,(1.2, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.0, 'near:' )
,(2.1, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.2, 'Shirt Striped - aqua/white - S; (Angebot von)' )
,(3.0, 'near:' )
,(3.1, '301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L
(eBay Shop jeanspoint74)' )
,(3.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.0, 'near:' )
,(4.1, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M
(eBay Shop eanspoint74)' )
) T(k, text)
;
------------------------------------------------------------------------------

K Judgement
---- ----------
1.0 not near:
1.1 R211
1.2 R212
2.0 near:
2.1 R212
2.2 S632
3.0 near:
3.1 L621
3.2 L621
4.0 near:
4.1 L621
4.2 L621

12 record(s) selected.

Reply With Quote
  #4  
Old   
Tonkuma
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-16-2008 , 07:28 AM



1) Could you explain the rules to judge "near duplicates".
I couldn't understand the rules from your examples.
2) You showed...
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

In the two example, following text is common.
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

Is it mean follwing two texts are near?
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

3) Anyhow, SOUNDEX came in my mind first.
But, SOUNDEX may not be a solution. Because it returns different
results for your second case.
Like this:
------------------------------ Commands Entered
------------------------------
SELECT k
, CAST( CASE FLOOR(k)
WHEN k THEN
text
ELSE SOUNDEX(text)
END AS VARCHAR(10)
) "Judgement"
FROM (VALUES
(1.0, 'not near:' )
,(1.1, 'Rugby Polo - black/white - S; (Angebot von
Kabelmeister)' )
,(1.2, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.0, 'near:' )
,(2.1, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.2, 'Shirt Striped - aqua/white - S; (Angebot von)' )
,(3.0, 'near:' )
,(3.1, '301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L
(eBay Shop jeanspoint74)' )
,(3.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.0, 'near:' )
,(4.1, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M
(eBay Shop eanspoint74)' )
) T(k, text)
;
------------------------------------------------------------------------------

K Judgement
---- ----------
1.0 not near:
1.1 R211
1.2 R212
2.0 near:
2.1 R212
2.2 S632
3.0 near:
3.1 L621
3.2 L621
4.0 near:
4.1 L621
4.2 L621

12 record(s) selected.

Reply With Quote
  #5  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 03:38 AM



Tonkuma schrieb:
Quote:
1) Could you explain the rules to judge "near duplicates".
I couldn't understand the rules from your examples.
The examples are offers of different shops in the internet. The problem
is, that the description of the same article differ depending on the
shop. I want to find out the rules that describe it best, when humans
see that two of them describe the same product ( which differs in color
an size )

Maybe the rule is like this:
Many words are the same. And many of the (2- ore more) word combinations
are the same. Maybe I have to filter the size and the color out first.



Quote:
2) You showed...
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

In the two example, following text is common.
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

Is it mean follwing two texts are near?
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

yes they are near: same article in different sizes and colors (even in
the same shop)

Quote:
3) Anyhow, SOUNDEX came in my mind first.
But, SOUNDEX may not be a solution. Because it returns different
results for your second case.
Yes soundex might work for articles with many readable words, but not
with some technical articles with names consisting of consonants. There
it is more important that they do not sound similar but are exactly the
same.
I think an approach with a high percentage of "same" words an word
combinations would be better.
I heard of a shingling method ("shingle printing" etc by A. Z. Broder
and others. ) to do just this, but could not find an implementation to
try it out.


Thank you for your effort so far.

regards

merkury


Quote:
Like this:
------------------------------ Commands Entered
------------------------------
SELECT k
, CAST( CASE FLOOR(k)
WHEN k THEN
text
ELSE SOUNDEX(text)
END AS VARCHAR(10)
) "Judgement"
FROM (VALUES
(1.0, 'not near:' )
,(1.1, 'Rugby Polo - black/white - S; (Angebot von
Kabelmeister)' )
,(1.2, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.0, 'near:' )
,(2.1, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.2, 'Shirt Striped - aqua/white - S; (Angebot von)' )
,(3.0, 'near:' )
,(3.1, '301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L
(eBay Shop jeanspoint74)' )
,(3.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.0, 'near:' )
,(4.1, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M
(eBay Shop eanspoint74)' )
) T(k, text)
;
------------------------------------------------------------------------------

K Judgement
---- ----------
1.0 not near:
1.1 R211
1.2 R212
2.0 near:
2.1 R212
2.2 S632
3.0 near:
3.1 L621
3.2 L621
4.0 near:
4.1 L621
4.2 L621

12 record(s) selected.

Reply With Quote
  #6  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 03:38 AM



Tonkuma schrieb:
Quote:
1) Could you explain the rules to judge "near duplicates".
I couldn't understand the rules from your examples.
The examples are offers of different shops in the internet. The problem
is, that the description of the same article differ depending on the
shop. I want to find out the rules that describe it best, when humans
see that two of them describe the same product ( which differs in color
an size )

Maybe the rule is like this:
Many words are the same. And many of the (2- ore more) word combinations
are the same. Maybe I have to filter the size and the color out first.



Quote:
2) You showed...
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

In the two example, following text is common.
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

Is it mean follwing two texts are near?
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

yes they are near: same article in different sizes and colors (even in
the same shop)

Quote:
3) Anyhow, SOUNDEX came in my mind first.
But, SOUNDEX may not be a solution. Because it returns different
results for your second case.
Yes soundex might work for articles with many readable words, but not
with some technical articles with names consisting of consonants. There
it is more important that they do not sound similar but are exactly the
same.
I think an approach with a high percentage of "same" words an word
combinations would be better.
I heard of a shingling method ("shingle printing" etc by A. Z. Broder
and others. ) to do just this, but could not find an implementation to
try it out.


Thank you for your effort so far.

regards

merkury


Quote:
Like this:
------------------------------ Commands Entered
------------------------------
SELECT k
, CAST( CASE FLOOR(k)
WHEN k THEN
text
ELSE SOUNDEX(text)
END AS VARCHAR(10)
) "Judgement"
FROM (VALUES
(1.0, 'not near:' )
,(1.1, 'Rugby Polo - black/white - S; (Angebot von
Kabelmeister)' )
,(1.2, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.0, 'near:' )
,(2.1, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.2, 'Shirt Striped - aqua/white - S; (Angebot von)' )
,(3.0, 'near:' )
,(3.1, '301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L
(eBay Shop jeanspoint74)' )
,(3.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.0, 'near:' )
,(4.1, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M
(eBay Shop eanspoint74)' )
) T(k, text)
;
------------------------------------------------------------------------------

K Judgement
---- ----------
1.0 not near:
1.1 R211
1.2 R212
2.0 near:
2.1 R212
2.2 S632
3.0 near:
3.1 L621
3.2 L621
4.0 near:
4.1 L621
4.2 L621

12 record(s) selected.

Reply With Quote
  #7  
Old   
merkury
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 03:38 AM



Tonkuma schrieb:
Quote:
1) Could you explain the rules to judge "near duplicates".
I couldn't understand the rules from your examples.
The examples are offers of different shops in the internet. The problem
is, that the description of the same article differ depending on the
shop. I want to find out the rules that describe it best, when humans
see that two of them describe the same product ( which differs in color
an size )

Maybe the rule is like this:
Many words are the same. And many of the (2- ore more) word combinations
are the same. Maybe I have to filter the size and the color out first.



Quote:
2) You showed...
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

In the two example, following text is common.
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

Is it mean follwing two texts are near?
near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)

yes they are near: same article in different sizes and colors (even in
the same shop)

Quote:
3) Anyhow, SOUNDEX came in my mind first.
But, SOUNDEX may not be a solution. Because it returns different
results for your second case.
Yes soundex might work for articles with many readable words, but not
with some technical articles with names consisting of consonants. There
it is more important that they do not sound similar but are exactly the
same.
I think an approach with a high percentage of "same" words an word
combinations would be better.
I heard of a shingling method ("shingle printing" etc by A. Z. Broder
and others. ) to do just this, but could not find an implementation to
try it out.


Thank you for your effort so far.

regards

merkury


Quote:
Like this:
------------------------------ Commands Entered
------------------------------
SELECT k
, CAST( CASE FLOOR(k)
WHEN k THEN
text
ELSE SOUNDEX(text)
END AS VARCHAR(10)
) "Judgement"
FROM (VALUES
(1.0, 'not near:' )
,(1.1, 'Rugby Polo - black/white - S; (Angebot von
Kabelmeister)' )
,(1.2, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.0, 'near:' )
,(2.1, 'Rugby Shirt Striped - aqua/white - S; (Angebot von
Kabelmeister)' )
,(2.2, 'Shirt Striped - aqua/white - S; (Angebot von)' )
,(3.0, 'near:' )
,(3.1, '301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L
(eBay Shop jeanspoint74)' )
,(3.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.0, 'near:' )
,(4.1, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in
L (eBay Shop jeanspoint74)' )
,(4.2, '482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M
(eBay Shop eanspoint74)' )
) T(k, text)
;
------------------------------------------------------------------------------

K Judgement
---- ----------
1.0 not near:
1.1 R211
1.2 R212
2.0 near:
2.1 R212
2.2 S632
3.0 near:
3.1 L621
3.2 L621
4.0 near:
4.1 L621
4.2 L621

12 record(s) selected.

Reply With Quote
  #8  
Old   
--CELKO--
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 09:11 AM



http://en.wikipedia.org/wiki/Approxi...tring_matching

Then follow the links given there for some algorithms. This is best
done in procedural code, outside the DB. SQL was not meant as a text
handling language.

Reply With Quote
  #9  
Old   
--CELKO--
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 09:11 AM



http://en.wikipedia.org/wiki/Approxi...tring_matching

Then follow the links given there for some algorithms. This is best
done in procedural code, outside the DB. SQL was not meant as a text
handling language.

Reply With Quote
  #10  
Old   
--CELKO--
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-18-2008 , 09:11 AM



http://en.wikipedia.org/wiki/Approxi...tring_matching

Then follow the links given there for some algorithms. This is best
done in procedural code, outside the DB. SQL was not meant as a text
handling language.

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.