dbTalk Databases Forums  

near duplicates in short text fields

comp.databases.oracle.tools comp.databases.oracle.tools


Discuss near duplicates in short text fields in the comp.databases.oracle.tools forum.



Reply
 
Thread Tools Display Modes
  #41  
Old   
DA Morgan
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-19-2008 , 08:07 AM






Shakespeare wrote:
Quote:
"DA Morgan" <damorgan (AT) psoug (DOT) org> schreef in bericht
news:1219086595.356745 (AT) bubbleator (DOT) drizzle.com...
merkury wrote:
Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1,
nvl(p_length, LENGTH (p_text))
);

But there are many tools on the market; google for Duplicating or
De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL

----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury
Look at the UTL_MATCH built in package. It contains an API for both
the JARO WANKLER and LEVENSHTEIN algorithms.
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org

I hope you mean Jaro Winkler...

Shakespeare
It's a typo ... That's my story ... And I'm sticking with it. <g>
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org


Reply With Quote
  #42  
Old   
DA Morgan
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-19-2008 , 08:07 AM






Shakespeare wrote:
Quote:
"DA Morgan" <damorgan (AT) psoug (DOT) org> schreef in bericht
news:1219086595.356745 (AT) bubbleator (DOT) drizzle.com...
merkury wrote:
Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1,
nvl(p_length, LENGTH (p_text))
);

But there are many tools on the market; google for Duplicating or
De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL

----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury
Look at the UTL_MATCH built in package. It contains an API for both
the JARO WANKLER and LEVENSHTEIN algorithms.
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org

I hope you mean Jaro Winkler...

Shakespeare
It's a typo ... That's my story ... And I'm sticking with it. <g>
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org


Reply With Quote
  #43  
Old   
DA Morgan
 
Posts: n/a

Default Re: near duplicates in short text fields - 08-19-2008 , 08:07 AM



Shakespeare wrote:
Quote:
"DA Morgan" <damorgan (AT) psoug (DOT) org> schreef in bericht
news:1219086595.356745 (AT) bubbleator (DOT) drizzle.com...
merkury wrote:
Hi Roelof Schierbeek,


Thank you for your reply.

I am sorry but your solution will not work on our datasets.
The first 15 characters may occur in each string even if they are not
"near".
What I need is, that many words must coincide and many 2word or 3word
combinations (maybe even more) should be the same. I guess this is more
likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near"
duplicates.

So our problem is to find for each dataset all nearest neighbours. We
(will) have 20 million datasets. This means a comparison of 20million
times 20million comparisons, if there is no better approach.


Thanks

merkury




R. Schierbeek schrieb:

Hello merkury,

You might try the instr function:

select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.key != E.key

OR:
select s.naam ,s.key
from temp S
, temp E
where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
)
and s.rowid != E.rowid

Also you can remove "symbols" or numbers from a string like this :
substr (

translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1,
nvl(p_length, LENGTH (p_text))
);

But there are many tools on the market; google for Duplicating or
De-duplicating tool.

Met vriendelijke groeten

Roelof Schierbeek , NL

----- Original Message ----- From: "merkury" <david.obermann (AT) idealo (DOT) de
Newsgroups: comp.databases.oracle.tools
Sent: Friday, August 15, 2008 8:08 PM
Subject: near duplicates in short text fields



Hi,


can anybody tell me how to find near duplicates in a large amount (20
million) short text labels?

Is there any database tool which does just this?

I give you some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
jeanspoint74)



Thanks

merkury
Look at the UTL_MATCH built in package. It contains an API for both
the JARO WANKLER and LEVENSHTEIN algorithms.
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org

I hope you mean Jaro Winkler...

Shakespeare
It's a typo ... That's my story ... And I'm sticking with it. <g>
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.