dbTalk Databases Forums  

Re: Normalizing and Removing Duplicates (and near duplicates)

comp.databases.ms-access comp.databases.ms-access


Discuss Re: Normalizing and Removing Duplicates (and near duplicates) in the comp.databases.ms-access forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Tom van Stiphout
 
Posts: n/a

Default Re: Normalizing and Removing Duplicates (and near duplicates) - 07-09-2003 , 10:15 PM






On 9 Jul 2003 16:51:19 -0700, pietlinden (AT) hotmail (DOT) com (Pieter Linden)
wrote:

I recently had a similar need, and found the Ratcliff/Obershelp
algorithm to be most useful. I'm sure you can find the code on the
web, like I did.

-Tom.


Quote:
Sorry to ask what seems like such a simple question, but...

I was just given a database that badly needs to be redesigned... looks
like someone for whom Excel got too big...

Basic structure:
Client Info (Name, Address, Phone...)
Repair Info (Item, Service Date Info, some notes)

the problem is that the Client Info has duplicates and "semi"
duplicates. Well, if you look at it, you can tell the addresses are
the same, but you can't use SQL to identify them with complete
consistency. (The differences are slight and in inconsistent places).

What's the easiest way to go about fixing this?
Creating a list of unique Customers is easy.
Creating a list of the latest addresses is easy.

How would I go about identifying likely duplicates? (I have about
15,000 records in the database, so doing it manually would take a LONG
time!) I'd just like to get the bulk of the work done using SQL. I
know I'll have to do some of it manually...

Any pointers? (I'd loan ya mine, but he's out hunting...)

Thanks!
Pieter


Reply With Quote
  #2  
Old   
Lyle Fairfield
 
Posts: n/a

Default Re: Normalizing and Removing Duplicates (and near duplicates) - 07-10-2003 , 06:47 AM






pietlinden (AT) hotmail (DOT) com (Pieter Linden) wrote in
news:bf31e41b.0307091551.27739b14 (AT) posting (DOT) google.com:

Quote:
Sorry to ask what seems like such a simple question, but...

I was just given a database that badly needs to be redesigned... looks
like someone for whom Excel got too big...

Basic structure:
Client Info (Name, Address, Phone...)
Repair Info (Item, Service Date Info, some notes)

the problem is that the Client Info has duplicates and "semi"
duplicates. Well, if you look at it, you can tell the addresses are
the same, but you can't use SQL to identify them with complete
consistency. (The differences are slight and in inconsistent places).

What's the easiest way to go about fixing this?
Creating a list of unique Customers is easy.
Creating a list of the latest addresses is easy.

How would I go about identifying likely duplicates? (I have about
15,000 records in the database, so doing it manually would take a LONG
time!) I'd just like to get the bulk of the work done using SQL. I
know I'll have to do some of it manually...

Any pointers? (I'd loan ya mine, but he's out hunting...)

Thanks!
Pieter
The Soundex Function produces a four character string. If the Soundex
result of two strings is identical then one may conclude that the two
strings "sound the same".
But testing for identical Soundex results may not select pairs of strings
which are very similar. The Difference Function counts the number of
similarities in the four characters produced by Soundex. A result of 4
means they are all the same. 3 may be a suitable criterion for considering
that strings refer to the same thing.
Both FoxPro and SQL Server (including MSDE) have these functions available.
If you have either of these you might wish to port your data to them,
process it and then port it back. I believe, but am not sure that you could
use FoxPro to process your JET data directly.
If you don't have FoxPro or SQL Server, I expect you could find these
functions as VBA in a Google search of CDMA.

I once had the task of standardizing many thousands of records entered
freehand in Cobol as we moved to something else, can't remember what. I was
astounded at how many ways "Burlington" could be spelled. Then I started on
"Mississauga"!

--
Lyle



Reply With Quote
  #3  
Old   
David W. Fenton
 
Posts: n/a

Default Re: Normalizing and Removing Duplicates (and near duplicates) - 07-10-2003 , 09:11 AM



tom7744 (AT) no (DOT) spam.cox.net (Tom van Stiphout) wrote in
<8empgvohl4odmffg9vap0ia74eer5j2aoh (AT) 4ax (DOT) com>:

Quote:
I recently had a similar need, and found the Ratcliff/Obershelp
algorithm to be most useful. I'm sure you can find the code on
the web, like I did.
We had a recent discussion of this here in CDMA:

http://groups.google.com/groups?selm...waynet%4024.16
8.128.90

(all on one line, of course)

The post that URL points to has code for the Simil algorithm, and
implementation of the Ratcliff/Obershelp algorithm.

For what it's worth, I was unimpressed with the results. I didn't
find the algorithm logical in its outcome.

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc


Reply With Quote
  #4  
Old   
Pieter Linden
 
Posts: n/a

Default Re: Normalizing and Removing Duplicates (and near duplicates) - 07-29-2003 , 09:38 PM



I have SQL Server, so I guess I'll upsize, clean and then export.
Part of the problem is that while there may be spelling errors, there
are also slight variations on name spellings... like the address might
change or whatever. I'm sorely tempted to tell the people who entered
the data to remove the duplicates from the 15,000 records manually...
then they won't do that kind of thing again!
(but then I don't think they'd want my help again, either...)

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.