dbTalk Databases Forums  

Cleaning data - getting rid of duplicate rows

comp.databases.ms-sqlserver comp.databases.ms-sqlserver


Discuss Cleaning data - getting rid of duplicate rows in the comp.databases.ms-sqlserver forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
teddysnips@hotmail.com
 
Posts: n/a

Default Cleaning data - getting rid of duplicate rows - 08-20-2007 , 06:47 AM






Data Cleansing:

In the example (SQL Server DDL below) there are two tables -
ExampleCustomer, and ExampleCar.

ExampleCar is a lookup table. The ExampleCustomer table has a foreign
key to the ExampleCar table. There should be two rows in there, "Ford
Focus", and "Ford Galaxy". This table populates a drop-down list in
the application, ordered by CarID. So a user is adding a new Customer
record, selects a car from the list, and hey Presto!

Someone came along and messed with the data, so now there are two rows
for each car. Yes, I know I should have set the CarType column
unique, but I didn't.

The user has done what users always do - work around the problem. So
she has added some customers using the first occurrence of the car in
the drop-down list, except for the last record when she unaccountably
selected the second occurrence.

I have been tasked with cleaning up the data. The object is to delete
all duplicate rows, but without leaving any orphaned references. At
the end of the exercise the rows in the tables should be:

CarID CarType
1 Ford Focus
3 Ford Galaxy

PersonName CarID
Adam Smith 1
Ben Smith 1
Colin Smith 3
Dave Smith 3
Edward Smith 3
Fred Smith 3

Note that the CarID for Fred Smith has been updated to the
predominating row for the Ford Galaxy type.

Is this clear? Can anyone see a way of doing this in a small number
of queries? I could think of a way of doing it using cursors etc.,
but the prevailing wisdom is that there's ALWAYS a way to do it
without using cursors.

Thanks

Edward


DDL
=========================
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCar]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCar]
GO

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCustomer]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCustomer]
GO

CREATE TABLE [dbo].[ExampleCar] (
[CarID] [int] IDENTITY (1, 1) NOT NULL ,
[CarType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ExampleCustomer] (
[PersonName] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL ,
[CarID] [int] NULL
) ON [PRIMARY]
GO

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Adam Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Ben Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Colin Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Dave Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Edward Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Fred Smith', 4)


Reply With Quote
  #2  
Old   
jhofmeyr@googlemail.com
 
Posts: n/a

Default Re: Cleaning data - getting rid of duplicate rows - 08-20-2007 , 07:07 AM






On 20 Aug, 12:47, teddysn... (AT) hotmail (DOT) com wrote:
Quote:
Data Cleansing:

In the example (SQL Server DDL below) there are two tables -
ExampleCustomer, and ExampleCar.

ExampleCar is a lookup table. The ExampleCustomer table has a foreign
key to the ExampleCar table. There should be two rows in there, "Ford
Focus", and "Ford Galaxy". This table populates a drop-down list in
the application, ordered by CarID. So a user is adding a new Customer
record, selects a car from the list, and hey Presto!

Someone came along and messed with the data, so now there are two rows
for each car. Yes, I know I should have set the CarType column
unique, but I didn't.

The user has done what users always do - work around the problem. So
she has added some customers using the first occurrence of the car in
the drop-down list, except for the last record when she unaccountably
selected the second occurrence.

I have been tasked with cleaning up the data. The object is to delete
all duplicate rows, but without leaving any orphaned references. At
the end of the exercise the rows in the tables should be:

CarID CarType
1 Ford Focus
3 Ford Galaxy

PersonName CarID
Adam Smith 1
Ben Smith 1
Colin Smith 3
Dave Smith 3
Edward Smith 3
Fred Smith 3

Note that the CarID for Fred Smith has been updated to the
predominating row for the Ford Galaxy type.

Is this clear? Can anyone see a way of doing this in a small number
of queries? I could think of a way of doing it using cursors etc.,
but the prevailing wisdom is that there's ALWAYS a way to do it
without using cursors.

Thanks

Edward

DDL
=========================
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCar]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCar]
GO

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCustomer]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCustomer]
GO

CREATE TABLE [dbo].[ExampleCar] (
[CarID] [int] IDENTITY (1, 1) NOT NULL ,
[CarType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ExampleCustomer] (
[PersonName] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL ,
[CarID] [int] NULL
) ON [PRIMARY]
GO

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Adam Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Ben Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Colin Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Dave Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Edward Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Fred Smith', 4)

SELECT
ecust.PersonName
, (
SELECT MIN(CarID)
FROM [ExampleCar]
WHERE [CarType] IN (
SELECT [CarType]
FROM [ExampleCar]
WHERE CarID = ecust.CarID
)
) AS CarID
FROM [ExampleCustomer] ecust

Not pretty but that should get you started.

Good Luck
J



Reply With Quote
  #3  
Old   
Erland Sommarskog
 
Posts: n/a

Default Re: Cleaning data - getting rid of duplicate rows - 08-20-2007 , 04:52 PM



(teddysnips (AT) hotmail (DOT) com) writes:
Quote:
I have been tasked with cleaning up the data. The object is to delete
all duplicate rows, but without leaving any orphaned references. At
the end of the exercise the rows in the tables should be:

CarID CarType
1 Ford Focus
3 Ford Galaxy

PersonName CarID
Adam Smith 1
Ben Smith 1
Colin Smith 3
Dave Smith 3
Edward Smith 3
Fred Smith 3

Note that the CarID for Fred Smith has been updated to the
predominating row for the Ford Galaxy type.
UPDATE ExampleCustomer
SET CarID = Mincar.CarID
FROM ExampleCustomer C
JOIN ExampleCar Car ON C.CarID = Car.CarID
JOIN (SELECT CarType, CarID = MIN(CarID)
FROM ExampleCar
GROUP BY CarType) AS Mincar ON Mincar.CarType = Car.CarType

DELETE ExampleCar
FROM ExampleCar C
WHERE EXISTS (SELECT *
FROM ExampleCar C1
WHERE C1.CarType = C.CarType
AND C.CarID > C1.CarID)


There is no foreign declared in the example, but I hope that there is
in the real case.

Of course, risk is that your uzer have spelt the duplicates differently.

--
Erland Sommarskog, SQL Server MVP, esquel (AT) sommarskog (DOT) se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.