dbTalk Databases Forums  

delete dupes with a unique id using a temp table

comp.databases.mysql comp.databases.mysql


Discuss delete dupes with a unique id using a temp table in the comp.databases.mysql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
jr
 
Posts: n/a

Default delete dupes with a unique id using a temp table - 01-03-2011 , 06:37 PM






I need to delete the dupes across bu's. The records are unique based
on bu+ndc but there are 400+ duplicates..
There is a sequential id that is unique. I am trying to remove these
400+ records.

1. SELECT *
FROM ci
GROUP BY bu,ndc
HAVING COUNT(concat(bu,ndc) )>1

This query yields 400+ dupes which I need to remove.


2.
I could do this query in Oracle but not in MySQL.
You cannot delete from the same table as in the subquery.


DELETE FROM ci b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc



3. I need to create a temp table with the unique ids in order to
compare the ci table to the duplicates.
I create the temp table with the SELECT from above that now has the
duplicate records in it with the id.
CREATE TABLE ci_tmp AS
(SELECT c.id,c.bu,c.zonenm,c.ndc
FROM ci c
GROUP BY c.bu,c.ndc
HAVING COUNT(concat(c.bu,c.ndc) )>1)

4. Next I have to run my delete query matching against the ids in the
temp table.
AT this point I am not sure exactly how to do this or what I need to
delete.
This is what I have but it isn't right because it doesn't delete any
records.


DELETE FROM cart_inventory
WHERE id IN (select id
FROM cart_inventory_tmp
GROUP BY bu,nationaldrugcode
HAVING COUNT(concat(bu,nationaldrugcode) )>1)


I also tried deleting using max(id) but I don't know if it requires
a self-join and I think the above process might be easier?

delete FROM cart_inventory as as a
max(id)
where concat(a.bu, a.ndc)=concat(b.bu,b.ndc)

If you could explain the process of Query#4 what I am trying to match
against that might help but I basically understand
that you need to get the id of one of the duplicates from the temp
table and match it against the ci table and delete that id.
It seems like it should work but it doesn't because I run query # 4
and do the select for the dupes and they are all still there..



thanks,

Reply With Quote
  #2  
Old   
onedbguru
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-03-2011 , 07:13 PM






On Jan 3, 7:37*pm, jr <jlro... (AT) yahoo (DOT) com> wrote:
Quote:
I need to delete the dupes across bu's. *The records are unique based
on bu+ndc but there are 400+ duplicates..
There is a sequential id that is unique. *I am trying to remove these
400+ records.

1. *SELECT *
FROM ci
GROUP BY bu,ndc
HAVING COUNT(concat(bu,ndc) )>1

This query yields 400+ dupes which I need to remove.

2.
I could do this query in Oracle but not in MySQL.
You cannot delete from the same table as in the subquery.

*DELETE FROM ci *b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING *COUNT(1)>1
) a
where *a.bu=b.bu
and a.ndc = b.ndc

3. * I need to create a temp table with the unique ids in order to
compare the ci table to the duplicates.
I create the temp table with the SELECT from above that now has the
duplicate records in it with the id.
CREATE TABLE *ci_tmp AS
(SELECT c.id,c.bu,c.zonenm,c.ndc
FROM ci c
GROUP BY c.bu,c.ndc
HAVING *COUNT(concat(c.bu,c.ndc) )>1)

4. *Next I have to run my delete query matching against the ids in the
temp table.
AT this point I am not sure exactly how to do this or what I need to
delete.
This is what I have but it isn't right because it doesn't delete any
records.

DELETE FROM cart_inventory
WHERE id IN (select id
FROM cart_inventory_tmp
GROUP BY bu,nationaldrugcode
HAVING COUNT(concat(bu,nationaldrugcode) )>1)

I also tried * deleting using max(id) but I don't know if it requires
a self-join and I think the above process might be easier?

delete *FROM cart_inventory as as a
max(id)
where *concat(a.bu, a.ndc)=concat(b.bu,b.ndc)

If you could explain the process of Query#4 what I am trying to match
against that might help but I basically understand
that you need to get the id of one of the duplicates from the temp
table and match it against the ci table and delete that id.
It seems like it should work but it doesn't because I run query # 4
and do the select for the dupes and they are all still there..

thanks,

1) are the rows truely DUPLICATE or do you just have a bu,ndc
duplicate - are there other columns and are they also duplicate or are
they unique? If "unique", which one is correct?

See: http://dev.mysql.com/doc/refman/5.5/en/alter-table.html and
search for IGNORE

"IGNORE is a MySQL extension to standard SQL. It controls how ALTER
TABLE works if there are duplicates on unique keys in the new table or
if warnings occur when strict mode is enabled. If IGNORE is not
specified, the copy is aborted and rolled back if duplicate-key errors
occur. If IGNORE is specified, only the first row is used of rows with
duplicates on a unique key, The other conflicting rows are deleted.
Incorrect values are truncated to the closest matching acceptable
value."

ALTER IGNORE TABLE ci ADD UNIQUE KEY (bu,ndc ) ;

How it knows which one is "first" I do not know...

I would test this before actually executingto ensure you are purging
the correct data.

Next step - figure out why you are getting unwanted dupes and fix the
problem.

Reply With Quote
  #3  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-03-2011 , 07:27 PM



On Jan 3, 5:13*pm, onedbguru <onedbg... (AT) yahoo (DOT) com> wrote:
Quote:
On Jan 3, 7:37*pm, jr <jlro... (AT) yahoo (DOT) com> wrote:









I need to delete the dupes across bu's. *The records are unique based
on bu+ndc but there are 400+ duplicates..
There is a sequential id that is unique. *I am trying to remove these
400+ records.

1. *SELECT *
FROM ci
GROUP BY bu,ndc
HAVING COUNT(concat(bu,ndc) )>1

This query yields 400+ dupes which I need to remove.

2.
I could do this query in Oracle but not in MySQL.
You cannot delete from the same table as in the subquery.

*DELETE FROM ci *b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING *COUNT(1)>1
) a
where *a.bu=b.bu
and a.ndc = b.ndc

3. * I need to create a temp table with the unique ids in order to
compare the ci table to the duplicates.
I create the temp table with the SELECT from above that now has the
duplicate records in it with the id.
CREATE TABLE *ci_tmp AS
(SELECT c.id,c.bu,c.zonenm,c.ndc
FROM ci c
GROUP BY c.bu,c.ndc
HAVING *COUNT(concat(c.bu,c.ndc) )>1)

4. *Next I have to run my delete query matching against the ids in the
temp table.
AT this point I am not sure exactly how to do this or what I need to
delete.
This is what I have but it isn't right because it doesn't delete any
records.

DELETE FROM cart_inventory
WHERE id IN (select id
FROM cart_inventory_tmp
GROUP BY bu,nationaldrugcode
HAVING COUNT(concat(bu,nationaldrugcode) )>1)

I also tried * deleting using max(id) but I don't know if it requires
a self-join and I think the above process might be easier?

delete *FROM cart_inventory as as a
max(id)
where *concat(a.bu, a.ndc)=concat(b.bu,b.ndc)

If you could explain the process of Query#4 what I am trying to match
against that might help but I basically understand
that you need to get the id of one of the duplicates from the temp
table and match it against the ci table and delete that id.
It seems like it should work but it doesn't because I run query # 4
and do the select for the dupes and they are all still there..

thanks,

1) are the rows truely DUPLICATE or do you just have a bu,ndc
duplicate - are there other columns and are they also duplicate or are
they unique? *If "unique", which one is correct?

See:http://dev.mysql.com/doc/refman/5.5/...-table.htmland
search for IGNORE

"IGNORE is a MySQL extension to standard SQL. It controls how ALTER
TABLE works if there are duplicates on unique keys in the new table or
if warnings occur when strict mode is enabled. If IGNORE is not
specified, the copy is aborted and rolled back if duplicate-key errors
occur. If IGNORE is specified, only the first row is used of rows with
duplicates on a unique key, The other conflicting rows are deleted.
Incorrect values are truncated to the closest matching acceptable
value."

ALTER IGNORE TABLE ci ADD UNIQUE KEY (bu,ndc ) ;

How it knows which one is "first" I do not know...

I would test this before actually executingto ensure you are purging
the correct data.

Next step - figure out why you are getting unwanted dupes and fix the
problem.
it is bu,ndc duplicates I am trying to get rid of, this is data
cleansing, the first or the 2nd duplicate doesn't matter, there is no
duplicate on the unique key
only duplicate on the bu,ndc. The concat(bu,ndc) is how I determine
the row is unique. The id was assigned as I uploaded the data.
Should I just use the max(id) if so how do I set up the self-join?
thanks,

Reply With Quote
  #4  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-03-2011 , 07:59 PM



On Jan 3, 5:13*pm, onedbguru <onedbg... (AT) yahoo (DOT) com> wrote:
Quote:
On Jan 3, 7:37*pm, jr <jlro... (AT) yahoo (DOT) com> wrote:









I need to delete the dupes across bu's. *The records are unique based
on bu+ndc but there are 400+ duplicates..
There is a sequential id that is unique. *I am trying to remove these
400+ records.

1. *SELECT *
FROM ci
GROUP BY bu,ndc
HAVING COUNT(concat(bu,ndc) )>1

This query yields 400+ dupes which I need to remove.

2.
I could do this query in Oracle but not in MySQL.
You cannot delete from the same table as in the subquery.

*DELETE FROM ci *b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING *COUNT(1)>1
) a
where *a.bu=b.bu
and a.ndc = b.ndc

3. * I need to create a temp table with the unique ids in order to
compare the ci table to the duplicates.
I create the temp table with the SELECT from above that now has the
duplicate records in it with the id.
CREATE TABLE *ci_tmp AS
(SELECT c.id,c.bu,c.zonenm,c.ndc
FROM ci c
GROUP BY c.bu,c.ndc
HAVING *COUNT(concat(c.bu,c.ndc) )>1)

4. *Next I have to run my delete query matching against the ids in the
temp table.
AT this point I am not sure exactly how to do this or what I need to
delete.
This is what I have but it isn't right because it doesn't delete any
records.

DELETE FROM cart_inventory
WHERE id IN (select id
FROM cart_inventory_tmp
GROUP BY bu,nationaldrugcode
HAVING COUNT(concat(bu,nationaldrugcode) )>1)

I also tried * deleting using max(id) but I don't know if it requires
a self-join and I think the above process might be easier?

delete *FROM cart_inventory as as a
max(id)
where *concat(a.bu, a.ndc)=concat(b.bu,b.ndc)

If you could explain the process of Query#4 what I am trying to match
against that might help but I basically understand
that you need to get the id of one of the duplicates from the temp
table and match it against the ci table and delete that id.
It seems like it should work but it doesn't because I run query # 4
and do the select for the dupes and they are all still there..

thanks,

1) are the rows truely DUPLICATE or do you just have a bu,ndc
duplicate - are there other columns and are they also duplicate or are
they unique? *If "unique", which one is correct?

See:http://dev.mysql.com/doc/refman/5.5/...-table.htmland
search for IGNORE

"IGNORE is a MySQL extension to standard SQL. It controls how ALTER
TABLE works if there are duplicates on unique keys in the new table or
if warnings occur when strict mode is enabled. If IGNORE is not
specified, the copy is aborted and rolled back if duplicate-key errors
occur. If IGNORE is specified, only the first row is used of rows with
duplicates on a unique key, The other conflicting rows are deleted.
Incorrect values are truncated to the closest matching acceptable
value."

ALTER IGNORE TABLE ci ADD UNIQUE KEY (bu,ndc ) ;

How it knows which one is "first" I do not know...

I would test this before actually executingto ensure you are purging
the correct data.

Next step - figure out why you are getting unwanted dupes and fix the
problem.
I tried the ALTER IGNORE and changed the unique key to the (bu,ndc)
on a backup table.
It changed the number of records in ci by 169 so I am afraid to do it
like that.
If I could figure out how to delete using the max(id) I would do it
that way.

Reply With Quote
  #5  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-04-2011 , 02:35 AM



On 2011-01-04 01:37, jr wrote:
[...]
Quote:
2.
I could do this query in Oracle but not in MySQL.
You cannot delete from the same table as in the subquery.


DELETE FROM ci b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

You can hide the table at a deeper nesting level, but it will probably
be very slow. Another option is to delete from a join (non-standard)
like (untested):

delete b.*
from ci as b
join (
SELECT bu,ndc
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) as a
on a.bu=b.bu
and a.ndc = b.ndc

If bu, ndu is a candidate key, you might want to add a unique constraint:

alter table ci add constraint ... unique (bu, ndc);

to enforce this.

[...]

/Lennart

Reply With Quote
  #6  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-04-2011 , 09:37 AM



On Jan 4, 12:35*am, Lennart Jonsson <erik.lennart.jons... (AT) gmail (DOT) com>
wrote:
Quote:
On 2011-01-04 01:37, jr wrote:
[...]

2.
I could do this query in Oracle but not in MySQL.
You cannot delete from the same table as in the subquery.

*DELETE FROM ci *b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING *COUNT(1)>1
) a
where *a.bu=b.bu
and a.ndc = b.ndc

You can hide the table at a deeper nesting level, but it will probably
be very slow. Another option is to delete from a join (non-standard)
like (untested):

delete b.*
from ci as b
join (
* SELECT bu,ndc
* FROM ci
* GROUP BY bu,ndc
* HAVING *COUNT(1)>1
) as a
* * on a.bu=b.bu
* *and a.ndc = b.ndc

If bu, ndu is a candidate key, you might want to add a unique constraint:

* * alter table ci add constraint ... unique (bu, ndc);

to enforce this.

[...]

/Lennart
The query is beautiful although it deleted 1085 rows. When I did the
first query to locate duplicates it found only 487. Why is
that? ?????
Where did the extra
This is the query I used to find the dupes. Is that wrong or is your
query wrong? Did it delete the dupe and the original record as well?

select bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,nationaldrugcode)) > 1



I wouldn't hide the table at a deeper level because users are using
the web page and it is somewhat slow already because of a UNION
query. I could try it if all else fails. I will try the unique
constraint.

thanks,

Reply With Quote
  #7  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-04-2011 , 03:19 PM



On 2011-01-04 16:37, jr wrote:
[...]
Quote:
The query is beautiful although it deleted 1085 rows. When I did the
first query to locate duplicates it found only 487. Why is
that? ?????
Where did the extra
This is the query I used to find the dupes. Is that wrong or is your
query wrong? Did it delete the dupe and the original record as well?

select bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,nationaldrugcode)) > 1


I should have look more carefully, I just translated your attempt:

DELETE FROM ci b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

into

delete b.*
from ci as b
join (
SELECT bu,ndc
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) as a
on a.bu=b.bu
and a.ndc = b.ndc

It is probably not what you want because it will delete *all* rows that
are duplicated (including the first occurrence). Compare the result from:

SELECT *
FROM ci b
WHERE EXISTS (
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

with:

select b.*
from ci as b
join (
SELECT bu,ndc
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) as a
on a.bu=b.bu
and a.ndc = b.ndc

Quote:
I wouldn't hide the table at a deeper level because users are using
the web page and it is somewhat slow already because of a UNION
query. I could try it if all else fails.
No, you misunderstood me. What I meant was that you can hide the table
inside the delete stmt as in:

SELECT *
FROM ci b
WHERE EXISTS (
select * FROM (
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) AS X
) a
where a.bu=b.bu
and a.ndc = b.ndc

but this will also delete more rows than you want (I guess). You will
have to enumerate the duplicate rows somehow (the sequential unique id
is probably your best option). Here's one attempt:

delete b.*
from ci as b
where exists (
select 1
from ci as c
where c.bu = b.bu
and c.ndc = b.ndc
and c.id < b.id
)

But mysql does not allow this, so we'll rewrite that as a join:

delete b.*
from ci as b
join ci as c
on c.bu = b.bu
and c.ndc = b.ndc
and c.id < b.id

Example:

create table ci (
id int not null primary key,
bu int not null,
ndc int not null
) engine = innodb;

insert into ci (id, bu, ndc)
values (1,1,1),(2,1,1),(3,2,1),(4,2,2),(5,2,2),(6,1,1);

The first query will delete *all* duplicates, i.e. all rows but the one
with id = 3:

mysql> select b.*
-> from ci as b
-> join (
-> SELECT bu,ndc
-> FROM ci
-> GROUP BY bu,ndc
-> HAVING COUNT(1)>1
-> ) as a
-> on a.bu=b.bu
-> and a.ndc = b.ndc;
+----+----+-----+
Quote:
id | bu | ndc |
+----+----+-----+
1 | 1 | 1 |
2 | 1 | 1 |
4 | 2 | 2 |
5 | 2 | 2 |
6 | 1 | 1 |
+----+----+-----+

The rewritten query leaves the duplicates with the lowest id:

mysql> select b.* from ci as b
-> join ci as c
-> on c.bu = b.bu
-> and c.ndc = b.ndc
-> and c.id < b.id;
+----+----+-----+
Quote:
id | bu | ndc |
+----+----+-----+
2 | 1 | 1 |
5 | 2 | 2 |
6 | 1 | 1 |
6 | 1 | 1 |
+----+----+-----+
4 rows in set (0.05 sec)

mysql> delete b.*
-> from ci as b
-> join ci as c
-> on c.bu = b.bu
-> and c.ndc = b.ndc
-> and c.id < b.id;
Query OK, 3 rows affected (0.05 sec)

mysql> select * from ci;
+----+----+-----+
Quote:
id | bu | ndc |
+----+----+-----+
1 | 1 | 1 |
3 | 2 | 1 |
4 | 2 | 2 |
+----+----+-----+
3 rows in set (0.00 sec)

Don't just take the code and execute it, try understand the similarities
and differences from the first attempt.

HTH
/Lennart

Reply With Quote
  #8  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-04-2011 , 03:38 PM



On 2011-01-04 22:19, Lennart Jonsson wrote:
[...]
Quote:
SELECT *
FROM ci b
WHERE EXISTS (
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

I get an syntax error here, so I guess uncorrelated exists subqueries
are not allowed either. You can rewrite this to:

SELECT b.* FROM ci as b where exists ( select bu,ndc from ci as a where
a.bu = b.bu and a.ndc = b.ndc group by bu,ndc having count(1) > 1);
+----+----+-----+
Quote:
id | bu | ndc |
+----+----+-----+
1 | 1 | 1 |
2 | 1 | 1 |
4 | 2 | 2 |
5 | 2 | 2 |
6 | 1 | 1 |
+----+----+-----+

/Lennart

Reply With Quote
  #9  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-05-2011 , 05:53 PM



On 2011-01-04 22:38, Lennart Jonsson wrote:
Quote:
On 2011-01-04 22:19, Lennart Jonsson wrote:
[...]

SELECT *
FROM ci b
WHERE EXISTS (
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc


I get an syntax error here
Never mind, there where several other reasons for the syntax error, I
just did not notice them.

Reply With Quote
  #10  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-06-2011 , 07:55 AM



On Jan 5, 3:53*pm, Lennart Jonsson <erik.lennart.jons... (AT) gmail (DOT) com>
wrote:
Quote:
On 2011-01-04 22:38, Lennart Jonsson wrote:









On 2011-01-04 22:19, Lennart Jonsson wrote:
[...]

SELECT *
FROM ci *b
WHERE EXISTS (
* * SELECT bu,ndc, count(1)
* * FROM ci
* * GROUP BY bu,ndc
* * HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

I get an syntax error here

Never mind, there where several other reasons for the syntax error, I
just did not notice them.
I do get it, thank, you finished the first query that I was trying to
do even though it was wrong.
If I could have done it then I could have seen it was wrong. I did
want to do the self-join.
I get it you can't delete from the same table as a subquery. Thanks
for explaining the rest of it.

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.