dbTalk Databases Forums  

delete dupes with a unique id using a temp table

comp.databases.mysql comp.databases.mysql


Discuss delete dupes with a unique id using a temp table in the comp.databases.mysql forum.



Reply
 
Thread Tools Display Modes
  #11  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-12-2011 , 05:36 PM






On Jan 4, 1:38*pm, Lennart Jonsson <erik.lennart.jons... (AT) gmail (DOT) com>
wrote:
Quote:
On 2011-01-04 22:19, Lennart Jonsson wrote:
[...]



SELECT *
FROM ci *b
WHERE EXISTS (
* * SELECT bu,ndc, count(1)
* * FROM ci
* * GROUP BY bu,ndc
* * HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

I get an syntax error here, so I guess uncorrelated exists subqueries
are not allowed either. You can rewrite this to:

SELECT b.* FROM ci as b where exists ( select bu,ndc from ci as a where
a.bu = b.bu and a.ndc = b.ndc group by *bu,ndc having count(1) > 1);
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
| *1 | *1 | * 1 |
| *2 | *1 | * 1 |
| *4 | *2 | * 2 |
| *5 | *2 | * 2 |
| *6 | *1 | * 1 |
+----+----+-----+

/Lennart
I ran through all your exercises ;-)
I see why my first query didn't work there were 2 WHERE clauses.
The 2nd query of yours deletes all duplicates because it is an equi-
join on bu&ndc.
The 3rd query, I did get a syntax error on the select:
select *
from ci b
where exists(
select bu,ndc, count(1)
from ci
group by bu, ndc
having count(1)>1
)a
where a.bu=b.bu
and a.ndc=b.ndc

and I also tried it as a delete:
delete * from ci b
where exists(
select bu,ndc, count(1)
from ci
group by bu, ndc
having count(1)>1
)a
where a.bu=b.bu
and a.ndc=b.ndc,
I got syntax errors. You mentioned some other reason other than a
syntax error?

The last query worked as you said.
joining also on the id.
It worked really well on one pass!!!!!!!!!!!

Reply With Quote
  #12  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-13-2011 , 12:10 AM






On 2011-01-13 00:36, jr wrote:
[...]
Quote:
The 3rd query, I did get a syntax error on the select:
select *
from ci b
where exists(
select bu,ndc, count(1)
from ci
group by bu, ndc
having count(1)>1
)a
where a.bu=b.bu
and a.ndc=b.ndc

[...]
I got syntax errors. You mentioned some other reason other than a
syntax error?

You already mentioned the two where clauses, fixing that leaves us with:

.... exists (
select bu,ndc, count(1)
from ci
group by bu, ndc
having count(1)>1
) a

exists is a predicate ( sometimes referred to as propositional function,
you can think of it as an anonymous function returning true/false ) and
you can't give it a name ( there is no point in doing so either ).
Furthermore bu, ndc, etc inside the predicate is not visible outside the
predicate, so you can't reference them like you do in:

where a.bu=b.bu
and a.ndc=b.ndc

You can however reference variables outside of the predicate from within
the predicate, so

select *
from ci b
where exists(
select bu,ndc, count(1)
from ci as a

where a.bu=b.bu
and a.ndc=b.ndc

group by bu, ndc
having count(1)>1
)

is syntactically correct (will still give you the wrong result).


/Lennart

Reply With Quote
  #13  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-17-2011 , 03:41 PM



On Jan 4, 1:19*pm, Lennart Jonsson <erik.lennart.jons... (AT) gmail (DOT) com>
wrote:
Quote:
On 2011-01-04 16:37, jr wrote:
[...]



The query is beautiful although it deleted 1085 rows. * When I did the
first query to locate duplicates it found only 487. *Why is
that? ?????
Where did the extra
This is the query I used to find the dupes. *Is that wrong or is your
query wrong? *Did it delete the dupe and the original record as well?

select bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,nationaldrugcode)) > 1

I should have look more carefully, I just translated your attempt:

DELETE FROM ci *b
WHERE EXISTS ((
SELECT bu,ndc, count(1)
FROM ci
GROUP BY bu,ndc
HAVING *COUNT(1)>1
) a
where *a.bu=b.bu
and a.ndc = b.ndc

into

delete b.*
from ci as b
join (
* SELECT bu,ndc
* FROM ci
* GROUP BY bu,ndc
* HAVING *COUNT(1)>1
) as a
* * on a.bu=b.bu
* *and a.ndc = b.ndc

It is probably not what you want because it will delete *all* rows that
are duplicated (including the first occurrence). Compare the result from:

SELECT *
FROM ci *b
WHERE EXISTS (
* * SELECT bu,ndc, count(1)
* * FROM ci
* * GROUP BY bu,ndc
* * HAVING COUNT(1)>1
) a
where a.bu=b.bu
and a.ndc = b.ndc

with:

select b.*
from ci as b
join (
* SELECT bu,ndc
* FROM ci
* GROUP BY bu,ndc
* HAVING *COUNT(1)>1
) as a
* * on a.bu=b.bu
* *and a.ndc = b.ndc



I wouldn't hide the table at a deeper level because users are using
the web page and it is somewhat slow already because of a UNION
query. *I could try it if all else fails.

No, you misunderstood me. What I meant was that you can hide the table
inside the delete stmt as in:

SELECT *
FROM ci *b
WHERE EXISTS (
* * select * FROM (
* * * * SELECT bu,ndc, count(1)
* * * * FROM ci
* * * * GROUP BY bu,ndc
* * * * HAVING COUNT(1)>1
* * ) AS X
) a
where a.bu=b.bu
and a.ndc = b.ndc

but this will also delete more rows than you want (I guess). You will
have to enumerate the duplicate rows somehow (the sequential unique id
is probably your best option). Here's one attempt:

delete b.*
from ci as b
where exists (
* * select 1
* * from ci as c
* * where c.bu = b.bu
* * * and c.ndc = b.ndc
* * * and c.id < b.id
)

But mysql does not allow this, so we'll rewrite that as a join:

delete b.*
from ci as b
join ci as c
* * on c.bu = b.bu
* * and c.ndc = b.ndc
* * and c.id < b.id

Example:

create table ci (
* * id int not null primary key,
* * bu int not null,
* * ndc int not null
) engine = innodb;

insert into ci (id, bu, ndc)
values (1,1,1),(2,1,1),(3,2,1),(4,2,2),(5,2,2),(6,1,1);

The first query will delete *all* duplicates, i.e. all rows but the one
with id = 3:

mysql> select b.*
* * -> from ci as b
* * -> join (
* * -> * SELECT bu,ndc
* * -> * FROM ci
* * -> * GROUP BY bu,ndc
* * -> * HAVING *COUNT(1)>1
* * -> ) as a
* * -> * * on a.bu=b.bu
* * -> * *and a.ndc = b.ndc;
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
| *1 | *1 | * 1 |
| *2 | *1 | * 1 |
| *4 | *2 | * 2 |
| *5 | *2 | * 2 |
| *6 | *1 | * 1 |
+----+----+-----+

The rewritten query leaves the duplicates with the lowest id:

mysql> select b.* from ci as b
* * -> join ci as c
* * -> * * on c.bu = b.bu
* * -> * * and c.ndc = b.ndc
* * -> * * and c.id < b.id;
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
| *2 | *1 | * 1 |
| *5 | *2 | * 2 |
| *6 | *1 | * 1 |
| *6 | *1 | * 1 |
+----+----+-----+
4 rows in set (0.05 sec)

mysql> delete b.*
* * -> from ci as b
* * -> join ci as c
* * -> * *on c.bu = b.bu
* * -> * and c.ndc = b.ndc
* * -> * and c.id < b.id;
Query OK, 3 rows affected (0.05 sec)

mysql> select * from ci;
+----+----+-----+
| id | bu | ndc |
+----+----+-----+
| *1 | *1 | * 1 |
| *3 | *2 | * 1 |
| *4 | *2 | * 2 |
+----+----+-----+
3 rows in set (0.00 sec)

Don't just take the code and execute it, try understand the similarities
and differences from the first attempt.

HTH
/Lennart
Lennart,
Regarding this last delete query, it worked on your small sample
database but I tried it on my db of 46825 records and it didn't work.
It left me with only 24,571 records. I did a query to get the num of
duplicates and there were only 9996. Most of those were only
duplicates, but there were some triplicates etc. So it deleted way
too many and I had to kill the server eventually. I came up with 3
other queries that does the deleting in passes by comparing a table
with the duplicate ids with a table containing the max(id)s. It
seemed to work except something is wrong there also. It deleted all
the duplicates, triplcates, quadruplicates, etc. It should have only
deleted one of the duplicates with the max(id) and left the original
duplicate.

1.create table ci_tmp1 as
(select id,bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,ndc)) > 1
)

2.create table ci_tmp2 as(select max(c.id) from ci c, ci_tmp1 t
WHERE t.bu=c.bu AND t.ndc=c.ndc
GROUP BY t.bu, t.ndc)

3.DELETE FROM ci WHERE id IN (select * from ci_tmp2)

This delete query got near to the correct amount of duplicates
34,576. If you subtract 9996 from 46825 you get 36,829.
The only problem with this result was I was expecting to have the
triplicates and quadrupleticates left in the ci table and I would have
to run through
the 3 queries again. When I checked the ci table there were no more
duplicates. What I was hoping was that your query results number
would match with
my way of doing it in 3 passes and I would know I could use your query
but now I'm not even sure if my queries are correct? What do you
think?
Regards, Janis
..

Reply With Quote
  #14  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-17-2011 , 03:45 PM



On Jan 12, 10:10*pm, Lennart Jonsson <erik.lennart.jons... (AT) gmail (DOT) com>
wrote:
Quote:
On 2011-01-13 00:36, jr wrote:
[...]









The 3rd query, *I did get a syntax error on the select:
select *
from ci *b
where exists(
select bu,ndc, count(1)
from ci
group by bu, ndc
having count(1)>1
)a
where a.bu=b.bu
and a.ndc=b.ndc

[...]
I got syntax errors. *You mentioned some other reason other than a
syntax error?

You already mentioned the two where clauses, fixing that leaves us with:

... exists (
* * select bu,ndc, count(1)
* * from ci
* * group by bu, ndc
* * having count(1)>1
) a

exists is a predicate ( sometimes referred to as propositional function,
you can think of it as an anonymous function returning true/false ) and
you can't give it a name ( there is no point in doing so either ).
Furthermore bu, ndc, etc inside the predicate is not visible outside the
predicate, so you can't reference them like you do in:

* where a.bu=b.bu
* * and a.ndc=b.ndc

You can however reference variables outside of the predicate from within
the predicate, so

select *
from ci *b
where exists(
* * select bu,ndc, count(1)
* * from ci as a

* * where a.bu=b.bu
* * and a.ndc=b.ndc

* * group by bu, ndc
* * having count(1)>1
)

is syntactically correct (will still give you the wrong result).

/Lennart
thanks,

Reply With Quote
  #15  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-17-2011 , 11:54 PM



On 2011-01-17 22:41, jr wrote:
[...]
Quote:
Lennart,
Regarding this last delete query, it worked on your small sample
database but I tried it on my db of 46825 records and it didn't work.
It left me with only 24,571 records. I did a query to get the num of
duplicates and there were only 9996. Most of those were only
duplicates, but there were some triplicates etc. So it deleted way
too many and I had to kill the server eventually. I came up with 3
other queries that does the deleting in passes by comparing a table
with the duplicate ids with a table containing the max(id)s. It
seemed to work except something is wrong there also. It deleted all
the duplicates, triplcates, quadruplicates, etc. It should have only
deleted one of the duplicates with the max(id) and left the original
duplicate.

1.create table ci_tmp1 as
(select id,bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,ndc)) > 1
)

This query is illegal and will give you random results. It will only
work if id is functionally dependent of bu, ndc. You can prevent this
misbehaviour by enabling ONLY_FULL_GROUP_BY in sql_mode, mysql will then
return an error instead of a wrong result.

/Lennart

[...]

Reply With Quote
  #16  
Old   
jr
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-18-2011 , 11:09 AM



On Jan 17, 9:54*pm, Lennart Jonsson <erik.lennart.jons... (AT) gmail (DOT) com>
wrote:
Quote:
On 2011-01-17 22:41, jr wrote:
[...]









Lennart,
Regarding this last delete query, it worked on your small sample
database but *I tried it on my db of 46825 records and it didn't work..
It left me with only 24,571 records. *I did a query to get the num of
duplicates and there were only 9996. *Most of those were only
duplicates, but there were some triplicates etc. *So it deleted way
too many and I had to kill the server eventually. * I came up with 3
other queries that does the deleting in passes by comparing a table
with the duplicate ids with a table containing the max(id)s. *It
seemed to work except something is wrong there also. It deleted all
the duplicates, triplcates, quadruplicates, etc. *It should have only
deleted one of the duplicates with the max(id) and left the original
duplicate.

1.create table ci_tmp1 as
(select id,bu,ndc, count(*)
from ci
group by bu,ndc
having COUNT(concat(bu,ndc)) > 1
)

This query is illegal and will give you random results. It will only
work if id is functionally dependent of bu, ndc. You can prevent this
misbehaviour by enabling ONLY_FULL_GROUP_BY in sql_mode, mysql will then
return an error instead of a wrong result.

/Lennart

[...]
thanks for explaining it. It was very useful.

Reply With Quote
  #17  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: delete dupes with a unique id using a temp table - 01-18-2011 , 01:25 PM



On 2011-01-18 18:09, jr wrote:
[...]
Quote:
This query is illegal and will give you random results. It will only
work if id is functionally dependent of bu, ndc. You can prevent this
misbehaviour by enabling ONLY_FULL_GROUP_BY in sql_mode, mysql will then
return an error instead of a wrong result.

/Lennart

[...]

thanks for explaining it. It was very useful.
You can verify by:

set sql_mode='';

create table t (
id int not null primary key,
bu int not null,
ndc int not null
) engine = innodb;

insert into t (id, bu, ndc) values (1,1,1),(2,1,1),(3,1,1);

select id, bu, ndc, count(1) from t group by bu, ndc;
+----+----+-----+----------+
Quote:
id | bu | ndc | count(1) |
+----+----+-----+----------+
1 | 1 | 1 | 3 |
+----+----+-----+----------+
1 row in set (0.00 sec)

id might be any one of 1, 2 or 3

/Lennart

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.