dbTalk Databases Forums  

data cleansing table needs dupes removed

comp.databases.mysql comp.databases.mysql


Discuss data cleansing table needs dupes removed in the comp.databases.mysql forum.



Reply
 
Thread Tools Display Modes
  #11  
Old   
Luuk
 
Posts: n/a

Default Re: data cleansing table needs dupes removed - 11-13-2010 , 10:19 AM






On 12-11-10 20:15, jr wrote:
Quote:
my ci2 table: id,businessunit, zonenumber,zoneid, ndc
There are 874 records that have 1 or more duplicates across zones
with the 4 columns in the group by making it unique. I only need to
order 1 per business unit, but the zones have more than one ndc
across the zones.
select * FROM ci2 WHERE ndc is not null
GROUP BY businessunit,zonenm,zoneid,ndc
HAVING ( COUNT(*)> 1 )

How to change the select into a delete?

I am reading on how to create a temp table. How do I get the unique
records into the temp table? Then I drop the real table.

tnx,


Your query is basically the same as:

SELECT * FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

And this can be rewritten as:
DELETE FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

(untested, please backup your data first ;-)

--
Luuk

Reply With Quote
  #12  
Old   
jr
 
Posts: n/a

Default Re: data cleansing table needs dupes removed - 11-13-2010 , 10:42 AM






On Nov 13, 8:19*am, Luuk <L... (AT) invalid (DOT) lan> wrote:
Quote:
On 12-11-10 20:15, jr wrote:



my ci2 *table: *id,businessunit, zonenumber,zoneid, ndc
There are 874 records that have 1 or more duplicates *across zones
with the 4 columns in the group by making it unique. *I only need to
order *1 per business unit, but the zones have more than one ndc
across the zones.
select * FROM ci2 WHERE ndc is not null
GROUP BY businessunit,zonenm,zoneid,ndc
HAVING ( COUNT(*)> *1 )

How to change the select into a delete?

I am reading on how to create a temp table. *How do I get the unique
records into the temp table? *Then I drop the real table.

tnx,

Your query is basically the same as:

SELECT * FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

And this can be rewritten as:
DELETE FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

(untested, please backup your data first ;-)

--
Luuk
tnx,

Reply With Quote
  #13  
Old   
jr
 
Posts: n/a

Default Re: data cleansing table needs dupes removed - 11-17-2010 , 02:42 PM



On Nov 13, 8:19*am, Luuk <L... (AT) invalid (DOT) lan> wrote:
Quote:
On 12-11-10 20:15, jr wrote:









my ci2 *table: *id,businessunit, zonenumber,zoneid, ndc
There are 874 records that have 1 or more duplicates *across zones
with the 4 columns in the group by making it unique. *I only need to
order *1 per business unit, but the zones have more than one ndc
across the zones.
select * FROM ci2 WHERE ndc is not null
GROUP BY businessunit,zonenm,zoneid,ndc
HAVING ( COUNT(*)> *1 )

How to change the select into a delete?

I am reading on how to create a temp table. *How do I get the unique
records into the temp table? *Then I drop the real table.

tnx,

Your query is basically the same as:

SELECT * FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

And this can be rewritten as:
DELETE FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

Hello, I still need the answer on this. Is this right?

SELECT businessunit,zonenm,zoneid,sku,ndc, count(*)
from ci2
group by businessunit,ndc
having COUNT(concat(bu,ndc) > 1)

It is not doing what I expect.
The count column has mostly 1s, but shows some 2s (dupes)
but I need all of those > 1 in a aggregate list.

thanks,

Reply With Quote
  #14  
Old   
Peter H. Coffin
 
Posts: n/a

Default Re: data cleansing table needs dupes removed - 11-17-2010 , 07:55 PM



On Wed, 17 Nov 2010 12:42:09 -0800 (PST), jr wrote:
Quote:
On Nov 13, 8:19*am, Luuk <L... (AT) invalid (DOT) lan> wrote:
On 12-11-10 20:15, jr wrote:

my ci2 *table: *id,businessunit, zonenumber,zoneid, ndc
There are 874 records that have 1 or more duplicates *across zones
with the 4 columns in the group by making it unique. *I only need to
order *1 per business unit, but the zones have more than one ndc
across the zones.
select * FROM ci2 WHERE ndc is not null
GROUP BY businessunit,zonenm,zoneid,ndc
HAVING ( COUNT(*)> *1 )

How to change the select into a delete?

I am reading on how to create a temp table. *How do I get the unique
records into the temp table? *Then I drop the real table.

tnx,

Your query is basically the same as:

SELECT * FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

And this can be rewritten as:
DELETE FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1
I do not think this will do what the poster wants it to. That will not
delete duplicates in the sense that there will be only one row of the
duplicates left. That will delete ALL the rows that have the duplicate
values, leaving no row that has those previously-duplicated values.

Quote:
Hello, I still need the answer on this. Is this right?

SELECT businessunit,zonenm,zoneid,sku,ndc, count(*)
from ci2
group by businessunit,ndc
having COUNT(concat(bu,ndc) > 1)

It is not doing what I expect.
That is one of the best indications that it is not right.

Quote:
The count column has mostly 1s, but shows some 2s (dupes)
but I need all of those > 1 in a aggregate list.

thanks,
The case where *all* the values are duplicated is trivial. That's what
SELECT DISTINCT is for. Do that into a copy of the table, delete the
old one, rename the new, and you're done.

The problem ones are where you have some values that are duplicate and
some that are not. They're duplicate by the values that matter to you,
or you want to save some particular quality above others. Maybe you want
to save the oldest one by a timestamp, or the highest value by customer
number. In this example, we will save by the highest value in rowid.

I'm going to do a real example. Sadly, this means I have to use my data,
not yours, since I want to actually do the operations and show what's
happening.

============================================
$ mysql -u hellsop -p test
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 149646
Server version: 5.0.33-log OpenBSD port: mysql-server-5.0.33

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select * from trashheap2;
+-------+--------------+----------+---------------------------------+
Quote:
rowid | parent_rowid | label | data |
+-------+--------------+----------+---------------------------------+
1 | NULL | root | this is the document root |
2 | 1 | layer 1 | This is the first layer |
3 | 1 | layer 2 | This is the second layer |
4 | 3 | item 3 | This is also at the first layer |
5 | NULL | sqlyog | ????? |
6 | NULL | japanese | ????? |
7 | 3 | item 3 | This is also at the first layer |
8 | 3 | item 3 | This is also at the first layer |
+-------+--------------+----------+---------------------------------+
8 rows in set (0.00 sec)
============================================

Nice table, yes? Manageable. Has a couple of traps for the unwary. If
you are seeing it in my terminal, it's a pleasing aqua color too. Pay
no attention to the contents; they are unimportant. The rowid is not
going to be considered a duplicate value, since there are none. But we
will want to keep the highest of those, to prove that we're actually in
control of the process.

Here is how to get the rows that we want to keep:

============================================
mysql> SELECT a.* FROM trashheap2 a
-> JOIN (SELECT *, MAX(rowid) AS max_id
-> FROM trashheap2
-> GROUP BY parent_rowid, label, `data`) b
-> ON (a.parent_rowid = b.parent_rowid OR
-> (a.parent_rowid IS NULL AND b.parent_rowid IS NULL)) AND
-> a.label = b.label AND
-> a.`data` = b.`data` AND
-> a.rowid = b.max_id
-> ORDER BY rowid;
+-------+--------------+----------+---------------------------------+
Quote:
rowid | parent_rowid | label | data |
+-------+--------------+----------+---------------------------------+
1 | NULL | root | this is the document root |
2 | 1 | layer 1 | This is the first layer |
3 | 1 | layer 2 | This is the second layer |
5 | NULL | sqlyog | ????? |
6 | NULL | japanese | ????? |
8 | 3 | item 3 | This is also at the first layer |
+-------+--------------+----------+---------------------------------+
6 rows in set (0.02 sec)
============================================

Let's look at the query.

SELECT a.* FROM trashheap2 a
-- Here we join the table to a phantom copy of itself.
-- max_id is how we want to pick which row to keep.
JOIN (SELECT *, MAX(rowid) AS max_id
FROM trashheap2
GROUP BY parent_rowid, label, `data`) b
-- First, we need to connect it all up by our duplicate columns
ON (a.parent_rowid = b.parent_rowid OR
-- (We have to pull a few shenanigans here because joining on null
-- values doesn't actually match in a join. Comparing NULL to NULL
-- returns NULL, not TRUE...)
(a.parent_rowid IS NULL AND b.parent_rowid IS NULL)) AND
a.label = b.label AND
a.`data` = b.`data` AND
-- And here we set our criteria for which row of the duplicate ones
-- we want to keep.
a.rowid = b.max_id
ORDER BY rowid;

Now, if you feel lazy and have the storage to do it, you can select that
into a fresh table like the SELECT DISTINCT one above. The drawback is
that it kind of depends on being able to interrupt things long enough do
all this table swapping. If you can't, you pull the same trick again:
You build a delete query that joins to the above query, and delete the
rows where the second subquery result is NULL. That means you have
opened the same table a total of three times in the same query, have
nested subqueries two deep and have four table aliases...

DELETE mstr
FROM trashheap2 mstr
LEFT OUTER JOIN (
SELECT a.*, b.max_id FROM trashheap2 a
JOIN (SELECT *, COUNT(*) AS cnt, MAX(rowid) AS max_id
FROM trashheap2
GROUP BY parent_rowid, label, `data`) b
ON (a.parent_rowid = b.parent_rowid OR
(a.parent_rowid IS NULL AND b.parent_rowid IS NULL)) AND
a.label = b.label AND
a.`data` = b.`data` AND
a.rowid = b.max_id
ORDER BY rowid
) sub
ON mstr.rowid = sub.rowid AND
-- same NULL shenanigans here again....
(mstr.parent_rowid = sub.parent_rowid OR
(mstr.parent_rowid IS NULL AND sub.parent_rowid IS NULL)) AND
mstr.label = sub.label AND
mstr.`data` = sub.`data`
WHERE sub.rowid IS NULL ;

And, sure enough, this works how we want it to:

============================================
mysql> DELETE mstr
-> FROM trashheap2 mstr
-> LEFT OUTER JOIN (
-> SELECT a.*, b.max_id FROM trashheap2 a
-> JOIN (SELECT *, COUNT(*) AS cnt, MAX(rowid) AS max_id
-> FROM trashheap2
-> GROUP BY parent_rowid, label, `data`) b
-> ON (a.parent_rowid = b.parent_rowid OR
-> (a.parent_rowid IS NULL AND b.parent_rowid IS NULL)) AND
-> a.label = b.label AND
-> a.`data` = b.`data` AND
-> a.rowid = b.max_id
-> ORDER BY rowid
-> ) sub
-> ON mstr.rowid = sub.rowid AND
-> (mstr.parent_rowid = sub.parent_rowid OR
-> (mstr.parent_rowid IS NULL AND sub.parent_rowid IS NULL)) AND
-> mstr.label = sub.label AND
-> mstr.`data` = sub.`data`
-> WHERE sub.rowid IS NULL ;
Query OK, 2 rows affected (0.02 sec)

mysql> SELECT * FROM trashheap2;
+-------+--------------+----------+---------------------------------+
Quote:
rowid | parent_rowid | label | data |
+-------+--------------+----------+---------------------------------+
1 | NULL | root | this is the document root |
2 | 1 | layer 1 | This is the first layer |
3 | 1 | layer 2 | This is the second layer |
5 | NULL | sqlyog | ????? |
6 | NULL | japanese | ????? |
8 | 3 | item 3 | This is also at the first layer |
+-------+--------------+----------+---------------------------------+
6 rows in set (0.01 sec)
============================================

Good luck!

--
Liberty, equality, diversity. Pick any two.

Reply With Quote
  #15  
Old   
Luuk
 
Posts: n/a

Default Re: data cleansing table needs dupes removed - 11-18-2010 , 02:19 AM



On 18-11-10 02:55, Peter H. Coffin wrote:
Quote:
Your query is basically the same as:

SELECT * FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1

And this can be rewritten as:
DELETE FROM ci2
WHERE ndc is not null
AND (SELECT count(*) from ci2 GROUP BY businessunit,zonenm,zoneid,ndc)>1
I do not think this will do what the poster wants it to. That will not
delete duplicates in the sense that there will be only one row of the
duplicates left. That will delete ALL the rows that have the duplicate
values, leaving no row that has those previously-duplicated values.

you are correct, my query is wrong....

--
Luuk

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.