gjeff (AT) chariot (DOT) net.au wrote:
Quote:
cntl_data varchar2(15),
cntl_no varchar2(10),
field1 varchar2(60),
field2 varchar2(60),
ETC..
field20 varchar2(60).
cntl_data is a date & time stamp - yy-mm-dd-hh:mm:ss format and is
unique. |
unique - that's good.
Quote:
cntl_no contains an integer value, and is non unique.
field1 to field20 contains text data, and could contain nulls in the
entire field. |
NULLs - possible problem.
Quote:
What I would like to do is to delete all duplicated rows, with same
cntl_no, and field1 to fields20 having he same values, but retaining
the earliest row based on the value in cntl_data. |
Ignoring NULLs for now:
Method 1:
DELETE FROM table WHERE cntl_data IN
(SELECT b.cntl_data from table a, table b
WHERE a.cntl_data < b.cntl_data AND
a.cntl_no = b.cntl_no AND
a.field1 = b.field1 AND
etc....
)
What it does is join to itself, and then match all records having
identical data fields, as you mentioned, then it only returns the rows
where the a.cntl_data does not equal the b.cntl_data (to avoid a row
matching itself).
By using a.cntl_data < b.cntl_data I always make sure that the date in
the 'a' column will be less the one in the 'b' column, then I use the b
column in the select, and use that to delete values.
If cntl_data is not globally unique, then you can also return b.rowid
and delete based on that.
Now to deal with NULL's:
You can maybe use the NVL function to 'de-null' all the data. This is
assuming you have some value you can use as a substitute for null, that
will not otherwise be in the data.
Or, if not, then do this:
( a.field1 = b.field1 OR (a.field1 IS NULL AND b.field1 IS NULL) ) AND
( a.field2 = b.field2 OR (a.field2 IS NULL AND b.field2 IS NULL) ) AND
etc...
Method 2:
SELECT MAX(cntl_data) FROM table
GROUP BY cntl_no, field1, field2, etc.
HAVING COUNT(*) > 1
Then use this as a subquery for the delete. But the problem with this is
that if you have 3 duplicates for a row, only one of them will be
deleted. I suppose you can reverse the logic: use MIN(cntl_data), and
use that list as a list of rows to KEEP (NOT IN (subquery) assuming
cntl_data is never NULL). Or you can run the query multiple times till
it stops deleting stuff.
The main benefit of the first method is more flexibility in what counts
as a duplicate (use whatever logic you like for matching the columns),
with the second method you have to have an exact match for the group by.
-Ariel
PS. I didn't double check the SQL here, so please do some manual sanity
checking before running the code.