dbTalk Databases Forums  

SQL - Indexing for performance on uniquness check...

comp.databases.postgresql.novice comp.databases.postgresql.novice


Discuss SQL - Indexing for performance on uniquness check... in the comp.databases.postgresql.novice forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Loftis, Charles E
 
Posts: n/a

Default SQL - Indexing for performance on uniquness check... - 07-18-2004 , 02:20 PM






When trying to find duplicates on an table how I need to know how index the
table to optimize performance.
Should there be an index for each attribute (A1, A2, ..., An) in the GROUP
BY or should there be one multi-attribute index on all the grouping
attributes.

Assume the table has more attributes than those attributes being GROUPed on.
Also, assume all attributes are of type varchar.

Sample query to return non-uniqueness
SELECT A1, A2, A3, ..., An
FROM Table
GROUP BY A1, A2, A3, ..., An
HAVING Count(*)>1

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo (AT) postgresql (DOT) org


Reply With Quote
  #2  
Old   
Josh Berkus
 
Posts: n/a

Default Re: SQL - Indexing for performance on uniquness check... - 07-18-2004 , 03:09 PM






Charles,

Quote:
Assume the table has more attributes than those attributes being GROUPed
on. Also, assume all attributes are of type varchar.

Sample query to return non-uniqueness
SELECT A1, A2, A3, ..., An
FROM Table
GROUP BY A1, A2, A3, ..., An
HAVING Count(*)>1
In order for it to be even possible to use an index (a hashaggregate
operation, actually) on this table, you'd have to include *all* of the GROUP
BY columns in a single, multi-column index.

However, it would be unlikely for PG to use any kind of an index in the
operation above, because of the number of columns, the unlikelyness of
grouping (i.e. there will only be a minority of rows with count(*) > 1) and
the fact that you're running this against the whole table. So any kind of an
index is liable to be useless.

If the table is so large that you *have* to use an index or it takes
absolutely forever to run, then you may wish to try different operations to
detect duplicate rows. For example, if it could be assumed that there was a
column Ax which was not included as a unique identifier, and could be counted
on to be both (a) not null and (b) different for duplicate rows (a timestamp
for example), then you could do:

SELECT A1, A2, A3, ... An
FROM table
WHERE EXISTS (SELECT 1
FROM table t2
WHERE t2.A1 = table.A1
AND t2.A2 = table.A2
...
AND t2.An = table.An
AND t2.Ax <> table.Ax
);

This can be a much better structure for indexed searches because an index on
some-but-not-all of the columns A1 ... An can be used for the EXISTS join,
such as an index on A1, A2, A3.

Of course, if this is an import table, where some of the rows are *exact*
duplicates, the above doesn't help. On the other hand, if the rows *are*
exact duplicates, why do you care? Just do SELECT DISTINCT on transfer and
eliminate them.

--
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Reply With Quote
  #3  
Old   
Tom Lane
 
Posts: n/a

Default Re: SQL - Indexing for performance on uniquness check... - 07-18-2004 , 11:05 PM



Josh Berkus <josh (AT) agliodbs (DOT) com> writes:
Quote:
Charles,
Sample query to return non-uniqueness
SELECT A1, A2, A3, ..., An
FROM Table
GROUP BY A1, A2, A3, ..., An
HAVING Count(*)>1

In order for it to be even possible to use an index (a hashaggregate
operation, actually) on this table, you'd have to include *all* of the GROUP
BY columns in a single, multi-column index.

However, it would be unlikely for PG to use any kind of an index in the
operation above, because of the number of columns, the unlikelyness of
grouping (i.e. there will only be a minority of rows with count(*) > 1) and
the fact that you're running this against the whole table. So any
kind of an index is liable to be useless.
Yeah. If you are not expecting a huge number of groups, I think that it
would be more interesting to try a HashAggregate plan than a sort/group
plan. For this you need 7.4 or later and a sort_mem setting large
enough to cover whatever the planner estimates the hashtable size to be.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings



Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.