dbTalk Databases Forums  

COUNT efficiently?

comp.databases comp.databases


Discuss COUNT efficiently? in the comp.databases forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Steve
 
Posts: n/a

Default COUNT efficiently? - 02-17-2004 , 10:30 PM






Hi,

I have a huge database and I need to frequently do a COUNT() on a column
to see how many rows there are in the table. Sometimes I need to count
rows based on some condition but mostly it's just counting the total
rows in the table. It takes forever to do this and (I'm using
postgreSQL) when I try doing a "EXPLAIN" on the query, it says it's
going to 'sequentially' scan through each the table. Is there some
efficient way of doing a COUNT? Can someone please help me out with
this? What do people do when their tables have billions of rows? Thanks,

Steve


Reply With Quote
  #2  
Old   
Robert Klemme
 
Posts: n/a

Default Re: COUNT efficiently? - 02-18-2004 , 02:36 AM







"Steve" <nospam@nopes> schrieb im Newsbeitrag
news:4032ea88$1 (AT) clarion (DOT) carno.net.au...
Quote:
Hi,

I have a huge database and I need to frequently do a COUNT() on a column
to see how many rows there are in the table. Sometimes I need to count
rows based on some condition but mostly it's just counting the total
rows in the table. It takes forever to do this and (I'm using
postgreSQL) when I try doing a "EXPLAIN" on the query, it says it's
going to 'sequentially' scan through each the table. Is there some
efficient way of doing a COUNT? Can someone please help me out with
this? What do people do when their tables have billions of rows? Thanks,
I'd say you need at least one index expecially for those counts where
there is a criterion. Did you create an index that contains all fields
used in the criterion? The index might even help for counting all. It
might be better then to use "select count(colum_in_index) from
your_table".

robert



Reply With Quote
  #3  
Old   
Christopher Browne
 
Posts: n/a

Default Re: COUNT efficiently? - 02-18-2004 , 07:08 AM



Centuries ago, Nostradamus foresaw when "Robert Klemme" <bob.news (AT) gmx (DOT) net> would write:
Quote:
"Steve" <nospam@nopes> schrieb im Newsbeitrag
news:4032ea88$1 (AT) clarion (DOT) carno.net.au...
Hi,

I have a huge database and I need to frequently do a COUNT() on a
column to see how many rows there are in the table. Sometimes I
need to count rows based on some condition but mostly it's just
counting the total rows in the table. It takes forever to do this
and (I'm using postgreSQL) when I try doing a "EXPLAIN" on the
query, it says it's going to 'sequentially' scan through each the
table. Is there some efficient way of doing a COUNT? Can someone
please help me out with this? What do people do when their tables
have billions of rows? Thanks,

I'd say you need at least one index expecially for those counts
where there is a criterion. Did you create an index that contains
all fields used in the criterion? The index might even help for
counting all. It might be better then to use "select
count(colum_in_index) from your_table".
An index would help with counts where the criterion is reasonably
selective. (Is there an improvement in using the index if the 40% of
the table that you're counting means you still have to walk through
~95% of the pages? No, there's not...)

Unfortunately, the MVCC system that improves performance for
multitudes of concurrent users means that there _isn't_ a ready
automatic way of keeping precise statistics on COUNT(*) on the table.
Oracle and DB/2 have similar problems, these days.

I suppose that someone (perhaps I) will have to see about implementing
the trigger-based scheme for having counters for tables. The
essential idea:

Each table that is to be thus tracked has a trigger added...

create trigger add_item_to_x
on insert to the_big_table
do add_count_for_x();

create trigger drop_item_to_x
on delete from the_big_table
do sub_count_for_x();

Each time rows are inserted/deleted, a row is inserted into table
COUNT_SUMMARY indicating how many rows were added/removed.

Instead of doing "select count(*) from the_big_table", you instead do:

select sum(net_rows) from count_summary where id = [ID for
the_big_table];

A daemon goes through periodically to compress the table by grouping
together multiple count_summary rows into just 1.
--
output = reverse("gro.mca" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/wp.html
Rules of the Evil Overlord #81. "If I am fighting with the hero atop a
moving platform, have disarmed him, and am about to finish him off and
he glances behind me and drops flat, I too will drop flat instead of
quizzically turning around to find out what he saw."
<http://www.eviloverlord.com/>


Reply With Quote
  #4  
Old   
Alfredo Novoa
 
Posts: n/a

Default Re: COUNT efficiently? - 02-18-2004 , 09:46 AM



"Robert Klemme" <bob.news (AT) gmx (DOT) net> wrote


Hi

Quote:
I have a huge database and I need to frequently do a COUNT() on a column
to see how many rows there are in the table. Sometimes I need to count
rows based on some condition but mostly it's just counting the total
rows in the table. It takes forever to do this and (I'm using
postgreSQL) when I try doing a "EXPLAIN" on the query, it says it's
going to 'sequentially' scan through each the table.
It seems that such DBMS does not mantain the number of tuples of each
table in the catalog's statistics

Quote:
Is there some
efficient way of doing a COUNT? Can someone please help me out with
this? What do people do when their tables have billions of rows? Thanks,
I suposse that they use more powerful DBMSs like Oracle and DB2.

Quote:
I'd say you need at least one index expecially for those counts where
there is a criterion.
Snapshots (also called materialized views) could be useful for that.
But as far as I know PostgresSQL still does not support snapshots


Regards
Alfredo


Reply With Quote
  #5  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: COUNT efficiently? - 02-18-2004 , 01:10 PM



Christopher Browne <cbbrowne (AT) acm (DOT) org> wrote

Quote:
Centuries ago, Nostradamus foresaw when "Robert Klemme" <bob.news (AT) gmx (DOT) net> would write:
"Steve" <nospam@nopes> schrieb im Newsbeitrag
news:4032ea88$1 (AT) clarion (DOT) carno.net.au...
Hi,
[...]

Quote:
Unfortunately, the MVCC system that improves performance for
multitudes of concurrent users means that there _isn't_ a ready
automatic way of keeping precise statistics on COUNT(*) on the table.
Oracle and DB/2 have similar problems, these days.

I suppose that someone (perhaps I) will have to see about implementing
the trigger-based scheme for having counters for tables. The
essential idea:

Hi Christopher, judging from earlier posts you seem to have a lot of
knowledge and insight about postgres. I've been trying to find out if
there are any plans for snapshots (or summary tables) in postgres, but
havent found any answers regarding this. Someone mentioned that the
locking mechanism used by postgres makes it difficult to implement. Do
you know if there is any truth in this? Also, difficult is not the
same as impossible, so are there any plans for such?


Kind regards
/Lennart


Reply With Quote
  #6  
Old   
Christopher Browne
 
Posts: n/a

Default Re: COUNT efficiently? - 02-20-2004 , 10:00 AM



Centuries ago, Nostradamus foresaw when lennart (AT) kommunicera (DOT) umea.se (Lennart Jonsson) would write:
Quote:
Christopher Browne <cbbrowne (AT) acm (DOT) org> wrote

Centuries ago, Nostradamus foresaw when "Robert Klemme" <bob.news (AT) gmx (DOT) net> would write:
"Steve" <nospam@nopes> schrieb im Newsbeitrag
news:4032ea88$1 (AT) clarion (DOT) carno.net.au...
Hi,

[...]

Unfortunately, the MVCC system that improves performance for
multitudes of concurrent users means that there _isn't_ a ready
automatic way of keeping precise statistics on COUNT(*) on the table.
Oracle and DB/2 have similar problems, these days.

I suppose that someone (perhaps I) will have to see about implementing
the trigger-based scheme for having counters for tables. The
essential idea:

Hi Christopher, judging from earlier posts you seem to have a lot of
knowledge and insight about postgres. I've been trying to find out if
there are any plans for snapshots (or summary tables) in postgres, but
havent found any answers regarding this. Someone mentioned that the
locking mechanism used by postgres makes it difficult to implement. Do
you know if there is any truth in this? Also, difficult is not the
same as impossible, so are there any plans for such?
Can you elaborate by what you mean by that?

First guess is that you're thinking of something akin to materialized
views, perhaps "materialized aggregate views." The challenges with
that aren't about locking mechanism, but rather with doing all the
work about three times. (Once, in the source table, once, to figure
out what to update in the "child" tables, and, in effect, once to
actually put the updates into the "child" tables.)

But it's quite possible that I am guessing badly.

A coworker is working on things akin to this; the answers are still
emerging. His approach has been to have "loader" programs in Java
that generate summary information as it loads in the details. That
amounts to doing the work twice.

I have kept harping on the notion that the best idea is to start by
generating views that calculate summary information, and tuning that
when it turns out to be truly necessary to do so. You can't readily
know, a priori, what parts are going to be slow and hence need
optimization, so there's little sense in optimizing until you know
what parts are slow and need it. That's certainly not the same thing
as "locking problems," so it is possible that I'm on a different
page...
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','cbbrowne.com').
http://www.ntlug.org/~cbbrowne/spreadsheets.html
Rules of the Evil Overlord #211. "If my chief engineer displeases me,
he will be shot, not imprisoned in the dungeon or beyond the traps he
helped design." <http://www.eviloverlord.com/>


Reply With Quote
  #7  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: COUNT efficiently? - 02-20-2004 , 01:44 PM



Christopher Browne <cbbrowne (AT) acm (DOT) org> wrote

Quote:
Centuries ago, Nostradamus foresaw when lennart (AT) kommunicera (DOT) umea.se (Lennart Jonsson) would write:
Christopher Browne <cbbrowne (AT) acm (DOT) org> wrote

Centuries ago, Nostradamus foresaw when "Robert Klemme" <bob.news (AT) gmx (DOT) net> would write:
"Steve" <nospam@nopes> schrieb im Newsbeitrag
news:4032ea88$1 (AT) clarion (DOT) carno.net.au...
Hi,

[...]

Quote:
Hi Christopher, judging from earlier posts you seem to have a lot of
knowledge and insight about postgres. I've been trying to find out if
there are any plans for snapshots (or summary tables) in postgres, but
havent found any answers regarding this. Someone mentioned that the
locking mechanism used by postgres makes it difficult to implement. Do
you know if there is any truth in this? Also, difficult is not the
same as impossible, so are there any plans for such?

Can you elaborate by what you mean by that?

First guess is that you're thinking of something akin to materialized
views, perhaps "materialized aggregate views." The challenges with
that aren't about locking mechanism, but rather with doing all the
work about three times. (Once, in the source table, once, to figure
out what to update in the "child" tables, and, in effect, once to
actually put the updates into the "child" tables.)

But it's quite possible that I am guessing badly.

Not that badly. Yes I'm thinking about "materialized aggregate views"
(I'm somewhat colored by DB2, which is why I used the term summary
table). Reason I ask is because I find them very nifty from time to
time. Unfortenate DB2 is a bit to expensive for my "hobbie" projects,
so I'm turning towards postgres in those cases.

Quote:
A coworker is working on things akin to this; the answers are still
emerging. His approach has been to have "loader" programs in Java
that generate summary information as it loads in the details. That
amounts to doing the work twice.

I have kept harping on the notion that the best idea is to start by
generating views that calculate summary information, and tuning that
when it turns out to be truly necessary to do so. You can't readily
know, a priori, what parts are going to be slow and hence need
optimization, so there's little sense in optimizing until you know
what parts are slow and need it.
In general I agree. Still it is much easier to transform a view into a
summary table when there is a need for optimization, than to start
mucking around with triggers and such to update "aggregate values" in
"manual summary tables". The summary table in db2 can also be
specified such that the optimizer can use it. Thus, by creating a
summary table a critical query can run faster without rewriting it.

The other side of the coin is of course that you have to be carefull
on how you define summary tables inorder to avoid locking problems.


Quote:
That's certainly not the same thing
as "locking problems," so it is possible that I'm on a different
page...
I only got a fragment of an explanation for what he meant. I was a bit
surprised
, but since I didnt know very much about postgres nor locking
mechanisms in general, I dropped the question and coped with that they
dont exist.

Thanx for your answers

/Lennart


Reply With Quote
  #8  
Old   
Christopher Browne
 
Posts: n/a

Default Re: COUNT efficiently? - 02-20-2004 , 10:30 PM



Clinging to sanity, lennart (AT) kommunicera (DOT) umea.se (Lennart Jonsson) mumbled into her beard:
Quote:
In general I agree. Still it is much easier to transform a view into
a summary table when there is a need for optimization, than to start
mucking around with triggers and such to update "aggregate values"
in "manual summary tables". The summary table in db2 can also be
specified such that the optimizer can use it. Thus, by creating a
summary table a critical query can run faster without rewriting it.
There's a project ongoing for this...
<http://gborg.postgresql.org/project/matview/projdisplay.php>

Not clear that it will be "complete" terribly soon, but it at least
provides a place to look for answers...
--
"cbbrowne","@","cbbrowne.com"
http://cbbrowne.com/info/linux.html
If you have a procedure with ten parameters, you probably missed some.
-- Alan J. Perlis


Reply With Quote
  #9  
Old   
Lennart Jonsson
 
Posts: n/a

Default Re: COUNT efficiently? - 02-21-2004 , 04:44 AM



Christopher Browne <cbbrowne (AT) acm (DOT) org> wrote

Quote:
Clinging to sanity, lennart (AT) kommunicera (DOT) umea.se (Lennart Jonsson) mumbled into her beard:
[...]

Quote:
There's a project ongoing for this...
http://gborg.postgresql.org/project/...rojdisplay.php

Not clear that it will be "complete" terribly soon, but it at least
provides a place to look for answers...
Thanx, I'll have a look

/Lennart


Reply With Quote
  #10  
Old   
AK
 
Posts: n/a

Default Re: COUNT efficiently? - 02-21-2004 , 10:11 PM



yoou could implement a staging table, similar to incremental refresh
in DB2 V8.1
Incremental refresh takes care of severe locking issues that used to
prevent from using MQTs with immediate refresh.

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.