dbTalk Databases Forums  

general question about how best to cache expensive query results

comp.databases comp.databases


Discuss general question about how best to cache expensive query results in the comp.databases forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
j.k.
 
Posts: n/a

Default general question about how best to cache expensive query results - 05-02-2007 , 07:12 PM






I asked this question over in mailing.database.pgsql-sql a couple of
days ago and got no answer. Perhaps the question was ill-phrased, or
perhaps that was not the best forum.

I'm reposting my question in here in the hope that someone can give me
some general advice or suggestions for further research.

--

I have a general question about how best to cache query results that
are too expensive to calculate each time they're needed.

The question is that when I have a query like:

SELECT
e.id, e.title
FROM
entries e
JOIN feeds f ON e.feed_id = f.id
WHERE
f.url = 'http://example.com'
AND e.id > (select id from entries where pub_date <
current_timestamp - interval '5 days' order by pub_date desc limit 1);

The last restriction in the query is the one that is too expensive to
calculate every time its needed. It's purpose is to restrict the
entries to those that are newer than a certain date, except that some
entries don't specify a date, so I use the fact that ids are in
ascending order in order of creation and find the highest id that's at
least 5 days old.

In actuality, I don't need the relevant e.id to be updated more than
once a day or so. It definitely does not need to be realtime, and is
prohibitively expensive to calculate the actual value.

What are general approaches to this kind of problem?

One obvious solution to me is to create a table for storing just this
value, but something seems wrong to me about potentially lots of
little 1-column, 1-row tables.

If there are multiple of this values that need to periodically
updated, then they could all be in 1 miscellaneous table, but then I
would have the problem of deciding what to make the primary key for
each row, and it would have to be hard-coded in the update and select
queries to distinguish this cached value from other cached values in
the same table, as well as dealing with values having different types.

I hope my question is clear enough. Thanks for any suggestions about
best strategies for dealing with this kind of issue.


Reply With Quote
  #2  
Old   
Robert Klemme
 
Posts: n/a

Default Re: general question about how best to cache expensive query results - 05-03-2007 , 03:27 AM






On 03.05.2007 02:12, j.k. wrote:
Quote:
I have a general question about how best to cache query results that
are too expensive to calculate each time they're needed.

The question is that when I have a query like:

SELECT
e.id, e.title
FROM
entries e
JOIN feeds f ON e.feed_id = f.id
WHERE
f.url = 'http://example.com'
AND e.id > (select id from entries where pub_date
current_timestamp - interval '5 days' order by pub_date desc limit 1);

The last restriction in the query is the one that is too expensive to
calculate every time its needed. It's purpose is to restrict the
entries to those that are newer than a certain date, except that some
entries don't specify a date, so I use the fact that ids are in
ascending order in order of creation and find the highest id that's at
least 5 days old.
How do you do that if not all records have a pub_date? Consider

id pub_date
1 2007-01-01
2 2007-01-09
3 (NULL)
4 2007-02-01

Now, if your limit is 2007-01-17 you cannot know whether your max id is
3 or 2 - whatever you do, with this data you will always get 2 because
the RDBMS cannot decide whether 3 was before or after 2007-01-17.


First of all, IMHO the sub query can be improved. Because what you
really want is the MAX. So, I'd rather do

-- slightly different semantics, but should work according
-- to your description
select max(id)
from entries
where pub_date < current_timestamp - interval '5 days'

-- or, more like your original, in case ids might not be ordered
select max(id)
from entries, (
select max(pub_date) max_date
from entries
where pub_date < current_timestamp - interval '5 days'
) e
where entries.pub_date = e.max_date

Note, there might be other approaches - probably even more efficient ones.

Then, I'd do proper indexing. (Btw, you do not mention DDL - that
would certainly help.)

Only if it is still slow then I'd consider "caching".

Regards

robert


Reply With Quote
  #3  
Old   
Ed Prochak
 
Posts: n/a

Default Re: general question about how best to cache expensive query results - 05-04-2007 , 07:12 AM



On May 2, 8:12 pm, "j.k." <spaeci... (AT) gmail (DOT) com> wrote:
Quote:
I asked this question over in mailing.database.pgsql-sql a couple of
days ago and got no answer. Perhaps the question was ill-phrased, or
perhaps that was not the best forum.

I'm reposting my question in here in the hope that someone can give me
some general advice or suggestions for further research.

--

I have a general question about how best to cache query results that
are too expensive to calculate each time they're needed.

The question is that when I have a query like:

SELECT
e.id, e.title
FROM
entries e
JOIN feeds f ON e.feed_id = f.id
WHERE
f.url = 'http://example.com'
AND e.id > (select id from entries where pub_date
current_timestamp - interval '5 days' order by pub_date desc limit 1);

The last restriction in the query is the one that is too expensive to
calculate every time its needed. It's purpose is to restrict the
entries to those that are newer than a certain date, except that some
entries don't specify a date, so I use the fact that ids are in
ascending order in order of creation and find the highest id that's at
least 5 days old.

In actuality, I don't need the relevant e.id to be updated more than
once a day or so. It definitely does not need to be realtime, and is
prohibitively expensive to calculate the actual value.

What are general approaches to this kind of problem?
Don't let pub_date be null. Seriously, you are making workarounds for
failing to meet what is the obvious business requirement. Your
assumption about the ID's is bogus because someone could back to fill
in publication date with a value that does not fit your assumption.
You did not mention any guarantee that the date eventually provided
will fit your model.

To use Robert's sample data:

id pub_date
1 2007-01-01
2 2007-01-09
3 (NULL)
4 2007-02-01

what prevents someone from updating id=3 to pub_date=2007-01-06?

So can you provide a default date when a new article is logged in this
table? If the ID's are guaranteed to be sequential with pub_date, then
possible defaults are either the system date when the article is
loaded, or the pub_date (or maybe pub_date+1day?) of the previous
article. Lacking rules for a default pub_date, then the data entry
needs to require the pub_date before inserting the article.

If you really must continue to use NULL on pub_date, then you must
consider whether the rows with NULL really belong in your result set.
You really have two result sets: one set fitting the date requested,
and one for all the articles with NULL pub_date. what of there was a
row:
0 NULL
if you selected for pub_date 2007-01-01, WHY is it not in your result
set? What business rule? That rule should suggest wht the real
criteria is on your query, since it is not just pub_date. By pub_date
alone, either all the NULL dates should be included or all shouled be
excluded. You need to know and state the rules. Your current query is
imbuing the id attribute with properties it doesn't explicitly have.
You may need to rethink your data model, not just your query.

Quote:
One obvious solution to me is to create a table for storing just this
value, but something seems wrong to me about potentially lots of
little 1-column, 1-row tables.
This is not obvious. I do not see this as a "1-column, 1-row table".
Quote:
If there are multiple of this values that need to periodically
updated, then they could all be in 1 miscellaneous table, but then I
would have the problem of deciding what to make the primary key for
each row, and it would have to be hard-coded in the update and select
queries to distinguish this cached value from other cached values in
the same table, as well as dealing with values having different types.
the last part of this ^^^ paragraph doesn't make sense. What "cache"
are you talking about? Queries deal with tables and columns. Any
caching is internal and is a matter for your DBA to worry about. Maybe
you really need to talk to him.

IF I were to add another table, I think it would have the default date
I was describing above, so it would need the id attribute as well.
IOW, it would be a pub_date search table, where the attributes are the
date column (constrained to NOT NULL) and the id for that date. The PK
would be the composite date and id. It still has the issue of
defining the default date!

There is nothing "hard-coded" here, just an extra level of indirection
on SELECT and triggers on INSERT and UPDATE (pgsql does have triggers,
right?)

Quote:
I hope my question is clear enough. Thanks for any suggestions about
best strategies for dealing with this kind of issue.
Most of it was clear enough. The best strategy is always to meet the
business requirements. Since you haven't given us all of them
(understandably in the limited space of a newsgroup posting) we must
make assumptions. I hope I made my assumptions clear.

Ed
PS (I personally dislike generic ID attributes as the PK of entities.
When not carefully designed into the data model, it too often leads to
pain and sorrow.)




Reply With Quote
  #4  
Old   
j.k.
 
Posts: n/a

Default Re: general question about how best to cache expensive query results - 05-07-2007 , 04:18 AM



Thanks for the response, Robert. Comments interspersed below...

On May 3, 1:27 am, Robert Klemme <shortcut... (AT) googlemail (DOT) com> wrote:
Quote:
On 03.05.2007 02:12, j.k. wrote:

How do you do that if not all records have a pub_date? Consider

id pub_date
1 2007-01-01
2 2007-01-09
3 (NULL)
4 2007-02-01

Now, if your limit is 2007-01-17 you cannot know whether your max id is
3 or 2 - whatever you do, with this data you will always get 2 because
the RDBMS cannot decide whether 3 was before or after 2007-01-17.

In my data, the dates are actually timestamps and are very dense,
meaning that there are many of them every day and hour. Additionally,
I only need an extremely rough idea of the id, so either would be
fine. My apologies for not making it clear that I do not need an exact
id and that approximation is fine.

Quote:
First of all, IMHO the sub query can be improved. Because what you
really want is the MAX. So, I'd rather do

-- slightly different semantics, but should work according
-- to your description
select max(id)
from entries
where pub_date < current_timestamp - interval '5 days'

-- or, more like your original, in case ids might not be ordered
select max(id)
from entries, (
select max(pub_date) max_date
from entries
where pub_date < current_timestamp - interval '5 days'
) e
where entries.pub_date = e.max_date
Thanks, I like you approach using max better, which is also more
portable, and my ids are ordered, so your first suggestion works.

Quote:
Note, there might be other approaches - probably even more efficient ones.

Then, I'd do proper indexing. (Btw, you do not mention DDL - that
would certainly help.)

Only if it is still slow then I'd consider "caching".

Caution against premature optimization duly noted.

For reasons entirely unrelated to these issues, I no longer need this
query at all.

However, my original question still stands: say a certain query's
calculated value is needed extremely frequently and *is* too expensive
to calculate every time it is needed, what are the standard ways of
storing thes temporary result? Sometimes the database might cache it
well enough, and I can just do the query again and again and not worry
about storing it, but there will be times -- a huge frequently updated
table when reasonably out of date is sufficient -- when I must store
the result.

In the case that the result is more than one row, it seems reasonable
to use a table for storing the temporary result, but what if the
result is just a single value? What if there are many such unrelated
single values?

Thanks again,

j.k.



Reply With Quote
  #5  
Old   
j.k.
 
Posts: n/a

Default Re: general question about how best to cache expensive query results - 05-07-2007 , 06:31 AM



On May 4, 5:12 am, Ed Prochak <edproc... (AT) gmail (DOT) com> wrote:
Quote:
On May 2, 8:12 pm, "j.k." <spaeci... (AT) gmail (DOT) com> wrote:

What are general approaches to this kind of problem?

Don't let pub_date be null. Seriously, you are making workarounds for
failing to meet what is the obvious business requirement.
I wish that I could not let pub_date be null, but I have no control
over that. Third parties -- with whom I have no relation -- create the
feeds (which contain entries) and sometimes don't include the field
that becomes pub_date.

Quote:
Your
assumption about the ID's is bogus because someone could back to fill
in publication date with a value that does not fit your assumption.
You did not mention any guarantee that the date eventually provided
will fit your model.

To use Robert's sample data:

id pub_date
1 2007-01-01
2 2007-01-09
3 (NULL)
4 2007-02-01

what prevents someone from updating id=3 to pub_date=2007-01-06?
The data could never be updated (nothing touches the entry after it is
inserted, so pub_date would never be updated), but you are correct
that the pub_dates are not sequential. I hadn't thought things
through when I sent the original email. The original query does not
work for other reasons. That was my first attempt to limit the entries
that are returned to those within the last 5 days or so, but it fails
horribly when, for example, I've just inserted a 5 day-old entry that
would be the result of that original query. And it fails for nulls too
in more cases than I envisioned originally.

At present, I'm not sure how I'm going to deal with it. The problem is
what you identified further down.

Quote:
So can you provide a default date when a new article is logged in this
table?
In some cases, I can guess a good enough default date or create one
that will work, but unfortunately not in all cases.

Quote:
If the ID's are guaranteed to be sequential with pub_date, then
possible defaults are either the system date when the article is
loaded, or the pub_date (or maybe pub_date+1day?) of the previous
article. Lacking rules for a default pub_date, then the data entry
needs to require the pub_date before inserting the article.
The ids are guaranteed to be sequential, but the pub_dates are not
sequential. Your suggestions are both good, and they would both work
for some cases, but there are still some cases where I couldn't use
either one.

Quote:
If you really must continue to use NULL on pub_date, then you must
consider whether the rows with NULL really belong in your result set.
Yes, I'm not sure. The actual query is something like "get all entries
for a given feed in descending order of pub_date, limited to 100 or
so," except that some feeds don't have a pub_date, and I can't
determine a good guess in all cases.

Quote:
You really have two result sets: one set fitting the date requested,
and one for all the articles with NULL pub_date. what of there was a
row:
0 NULL
if you selected for pub_date 2007-01-01, WHY is it not in your result
set? What business rule? That rule should suggest wht the real
criteria is on your query, since it is not just pub_date. By pub_date
alone, either all the NULL dates should be included or all shouled be
excluded. You need to know and state the rules. Your current query is
imbuing the id attribute with properties it doesn't explicitly have.
You may need to rethink your data model, not just your query.

The only reason the pub_date would be part of the query I gave in
prose above is to provide a sort order so that if there are too many
entries and the results are limited to 100, it is the 100 most recent
ones. So the question reduces to where nulls go in the sort order, and
both first and last are problematic, as you noted.

The only thing I can think to do at present is to have an additional
column that specifically provides the sort criterion. For entries
within a given feed (but not across feeds), it should have the
property that sort_criterion_A is less than sort_criterion_B if and
only if entry A should be sorted after entry B. I don't know the
pub_date in some cases, but since the entries come to me in order from
most recent to least recent (within a single feed), I can insert them
with a sort_criterion value that preserves the ordering.

The primary id actually fulfills this right now, since it is auto-
generated from a sequence, incremented once for each new entry, and
the entries for a given feed are inserted from least recent to most
recent. I have to sleep on this a little, but the explicit sort
criterion idea might work.

Quote:
One obvious solution to me is to create a table for storing just this
value, but something seems wrong to me about potentially lots of
little 1-column, 1-row tables.

This is not obvious. I do not see this as a "1-column, 1-row table".

What I was thinking was that I need to store entry id = 144194 (for
example) somewhere, update it once an hour, and select it many times a
second. The 1-column, 1-row table I was thinking of was:

create table cache (
cached_id integer not null
);

And thus the other questions I gave of what to use for primary key,
etc.

Quote:
If there are multiple of this values that need to periodically
updated, then they could all be in 1 miscellaneous table, but then I
would have the problem of deciding what to make the primary key for
each row, and it would have to be hard-coded in the update and select
queries to distinguish this cached value from other cached values in
the same table, as well as dealing with values having different types.

the last part of this ^^^ paragraph doesn't make sense. What "cache"
are you talking about?
Well, I said 'cached value' and not 'cache', meaning any value that is
stored some place and read from that place rather than being re-
calculated. I was asking what the options are for persisting some
value rather than recalculating it; i.e., do all such values belong in
one table, or in multiple tables, etc.

Quote:
Queries deal with tables and columns.
Agreed. My perhaps poorly phrased questions were about what form the
tables and columns should take.

Quote:
IF I were to add another table, I think it would have the default date
I was describing above, so it would need the id attribute as well.
IOW, it would be a pub_date search table, where the attributes are the
date column (constrained to NOT NULL) and the id for that date. The PK
would be the composite date and id. It still has the issue of
defining the default date!
I will try to fully normalize the last remaining holdouts in the
entries table, including the pub_date. Everything else is very
normalized and no nulls are possible, but I allowed a few potentially
null columns to remain in entries (one of which was pub_date) because
I was worried about performance. That table is already the bottleneck
due to the extent that it is already normalized; the query requires 5
or so joins, and there are many in the table, and new entries are
frequently inserted. Maybe I will start another topic when I have it
fully normalized if the performance is inadequate. (Yeah, I know;
should have been BCNF first before thinking about performance. Forgive
me Knuth or Hoare.)

Quote:
There is nothing "hard-coded" here, just an extra level of indirection
on SELECT and triggers on INSERT and UPDATE (pgsql does have triggers,
right?)
The hard-coding I was thinking of was the name of the 'cache' table
(and the relevant column) I showed above. The insert, update, and
select would all need to know that. What I was originally hoping was
that there was some standard way of being able to specify the ideal
query that I would like to perform and specify additionally that it
shouldn't be performed more than once every hour, and the database
would somehow take care of the rest. I guess I was hoping for
something like a view that just updates once an hour.

The obvious solution I was alluding to was to create a table that
stores the value, and have a trigger that updates it once an hour, and
a stored procedure that retrieves the value if there, creating it if
not. This seemed messy to me.


Quote:
Most of it was clear enough. The best strategy is always to meet the
business requirements. Since you haven't given us all of them
(understandably in the limited space of a newsgroup posting) we must
make assumptions. I hope I made my assumptions clear.
Thanks very much for taking the time to respond so thoroughly. I
apologize for my lack of clarity. I was focusing on the perceived
solution and just wanted the quick means to that end, rather than
presenting the underlying problem and the business requirements. I've
had to think things through a lot more thanks to your response though.

Quote:
Ed
PS (I personally dislike generic ID attributes as the PK of entities.
When not carefully designed into the data model, it too often leads to
pain and sorrow.)
I usually prefer a more meaningful id as well, but wasn't able to
identify anything suitable among the other attributes.

Cheers,

joseph



Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.