![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
I have a general question about how best to cache query results that are too expensive to calculate each time they're needed. The question is that when I have a query like: SELECT e.id, e.title FROM entries e JOIN feeds f ON e.feed_id = f.id WHERE f.url = 'http://example.com' AND e.id > (select id from entries where pub_date current_timestamp - interval '5 days' order by pub_date desc limit 1); The last restriction in the query is the one that is too expensive to calculate every time its needed. It's purpose is to restrict the entries to those that are newer than a certain date, except that some entries don't specify a date, so I use the fact that ids are in ascending order in order of creation and find the highest id that's at least 5 days old. |
#3
| ||||
| ||||
|
|
I asked this question over in mailing.database.pgsql-sql a couple of days ago and got no answer. Perhaps the question was ill-phrased, or perhaps that was not the best forum. I'm reposting my question in here in the hope that someone can give me some general advice or suggestions for further research. -- I have a general question about how best to cache query results that are too expensive to calculate each time they're needed. The question is that when I have a query like: SELECT e.id, e.title FROM entries e JOIN feeds f ON e.feed_id = f.id WHERE f.url = 'http://example.com' AND e.id > (select id from entries where pub_date current_timestamp - interval '5 days' order by pub_date desc limit 1); The last restriction in the query is the one that is too expensive to calculate every time its needed. It's purpose is to restrict the entries to those that are newer than a certain date, except that some entries don't specify a date, so I use the fact that ids are in ascending order in order of creation and find the highest id that's at least 5 days old. In actuality, I don't need the relevant e.id to be updated more than once a day or so. It definitely does not need to be realtime, and is prohibitively expensive to calculate the actual value. What are general approaches to this kind of problem? |
|
One obvious solution to me is to create a table for storing just this value, but something seems wrong to me about potentially lots of little 1-column, 1-row tables. |
|
If there are multiple of this values that need to periodically updated, then they could all be in 1 miscellaneous table, but then I would have the problem of deciding what to make the primary key for each row, and it would have to be hard-coded in the update and select queries to distinguish this cached value from other cached values in the same table, as well as dealing with values having different types. |
|
I hope my question is clear enough. Thanks for any suggestions about best strategies for dealing with this kind of issue. |
#4
| |||
| |||
|
|
On 03.05.2007 02:12, j.k. wrote: How do you do that if not all records have a pub_date? Consider id pub_date 1 2007-01-01 2 2007-01-09 3 (NULL) 4 2007-02-01 Now, if your limit is 2007-01-17 you cannot know whether your max id is 3 or 2 - whatever you do, with this data you will always get 2 because the RDBMS cannot decide whether 3 was before or after 2007-01-17. |
|
First of all, IMHO the sub query can be improved. Because what you really want is the MAX. So, I'd rather do -- slightly different semantics, but should work according -- to your description select max(id) from entries where pub_date < current_timestamp - interval '5 days' -- or, more like your original, in case ids might not be ordered select max(id) from entries, ( select max(pub_date) max_date from entries where pub_date < current_timestamp - interval '5 days' ) e where entries.pub_date = e.max_date |
|
Note, there might be other approaches - probably even more efficient ones. Then, I'd do proper indexing. (Btw, you do not mention DDL - that would certainly help.) Only if it is still slow then I'd consider "caching". |
#5
| |||||||||||||
| |||||||||||||
|
|
On May 2, 8:12 pm, "j.k." <spaeci... (AT) gmail (DOT) com> wrote: What are general approaches to this kind of problem? Don't let pub_date be null. Seriously, you are making workarounds for failing to meet what is the obvious business requirement. |
|
Your assumption about the ID's is bogus because someone could back to fill in publication date with a value that does not fit your assumption. You did not mention any guarantee that the date eventually provided will fit your model. To use Robert's sample data: id pub_date 1 2007-01-01 2 2007-01-09 3 (NULL) 4 2007-02-01 what prevents someone from updating id=3 to pub_date=2007-01-06? |
|
So can you provide a default date when a new article is logged in this table? |
|
If the ID's are guaranteed to be sequential with pub_date, then possible defaults are either the system date when the article is loaded, or the pub_date (or maybe pub_date+1day?) of the previous article. Lacking rules for a default pub_date, then the data entry needs to require the pub_date before inserting the article. |
|
If you really must continue to use NULL on pub_date, then you must consider whether the rows with NULL really belong in your result set. |
|
You really have two result sets: one set fitting the date requested, and one for all the articles with NULL pub_date. what of there was a row: 0 NULL if you selected for pub_date 2007-01-01, WHY is it not in your result set? What business rule? That rule should suggest wht the real criteria is on your query, since it is not just pub_date. By pub_date alone, either all the NULL dates should be included or all shouled be excluded. You need to know and state the rules. Your current query is imbuing the id attribute with properties it doesn't explicitly have. You may need to rethink your data model, not just your query. |
|
One obvious solution to me is to create a table for storing just this value, but something seems wrong to me about potentially lots of little 1-column, 1-row tables. This is not obvious. I do not see this as a "1-column, 1-row table". |
|
If there are multiple of this values that need to periodically updated, then they could all be in 1 miscellaneous table, but then I would have the problem of deciding what to make the primary key for each row, and it would have to be hard-coded in the update and select queries to distinguish this cached value from other cached values in the same table, as well as dealing with values having different types. the last part of this ^^^ paragraph doesn't make sense. What "cache" are you talking about? |
|
Queries deal with tables and columns. |
|
IF I were to add another table, I think it would have the default date I was describing above, so it would need the id attribute as well. IOW, it would be a pub_date search table, where the attributes are the date column (constrained to NOT NULL) and the id for that date. The PK would be the composite date and id. It still has the issue of defining the default date! |
|
There is nothing "hard-coded" here, just an extra level of indirection on SELECT and triggers on INSERT and UPDATE (pgsql does have triggers, right?) |
|
Most of it was clear enough. The best strategy is always to meet the business requirements. Since you haven't given us all of them (understandably in the limited space of a newsgroup posting) we must make assumptions. I hope I made my assumptions clear. |
|
Ed PS (I personally dislike generic ID attributes as the PK of entities. When not carefully designed into the data model, it too often leads to pain and sorrow.) |
![]() |
| Thread Tools | |
| Display Modes | |
| |