dbTalk Databases Forums  

Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed

mailing.database.pgsql-bugs mailing.database.pgsql-bugs


Discuss Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed in the mailing.database.pgsql-bugs forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Sean Chittenden
 
Posts: n/a

Default Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed - 05-04-2004 , 12:42 AM






Quote:
I'v find out that this error occurs in:
dependency.c file

2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of relation
149064743 failed
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist

in getRelationDescription(StringInfo buffer, Oid relid) function.

Any ideas what can cause this errors.
<aol>Me too.</aol>

But, I am suspecting that it's a race condition with the new background
writer code. I've started testing a new database design and was able
to reproduce this on my laptop nearly 90% of the time, but could only
reproduce it about 10% of the time on my production databases until I
figured out what the difference was, fsync.

fsync was causing enough of a slow down that SearchSysCache() was
finding the tuple, whereas with fsync = false, it wasn't able to find
it. But, in search of proving that it wasn't fsync (I use fsync =
false on my laptop to save my pour drive), I threw in a sleep in
between my tests, and I'm able to get things to work 100% of the time
by adding a sleep. The following fails to work with fsync = false, 90%
of the time and with fsync = true, only 10% of the time.

% psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql
&& psql -f test-end1.sql template1 && psql -f test-end2.sql template1

But, if I change the command to:

% psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql
&& psql -f test-end1.sql template1 && sleep 1 && psql -f test-end2.sql
template1

I have no problems with cache relation misses. As for what happens in
those commands, I'm:

-- 1) Dropping the test database and re-creating it
-- 2) In a different connection, load a rather large schema as the dba
-- 3) Connect again and create a temp table
-- 4) Connect a second time, and check to see if the temp table exists

The sleep comes at step 3.5 in the above sequence of operations.

*boom* Here's a snippet of my terminal (the first thing I do after
BEGINning a transaction is create a temp table if it doesn't exist):

## BEGIN ##
[snip]
[...]
COMMIT
You are now connected to database "test" as user "usr".
BEGIN
psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033
CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN
pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
pg_catalog.pg_table_is_visible(c.oid)"
PL/pgSQL function "create_tmptbl" line 2 at perform
PL/pgSQL function "check_or_populate_func" line 8 at assignment
PL/pgSQL function "setuid_wrapper_func" line 5 at return
## END ##

What's really bothering me is I can push the up arrow on the console,
run the exact same thing (including dropping the database), and it'll
work sometimes. Very disturbing. As I said, I'm *very* suspicious of
the background writer goo that Jan added simply because I can't think
of anything else that'd have this problem.

I've run each of those commands 100 times now, with and without the
sleep 1. With the sleep 1, it's worked 100% of the time. Jan, any bit
of code that comes to mind?

All of my bgwriter_* settings are set to their default.

-sc

--
Sean Chittenden


---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo (AT) postgresql (DOT) org)


Reply With Quote
  #2  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed - 05-04-2004 , 01:06 AM






Sean Chittenden <sean (AT) chittenden (DOT) org> writes:
Quote:
But, I am suspecting that it's a race condition with the new background
writer code.
Why? Have you demonstrated that the failure does not occur in 7.4?

Quote:
psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033
CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN
pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
pg_catalog.pg_table_is_visible(c.oid)"
I think that pg_table_is_visible() will inspect the catalogs using
SnapshotNow semantics, while the above query will feed it with OIDs that
were valid under a start-of-query snapshot. So I'd expect failures in
any recent PG version, if tables are being dynamically created/deleted
by concurrent transactions.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match


Reply With Quote
  #3  
Old   
Sean Chittenden
 
Posts: n/a

Default Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed - 05-04-2004 , 02:29 AM



[Renames thread from "The Pepsi Challenge" to "The PostgreSQL
Challenge"]

Quote:
But, I am suspecting that it's a race condition with the new
background
writer code.

Why? Have you demonstrated that the failure does not occur in 7.4?
What other operations have been added to HEAD that would allow for
successful operation of sequential use or testing of temp tables?

More importantly, yes, I can confirm that this behavior doesn't exist
in REL7_4 as of 40min ago. I've been running the exact same tests
repeatedly with nearly identical configs (as close as they can be given
the config changes) on REL7_4 as I have HEAD and only HEAD is giving me
problems.

Quote:
psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033
CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN
pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
pg_catalog.pg_table_is_visible(c.oid)"

I think that pg_table_is_visible() will inspect the catalogs using
SnapshotNow semantics, while the above query will feed it with OIDs
that
were valid under a start-of-query snapshot. So I'd expect failures in
any recent PG version, if tables are being dynamically created/deleted
by concurrent transactions.
There is no concurrency in the test I gave: it's all sequential.

0 | -- connect
1 | BEGIN;
2 | SELECT setuid_wrapper();
3 | -- Do other things;
4 | COMMIT;
5 | -- disconnect
6 | -- connect again
7 | BEGIN;
8 | SELECT setuid_wrapper(); -- This fails some % of the time

If I add step 5.5 that is a sleep, step 8 will work, without fail. In
7.4, I have no problems, however. That said, the bgwriter code is the
only think that I can think of that would muck with caching. -sc


PS Other comments temp schema permission patch?

--
Sean Chittenden


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster


Reply With Quote
  #4  
Old   
Sean Chittenden
 
Posts: n/a

Default Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed - 05-05-2004 , 03:53 PM



Quote:
I'v find out that this error occurs in:
dependency.c file

2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of
relation
149064743 failed
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist

in getRelationDescription(StringInfo buffer, Oid relid) function.

Any ideas what can cause this errors.
aol>Me too.</aol
But, I am suspecting that it's a race condition with the new
background writer code. I've started testing a new database design
and was able to reproduce this on my laptop nearly 90% of the time,
but could only reproduce it about 10% of the time on my production
databases until I figured out what the difference was, fsync.

temp tables don't use the shared buffer cache, how can this be related
to the BG writer?
Don't the system catalogs use the shared buffer cache?

BEGIN;
SELECT create_temp_table_func(); -- Inserts a row into pg_class via
CREATE TEMP TABLE
-- Do other stuff
COMMIT; -- After the commit, the row is now visible to other
backends
-- disconnect -- If the delay between the disconnect and reconnect is
small enough
-- reconnect -- It's as though there is a race condition that allows
the function
-- pg_table_is_visible() to assert the "cache lookup of relation"
-- error.
BEGIN;
SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I
call
/* SELECT TRUE FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = ''footmp''::TEXT AND
c.relkind = ''r''::TEXT AND
pg_catalog.pg_table_is_visible(c.oid); */
-- But the query fails

My guess was that the series of events went something like:

proc 0) COMMIT's and the row in pg_class is committed
proc 1) bgwriter writer code removes a page for the cache
proc 2) queries for the page[*]
proc 1) writes it to disk
proc 2) queries for the page[*]
proc 1) sync's the fd
[*] proc 2 queries for the page at either of these points

In 7.4, there is no bgwriter or background process mucking with cache,
which is why this works 100% of the time. In 7.5, however, there's a
200ms gap where a race condition appears and pg_table_is_visible()
fails its PointerIsValid() check. If I put a sleep in, the sleep gives
the bgwriter enough time to commit the pages to disk so that the
queries for the page happen after the fd's been sync()'ed.

I have no other clue as to why this would be happening though, so
believe me when I say, I could very well be quite wrong.... but this is
my best, quasi-educated/grep(1)'ed guess.

-sc

--
Sean Chittenden


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org


Reply With Quote
  #5  
Old   
Sean Chittenden
 
Posts: n/a

Default Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed - 05-06-2004 , 01:03 AM



Quote:
temp tables don't use the shared buffer cache, how can this be
related to the BG writer?
Don't the system catalogs use the shared buffer cache?
BEGIN;
SELECT create_temp_table_func(); -- Inserts a row into pg_class via
CREATE TEMP TABLE
-- Do other stuff
COMMIT; -- After the commit, the row is now visible to other
backends
-- disconnect -- If the delay between the disconnect and reconnect
is small enough
-- reconnect -- It's as though there is a race condition that allows
the function
-- pg_table_is_visible() to assert the "cache lookup of relation"
-- error.
BEGIN;
SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I
call
/* SELECT TRUE FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = ''footmp''::TEXT AND
c.relkind = ''r''::TEXT AND
pg_catalog.pg_table_is_visible(c.oid); */
-- But the query fails
My guess was that the series of events went something like:
proc 0) COMMIT's and the row in pg_class is committed
proc 1) bgwriter writer code removes a page for the cache
proc 2) queries for the page[*]
proc 1) writes it to disk
proc 2) queries for the page[*]
proc 1) sync's the fd[*] proc 2 queries for the page at either of these points
In 7.4, there is no bgwriter or background process mucking with cache,

Except for the checkpoint process, which does exactly the same as the
bgwriter does, and ALL concurrent backends whenever they feel the need
to evict a dirty buffer.
Hrm... well, haven't the slightest idea what would be causing this
then. About all I can say is that some problem does exist in HEAD that
doesn't exist in REL7_4 that I'm able to tickle via temp tables. :-/
Because this is time sensitive, what debugging foo could I insert to
get some useful diagnostic output?

Quote:
If it makes a difference if a pg_class page is dirty in the buffer or
copied out to disk with respect to visibility rules of the tuples
contained in it, then the whole thing is a way larger bug than the one
in MIB. First of all, committed or not, a temp object from one session
should NEVER be visible in any other.
Hrm... well, I'm going to take my test scripts and reduce them down to
a test case. For sure, there's something different in HEAD that's
causing problems that are time sensitive. I've even thought about
grabbing my camera and making a low res 320x200 movie of the test
sequence. I went and ran script(1) on one of the runs for the sake of
something.

http://sean.chittenden.org/postgresq...eate-temp-bug-
typescript.txt

Any help or assistance is greatly appreciated, I'm not sure where to go
from here. -sc

--
Sean Chittenden


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo (AT) postgresql (DOT) org so that your
message can get through to the mailing list cleanly


Reply With Quote
  #6  
Old   
Jan Wieck
 
Posts: n/a

Default Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed - 05-06-2004 , 06:25 PM



Sean Chittenden wrote:
Quote:
I'v find out that this error occurs in:
dependency.c file

2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of relation
149064743 failed
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist

in getRelationDescription(StringInfo buffer, Oid relid) function.

Any ideas what can cause this errors.

aol>Me too.</aol

But, I am suspecting that it's a race condition with the new background
writer code. I've started testing a new database design and was able
to reproduce this on my laptop nearly 90% of the time, but could only
reproduce it about 10% of the time on my production databases until I
figured out what the difference was, fsync.
temp tables don't use the shared buffer cache, how can this be related
to the BG writer?


Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = JanWieck (AT) Yahoo (DOT) com #


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster


Reply With Quote
  #7  
Old   
Jan Wieck
 
Posts: n/a

Default Re: [BUGS] [GENERAL] cache lookup of relation 165058647 failed - 05-06-2004 , 06:42 PM



Sean Chittenden wrote:

Quote:
I'v find out that this error occurs in:
dependency.c file

2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of
relation
149064743 failed
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist

in getRelationDescription(StringInfo buffer, Oid relid) function.

Any ideas what can cause this errors.
aol>Me too.</aol
But, I am suspecting that it's a race condition with the new
background writer code. I've started testing a new database design
and was able to reproduce this on my laptop nearly 90% of the time,
but could only reproduce it about 10% of the time on my production
databases until I figured out what the difference was, fsync.

temp tables don't use the shared buffer cache, how can this be related
to the BG writer?

Don't the system catalogs use the shared buffer cache?

BEGIN;
SELECT create_temp_table_func(); -- Inserts a row into pg_class via
CREATE TEMP TABLE
-- Do other stuff
COMMIT; -- After the commit, the row is now visible to other
backends
-- disconnect -- If the delay between the disconnect and reconnect is
small enough
-- reconnect -- It's as though there is a race condition that allows
the function
-- pg_table_is_visible() to assert the "cache lookup of relation"
-- error.
BEGIN;
SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I
call
/* SELECT TRUE FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = ''footmp''::TEXT AND
c.relkind = ''r''::TEXT AND
pg_catalog.pg_table_is_visible(c.oid); */
-- But the query fails

My guess was that the series of events went something like:

proc 0) COMMIT's and the row in pg_class is committed
proc 1) bgwriter writer code removes a page for the cache
proc 2) queries for the page[*]
proc 1) writes it to disk
proc 2) queries for the page[*]
proc 1) sync's the fd
[*] proc 2 queries for the page at either of these points

In 7.4, there is no bgwriter or background process mucking with cache,
Except for the checkpoint process, which does exactly the same as the
bgwriter does, and ALL concurrent backends whenever they feel the need
to evict a dirty buffer.

If it makes a difference if a pg_class page is dirty in the buffer or
copied out to disk with respect to visibility rules of the tuples
contained in it, then the whole thing is a way larger bug than the one
in MIB. First of all, committed or not, a temp object from one session
should NEVER be visible in any other.


Jan

Quote:
which is why this works 100% of the time. In 7.5, however, there's a
200ms gap where a race condition appears and pg_table_is_visible()
fails its PointerIsValid() check. If I put a sleep in, the sleep gives
the bgwriter enough time to commit the pages to disk so that the
queries for the page happen after the fd's been sync()'ed.

I have no other clue as to why this would be happening though, so
believe me when I say, I could very well be quite wrong.... but this is
my best, quasi-educated/grep(1)'ed guess.

-sc


--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = JanWieck (AT) Yahoo (DOT) com #


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.