![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
In the postgresql.log a write-failure messages was repeated enough to make the log file 50MB larger: [ - 2005-09-07 13:03:47 CEST @] ERROR: xlog flush request 29/67713428 is not satisfied --- flushed only to 29/2E73E488 [ - 2005-09-07 13:03:47 CEST @] CONTEXT: writing block 21 of relation 1663/2013826/9975789 ... TopMemoryContext: -1095880208 total in 264213 blocks; 537938888 free (924739 chunks); -1633819096 used MdSmgr: 8192 total in 1 blocks; 8024 free (0 chunks); 168 used Pending Ops Table: 8192 total in 1 blocks; 6112 free (0 chunks); 2080 used DynaHash: 8192 total in 1 blocks; 7488 free (0 chunks); 704 used smgr relation table: 8192 total in 1 blocks; 4048 free (0 chunks); 4144 used LockTable (locallock hash): 8192 total in 1 blocks; 6112 free (0 chunks); 2080 used ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used [ - 2005-09-09 02:42:22 CEST @] ERROR: out of memory [ - 2005-09-09 02:42:22 CEST @] DETAIL: Failed on request of size 16000. |
#3
| |||
| |||
|
|
It had taken over 800MB of memory. The machine is a P4 3.0 Ghz (with HT enabled), 1GB of memory a few SATA disks and runs with a recent Gentoo + 2.6.12.5 kernel. The postgresql version that failed was 8.0.3, it may or may not be worth knowing that it runs a 8.1devel on another port and from another directory. In the postgresql.log a write-failure messages was repeated enough to make the log file 50MB larger: |
#4
| |||
| |||
|
|
As a matter of fact, I just tested it again. Issuing that same query will trigger the issue again when I execute it. I don't know whether the query matters, but I know this one will trigger it, so I didn't try others. |
|
Since its a development machine, access to it is a possibility. But if you can give me a few pointers how to gather usefull information without any "stranger" accessing the machine, that'd be nice. |
#5
| |||
| |||
|
|
On 13-9-2005 16:25, Tom Lane wrote: The first thing you ought to find out is which table 1663/2013826/9975789 is, and look to see if the corrupted LSN value is already present on disk in that block. Well, its an index, not a table. It was the index: "pg_class_relname_nsp_index" on pg_class(relname, relnamespace). |
|
Using pg_filedump I extracted the LSN for block 21 and indeed, that was already 67713428 instead of something below 2E73E53C. It wasn't that block alone though, here are a few LSN-lines from it: LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0) LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0) LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0) LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0) LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0) LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0) LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0) |
|
On that day I did some active query-tuning, but a few times it took too long, so I issued immediate shut downs when the selects took too long. There were no warnings about broken records afterwards in the log though, so I don't believe anything got damaged afterwards. |
#6
| |||
| |||
|
|
Arjen van der Meijden <acm (AT) tweakers (DOT) net> writes: On 13-9-2005 16:25, Tom Lane wrote: Well, its an index, not a table. It was the index: "pg_class_relname_nsp_index" on pg_class(relname, relnamespace). Ah. So you've reindexed pg_class at some point. Reindexing it again would likely get you out of this. |
|
Using pg_filedump I extracted the LSN for block 21 and indeed, that was already 67713428 instead of something below 2E73E53C. It wasn't that block alone though, here are a few LSN-lines from it: LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0) LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0) LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0) LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0) LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0) LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0) LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0) logid is the high-order half of the LSN, so there's nothing wrong with those other pages --- it's only the first one you show there that seems to be past the current end of WAL. |
|
On that day I did some active query-tuning, but a few times it took too long, so I issued immediate shut downs when the selects took too long. There were no warnings about broken records afterwards in the log though, so I don't believe anything got damaged afterwards. I have a feeling something may have gone wrong here, though it's hard to say what. If the bogus pages in the other tables all have LSNs close to this one then that makes it less likely that this is a random corruption event --- what would be more plausible is that end of WAL really was that high and somehow the WAL counter got reset back during one of those forced restarts. Can you show us ls -l output for the pg_xlog directory? I'm interested to see the file names and mod dates there. |
#7
| |||
| |||
|
|
Unless reindexing is part of other commands, I didn't do that. The last time 'grep' was able to find an reference to something being reindexed was in June, something (maybe me, but I doubt it, I'd also reindex the user-tables, I suppose) was reindexing all system tables back then. |
|
l /var/lib/postgresql/data/pg_xlog/ total 145M drwx------ 3 postgres postgres 4.0K Sep 1 12:37 . drwx------ 8 postgres postgres 4.0K Sep 13 20:31 .. -rw------- 1 postgres postgres 16M Sep 13 19:25 00000001000000290000002E -rw------- 1 postgres postgres 16M Sep 1 12:36 000000010000002900000067 -rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000068 -rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000069 -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006A -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006B -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006C -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006D -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006E |
#8
| |||
| |||
|
|
Arjen van der Meijden <acm (AT) tweakers (DOT) net> writes: Unless reindexing is part of other commands, I didn't do that. The last time 'grep' was able to find an reference to something being reindexed was in June, something (maybe me, but I doubt it, I'd also reindex the user-tables, I suppose) was reindexing all system tables back then. Well, that would have done it. But if you have such complete logs, do you have the postmaster log from the whole period? I have logs from the time I started using it seriously indeed. It was |
|
Look for sequences approximately like this: and note whether there's any one that shows a WAL end address (here 0/12B65A88) that's less than the one before. |
|
l /var/lib/postgresql/data/pg_xlog/ total 145M drwx------ 3 postgres postgres 4.0K Sep 1 12:37 . drwx------ 8 postgres postgres 4.0K Sep 13 20:31 .. -rw------- 1 postgres postgres 16M Sep 13 19:25 00000001000000290000002E -rw------- 1 postgres postgres 16M Sep 1 12:36 000000010000002900000067 -rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000068 -rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000069 -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006A -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006B -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006C -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006D -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006E This is definitely ugly: there's no way all those higher-numbered segments would be in place if the WAL counter hadn't been up to that range. So I think we are definitely looking at a WAL-counter-went- backward problem, not some random data corruption. Judging from the file dates, the WAL counter was still OK (ie, above 67xxx) on Sep 1, and the problem must have occurred since then. |
#9
| |||
| |||
|
|
These are all the lines there are for Sep 1: [ - 2005-09-01 12:36:50 CEST @] LOG: received fast shutdown request [ - 2005-09-01 12:36:50 CEST @] LOG: shutting down [ - 2005-09-01 12:36:50 CEST @] LOG: database system is shut down [ - 2005-09-01 12:37:01 CEST @] LOG: received smart shutdown request [ - 2005-09-01 12:37:01 CEST @] LOG: shutting down [ - 2005-09-01 12:37:01 CEST @] LOG: database system is shut down |
#10
| |||
| |||
|
|
Arjen van der Meijden <acm (AT) tweakers (DOT) net> writes: That's all? There's something awfully suspicious about that. You're sure this is 8.0.3? |
|
AFAICS it is absolutely impossible for the 8.0 postmaster.c code to emit "received smart shutdown request" after emitting "received fast shutdown request". The SIGINT code looks like and the SIGTERM code looks like and there are no other places that change the value of Shutdown, and certainly FastShutdown > SmartShutdown. So I wonder if something got lost in the log entries. |
|
Another question is why the postmaster didn't exit at 12:36:50. It was not waiting on any backends, else it would not have launched the shutdown process (which is what emits the other two messages). [ thinks for a bit ... ] I wonder if Shutdown ought to be marked volatile, since it is after all changed by a signal handler. But given the way the postmaster is coded, this doesn't seem likely to be an issue. Basically all of the code runs with signals blocked. Can you try to reconstruct what you did on Sep 1, and see whether you can reproduce the above behavior? |
![]() |
| Thread Tools | |
| Display Modes | |
| |