#21
On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <tgl (AT) sss (DOT) pgh.pa.us> wrote:
> BTW, after a bit more reflection it occurs to me that it's not so much that the data is necessarily *bad*, as that it seemingly doesn't match the tuple descriptor that the backend's trying to interpret it with.

Hmm. Could this be caused by the recovery process failing to obtain a sufficiently strong lock on a buffer before replaying some WAL record?
#23
Robert Haas <robertmhaas (AT) gmail (DOT) com> writes:
> On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <tgl (AT) sss (DOT) pgh.pa.us> wrote:
>> BTW, after a bit more reflection it occurs to me that it's not so much that the data is necessarily *bad*, as that it seemingly doesn't match the tuple descriptor that the backend's trying to interpret it with.
> Hmm. Could this be caused by the recovery process failing to obtain a sufficiently strong lock on a buffer before replaying some WAL record?

Well, I was kinda speculating that inadequate locking could result in use of a stale (or too-new?) tuple descriptor, and that would be as good a candidate as any if the basic theory were right. But Bridget says they are not doing any DDL, so it's hard to see how there'd be any tuple descriptor mismatch at all.

Still baffled ...
#24
No, I wasn't thinking about a tuple descriptor mismatch. I was imagining that the page contents themselves might be in flux while we're trying to read from it.
#25
Robert Haas <robertmhaas (AT) gmail (DOT) com> writes:
> No, I wasn't thinking about a tuple descriptor mismatch. I was imagining that the page contents themselves might be in flux while we're trying to read from it.

Oh, gotcha. Yes, that's a horribly plausible idea. All it'd take is one WAL replay routine that hasn't been upgraded to acquire sufficient buffer locks. Pre-hot-standby, there was no reason for them to be careful about locking.

On the other hand, if that were the cause, you'd expect the symptoms to be a bit more variable...
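To make the "page contents in flux" hypothesis concrete, here is a deterministic toy model, not PostgreSQL code. It assumes a hypothetical redo routine, `replay_insert`, that changes a page in two steps, and it models the buffer lock only as a flag deciding whether a reader may look at the page between those steps. All names and the page layout are invented for illustration.

```python
def replay_insert(page, payload):
    """Hypothetical two-step redo routine (an illustration, not PostgreSQL code)."""
    page["line_pointers"].append(len(page["tuples"]))  # step 1: pointer is published
    yield                                              # a concurrent reader may run here
    page["tuples"].append(payload)                     # step 2: tuple data arrives


def read_page(page):
    """Follow every line pointer; None marks a pointer with no tuple behind it."""
    tuples = page["tuples"]
    return [tuples[i] if i < len(tuples) else None for i in page["line_pointers"]]


def run(hold_exclusive_lock):
    """Return what a reader sees midway through replay and after it completes."""
    page = {"line_pointers": [], "tuples": []}
    redo = replay_insert(page, "row-1")
    next(redo)  # replay is now paused between its two steps
    # If replay held an exclusive buffer lock, the reader would block here
    # rather than inspect the page, so it observes nothing midway.
    seen_midway = None if hold_exclusive_lock else read_page(page)
    for _ in redo:  # let replay finish its second step
        pass
    return seen_midway, read_page(page)
```

Without the lock, the midway read finds a line pointer that leads nowhere, which is the kind of inconsistency a backend could trip over. With the lock, the reader only ever sees the finished page.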
#26
> No, I wasn't thinking about a tuple descriptor mismatch. I was imagining that the page contents themselves might be in flux while we're trying to read from it.

It would be nice to get a dump of what PostgreSQL thought the entire block looked like at the time the crash happened. That information is presumably already in the core dump, but I'm not sure if there's a nice way to extract it using gdb.
#27
Robert Haas <robertmhaas (AT) gmail (DOT) com> writes:
> No, I wasn't thinking about a tuple descriptor mismatch. I was imagining that the page contents themselves might be in flux while we're trying to read from it.
>
> It would be nice to get a dump of what PostgreSQL thought the entire block looked like at the time the crash happened. That information is presumably already in the core dump, but I'm not sure if there's a nice way to extract it using gdb.

It probably would be possible to get the page out of the dump, but I'd be really surprised if that proved much. By the time the crash-dump-making code gets around to examining the shared memory, the other process that's hypothetically changing the page will have done its work and moved on. A crash in process X doesn't freeze execution in process Y, at least not in any Unixoid system I've ever heard of.
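If the block were pulled out of the core anyway (gdb's `dump binary memory` can write a memory range to a file once the buffer's address is known), a small script could at least check whether the page header is self-consistent. The sketch below is only an illustration under stated assumptions: the default 8192-byte block size, a little-endian machine, the 24-byte page header layout from `bufpage.h`, and the usual GCC bit-field packing for line pointers. The function names are made up here. In releases of this era the 16-bit field after `pd_lsn` is `pd_tli`; later releases store a checksum there.

```python
import struct

BLCKSZ = 8192  # assumed block size (the PostgreSQL default)
# pd_lsn (two uint32), then six uint16 fields, then pd_prune_xid (uint32): 24 bytes
HEADER = struct.Struct("<IIHHHHHHI")


def parse_page_header(block):
    """Decode the fixed-size page header from a raw block."""
    (lsn_hi, lsn_lo, tli_or_checksum, flags, lower, upper,
     special, size_version, prune_xid) = HEADER.unpack_from(block)
    return {
        "pd_lsn": "%X/%X" % (lsn_hi, lsn_lo),
        "pd_tli_or_checksum": tli_or_checksum,
        "pd_flags": flags,
        "pd_lower": lower,
        "pd_upper": upper,
        "pd_special": special,
        "page_size": size_version & 0xFF00,
        "layout_version": size_version & 0x00FF,
        "pd_prune_xid": prune_xid,
    }


def sanity_problems(hdr):
    """Return human-readable complaints about an implausible header."""
    problems = []
    if hdr["page_size"] != BLCKSZ:
        problems.append("unexpected page size %d" % hdr["page_size"])
    if not (HEADER.size <= hdr["pd_lower"] <= hdr["pd_upper"]
            <= hdr["pd_special"] <= BLCKSZ):
        problems.append("pd_lower/pd_upper/pd_special out of order")
    return problems


def line_pointers(block, hdr):
    """Decode line pointers as (lp_off, lp_flags, lp_len) tuples.

    Assumes the 15/2/15-bit packing seen on little-endian GCC builds.
    """
    count = (hdr["pd_lower"] - HEADER.size) // 4
    result = []
    for i in range(count):
        (raw,) = struct.unpack_from("<I", block, HEADER.size + 4 * i)
        result.append((raw & 0x7FFF, (raw >> 15) & 0x3, (raw >> 17) & 0x7FFF))
    return result
```

A clean result would not rule out the race, for the reason given above: by the time the core is written, the other process has most likely finished changing the page.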