#1
#2
A huge thanks to Conrad Irwin of Rapportive for furnishing virtually all the details of this bug report.

The occurrence rate is somewhere around one in tens of millions of queries.
#3
The way that I'd personally proceed to investigate it would probably be to change the "invalid memory alloc request size" errors (in src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from ERROR to PANIC so that they'll provoke a core dump, and then use gdb to get a stack trace, which would provide at least a little more information about what happened. However, if you are only able to reproduce it on a production server, you might not like that approach. Perhaps you can set up an extra standby that's only there for testing, so you don't mind if it crashes?
#4
ERROR: invalid memory alloc request size 18446744073709551613

At least once, a hot standby was promoted to a primary and the errors seemed to stop, but they then reappeared on a newly provisioned standby.
#5
#6
Hello,

We upgraded to postgres 9.1.2 two weeks ago, and we are also experiencing an issue that seems very similar to the one reported as bug 6200. We see approximately 2 dozen alloc errors per day across 3 slaves, and we are getting one segfault approximately every 3 days. We did not experience this issue before our upgrade (we were on version 8.4, and used skytools for replication).

We are attempting to get a core dump on segfault (our last attempt did not work due to a config issue for the core dump). We're also attempting to repro the alloc errors on a test setup, but it seems like we may need quite a bit of load to trigger the issue. We're not certain that the alloc issues and the segfaults are "the same issue", but it seems likely, since the OP for bug 6200 sees the same behavior.

We have seen no issues on the master; all alloc errors and segfaults have been on the slaves. We've seen the alloc errors on a few different tables, but most frequently on logins. Rows are added to the logins table one by one, and updates generally happen one row at a time. The table is pretty basic; it looks like this:

    CREATE TABLE logins (
      login_id bigserial NOT NULL,
      <snip - a bunch of columns>
      CONSTRAINT logins_pkey PRIMARY KEY (login_id),
      <snip - some other constraints...>
    )
    WITH (
      FILLFACTOR=80,
      OIDS=FALSE
    );

The queries that trigger the alloc error on this table look like this (we use hibernate, hence the funny underscoring):

    select login0_.login_id as login1_468_0_, l... from logins login0_ where login0_.login_id=$1

The alloc error in the logs looks like this:

    -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR: invalid memory alloc request size 18446744073709551613

The alloc error is nearly always for size 18446744073709551613, though we have seen one time where it was a different amount...
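To check how often each requested size shows up across the slaves' logs, something like the following could be used. It is only a rough sketch: the function name `tally_alloc_sizes` is made up for illustration, the regular expression keys on the error message text alone (so it should not depend on the particular log_line_prefix in use), and the second sample line is invented.

```python
import re
from collections import Counter

# Match only the message text, so the pattern does not rely on the log prefix.
ALLOC_ERROR = re.compile(r"ERROR:\s+invalid memory alloc request size (\d+)")


def tally_alloc_sizes(lines):
    """Count how often each requested allocation size appears in log lines."""
    counts = Counter()
    for line in lines:
        match = ALLOC_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# First line is modelled on the log excerpt above; the second is invented.
sample = [
    "2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:  "
    "invalid memory alloc request size 18446744073709551613",
    "2012-01-12 17:33:47 PST [16034]: [8-1] [24/25934] LOG:  unrelated line",
]

if __name__ == "__main__":
    for size, count in tally_alloc_sizes(sample).most_common():
        print(size, count)
```

Feeding it the lines of each day's log file (for example via `open(path)`) would show whether the odd "different amount" recurs or was a one-off.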
#7
On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey <bridget.frey (AT) redfin (DOT) com> wrote:
> [...]
> The alloc error is nearly always for size 18446744073709551613 - though we have seen one time where it was a different amount...

Hmm, that number in hex works out to 0xfffffffffffffffd, which makes it sound an awful lot like the system (for some unknown reason) attempted to allocate -3 bytes of memory. I've seen something like this once before on a customer system running a modified version of PostgreSQL. In that case, the problem turned out to be page corruption. Circumstances didn't permit determination of the root cause of the page corruption, however, nor was I able to figure out exactly how the corruption I saw resulted in an allocation request like this.

It would be nice to figure out where in the code this is happening and put in a higher-level guard so that we get a better error message. You might want to compile a modified PostgreSQL executable that puts an extremely long sleep (like a year) just before this error is reported. Then, when the system hangs at that point, you can attach a debugger and pull a stack backtrace. Or you could insert an abort() at that point in the code and get a backtrace from the core dump.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
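The "-3 bytes" reading can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 64-bit size type; the helper name `as_signed64` is made up for illustration, and `MAX_ALLOC_SIZE` is my recollection of PostgreSQL's MaxAllocSize limit (1 GB minus one byte), so treat that constant as an assumption rather than something taken from this thread.

```python
import struct

# Assumed value of PostgreSQL's MaxAllocSize (1 GB - 1); not stated in the thread.
MAX_ALLOC_SIZE = 0x3FFFFFFF


def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as signed (two's complement)."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


requested = 18446744073709551613

if __name__ == "__main__":
    print(hex(requested))               # 0xfffffffffffffffd
    print(requested == 2**64 - 3)       # True
    print(as_signed64(requested))       # -3
    print(requested <= MAX_ALLOC_SIZE)  # False, so the request is rejected
```

In other words, a length of -3 computed somewhere upstream, once passed as an unsigned size, becomes the huge number seen in the error message.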
#8
Thanks for the info - that's very helpful. We had also noted that the alloc seems to be -3 bytes. We have run pg_check and it found no instances of corruption. We've also replayed queries that have failed, and have never been able to get the same query to fail twice. In the case you investigated, was there permanent page corruption - e.g. could you run the same query over and over and get the same result?

It really does seem like this is an issue either in Hot Standby or very closely related to that feature, where there is temporary corruption of a btree index that then disappears. Our master is not experiencing any malloc issues, while the 3 slaves get about a dozen per day, despite similar workloads. We haven't had a slave segfault since we set it up to produce a core dump, but we're expecting to have that within the next few days (assuming we'll continue to get a segfault every 3-4 days). We're also planning to set up one slave that will panic when it gets a malloc issue, as you (and other posters on 6200) had suggested.

Thanks again for the help, and we'll keep you posted as we learn more...
#9
On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey <bridget.frey (AT) redfin (DOT) com> wrote:
> [...]
> In the case you investigated, was there permanent page corruption - e.g. you could run the same query over and over and get the same result?

Yes. I observed that the infomask bits on several tuples had somehow been overwritten by nonsense. I am not sure whether there were other kinds of corruption as well - I suspect probably so - but that's the only one I saw with my own eyes, courtesy of pg_filedump.

> It really does seem like this is an issue either in Hot Standby or very closely related to that feature, where there is temporary corruption of a btree index that then disappears.
> [...]

The case I investigated involved corruption on the master, and I think it predated Hot Standby. However, the symptom is generic enough that it seems quite possible that there's more than one way for it to happen. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
#10
We have the (5GB) core file, and are happy to do any more forensics anyone can advise.