dbTalk Databases Forums  

[BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids

mailing.database.pgsql-bugs mailing.database.pgsql-bugs


Discuss [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids in the mailing.database.pgsql-bugs forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
valgog@gmail.com
 
Posts: n/a

Default [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-22-2012 , 04:36 PM






The following bug has been logged on the website:

Bug reference: 6661
Logged by: Valentine Gogichashvili
Email address: valgog (AT) gmail (DOT) com
PostgreSQL version: 9.0.7
Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41sque
Description:

Hello,

today when trying to restore a replication database I got the following
error:

....
2012-05-22 21:20:24.823 CEST,,,23804,,4fbbe69e.5cfc,124,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001E"" from
archive",,,,,,,,,""
2012-05-22 21:20:24.954 CEST,,,23804,,4fbbe69e.5cfc,125,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001F"" from
archive",,,,,,,,,""
2012-05-22 21:20:26.374 CEST,,,23804,,4fbbe69e.5cfc,126,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000020"" from
archive",,,,,,,,,""
2012-05-22 21:20:27.238 CEST,,,23804,,4fbbe69e.5cfc,127,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000021"" from
archive",,,,,,,,,""
2012-05-22 21:20:27.443 CEST,,,23804,,4fbbe69e.5cfc,128,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"consistent recovery state reached at
1C9C/21318C20",,,,,,,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,129,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=916)
[0]=3674726497 [1]=3674727041 [2]=3674727042 [3]=3674727128 ",,,,,"xlog redo
running xacts: nextXid 3674728633 latestCompletedXid 3674728632
oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128
3674726497",,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22 21:18:54
CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633
latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts:
3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""


This happened after replaying 124 WAL files from archive.

Full log is here: https://gist.github.com/ff51ccf506f5e522e271

I tried to start the database 3 times, with the same result.

Additionally I removed all existing WAL files from pg_xlog, and tried to
start again, but got the same result this time as well (though probably I
should have tried the other way round, as WAL files are anyway taken from
the archive if there are there... anyway, just want to describe everything I
did)

The database on that machine is running as a read-only hot-standby machine
for running long running statistical queries, that we do not want to run on
the production system. It has significantly bigger shared_buffers = 48GB (
in comparison to 8GB on the master ).

Sometimes there is really a lot happening on the master machine, and the
replication machine does not seem to be able to catch up with the master.
This usually happens, when the replay is paused because of a long running
query, but after the conflicting query is gone and the replication delay is
bigger then 2-3 Gigabytes, it seems not to be able catch up any longer, it
looks like a snowball effect. At this time it has a 100% of one CPU on
recovery process with practically no IOWait (so for me it seems like a CPU
bottleneck rather then IO one, that is rather confusing).

Here is the replication delay graph
https://lh4.googleusercontent.com/-U...tion_delay.png

In the past, if the replication delay was bigger then 20GB, I was stopping
the server, restarting it so, that is replays the WAL files from the archive
and subjectively it seemed, like the replay was faster, if shared_buffers
was set to much lower value (4GB).

So today, I did the same, but got an error "out-of-order XID insertion in
KnownAssignedXids" and could not restore the machine. Now, while rebuilding
hot-standby of that 250GB database I have some time to write this report.

More information about the affected machine:

Intel(R) Xeon(R) CPU E5540 @ 2.53GHz (16 Cores)

cat /proc/version
Linux version 2.6.32-5-amd64 (Debian 2.6.32-41squeeze2) (dannf (AT) debian (DOT) org)
(gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Thu Mar 22 17:26:33 UTC 2012

free -m
total used free shared buffers cached
Mem: 129192 128858 333 0 31 124969
-/+ buffers/cache: 3857 125334
Swap: 1952 1 1951


df -h
/dev/mapper/vg00-data
54G 1.3G 52G 3% /data
/dev/mapper/vg01-data1
1.7T 373G 1.3T 23% /data1





--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply With Quote
  #2  
Old   
Valentine Gogichashvili
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-22-2012 , 07:11 PM






Oh, that is not good, I am getting the same error after rebuilding
hot-standby from master.


2012-05-23 02:08:36.960 CEST,,,21080,,4fbc2a84.5258,1,,2012-05-23 02:08:36
CEST,,0,LOG,00000,"database system was interrupted while in recovery at log
time 2012-05-22 21:33:49 CEST",,"If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.",,,,,,,""
2012-05-23 02:08:36.967 CEST,,,21080,,4fbc2a84.5258,2,,2012-05-23 02:08:36
CEST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2012-05-23 02:08:36.987 CEST,,,21080,,4fbc2a84.5258,3,,2012-05-23 02:08:36
CEST,,0,LOG,00000,"restored log file ""0000000200001CA2000000BE"" from
archive",,,,,,,,,""
2012-05-23 02:08:36.987 CEST,,,21080,,4fbc2a84.5258,4,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"redo starts at 1CA2/BE38F0E0",,,,,,,,,""
2012-05-23 02:08:36.997 CEST,,,21080,,4fbc2a84.5258,5,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"file ""pg_multixact/offsets/0BF1"" doesn't exist,
reading as zeroes",,,,,"xlog redo create multixact 200389058 offset
473492754: 3675639270 3675639273",,,,""
2012-05-23 02:08:36.997 CEST,,,21080,,4fbc2a84.5258,6,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"file ""pg_multixact/members/1C38"" doesn't exist,
reading as zeroes",,,,,"xlog redo create multixact 200389058 offset
473492754: 3675639270 3675639273",,,,""
2012-05-23 02:08:37.096 CEST,,,21080,,4fbc2a84.5258,7,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000BF"" from
archive",,,,,,,,,""
2012-05-23 02:08:37.213 CEST,,,21080,,4fbc2a84.5258,8,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C0"" from
archive",,,,,,,,,""
2012-05-23 02:08:37.393 CEST,,,21080,,4fbc2a84.5258,9,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C1"" from
archive",,,,,,,,,""
2012-05-23 02:08:37.568 CEST,,,21080,,4fbc2a84.5258,10,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C2"" from
archive",,,,,,,,,""
2012-05-23 02:08:37.932 CEST,,,21080,,4fbc2a84.5258,11,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C3"" from
archive",,,,,,,,,""
2012-05-23 02:08:38.297 CEST,,,21080,,4fbc2a84.5258,12,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C4"" from
archive",,,,,,,,,""

.....

2012-05-23 02:08:41.617 CEST,,,21080,,4fbc2a84.5258,32,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000D8"" from
archive",,,,,,,,,""
2012-05-23 02:08:41.753 CEST,,,21080,,4fbc2a84.5258,33,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000D9"" from
archive",,,,,,,,,""
2012-05-23 02:08:41.881 CEST,,,21080,,4fbc2a84.5258,34,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000DA"" from
archive",,,,,,,,,""
2012-05-23 02:08:41.888 CEST,,,21080,,4fbc2a84.5258,35,,2012-05-23 02:08:36
CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=873)
[0]=3675654963 [1]=3675655620 [2]=3675655621 [3]=3675655814 ",,,,,"xlog
redo running xacts: nextXid 3675658265 latestCompletedXid 3675658264
oldestRunningXid 3675654963; 4 xacts: 3675655620 3675655621 3675655814
3675654963",,,,""
2012-05-23 02:08:41.888 CEST,,,21080,,4fbc2a84.5258,36,,2012-05-23 02:08:36
CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3675658265
latestCompletedXid 3675658264 oldestRunningXid 3675654963; 4 xacts:
3675655620 3675655621 3675655814 3675654963",,,,""
2012-05-23 02:08:41.909 CEST,,,21078,,4fbc2a84.5256,1,,2012-05-23 02:08:36
CEST,,0,LOG,00000,"startup process (PID 21080) exited with exit code
1",,,,,,,,,""
2012-05-23 02:08:41.909 CEST,,,21078,,4fbc2a84.5256,2,,2012-05-23 02:08:36
CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""


On Tue, May 22, 2012 at 11:36 PM, <valgog (AT) gmail (DOT) com> wrote:

Quote:
The following bug has been logged on the website:

Bug reference: 6661
Logged by: Valentine Gogichashvili
Email address: valgog (AT) gmail (DOT) com
PostgreSQL version: 9.0.7
Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41sque
Description:

Hello,

today when trying to restore a replication database I got the following
error:

...
2012-05-22 21:20:24.823 CEST,,,23804,,4fbbe69e.5cfc,124,,2012-05-22
21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001E"" from
archive",,,,,,,,,""
2012-05-22 21:20:24.954 CEST,,,23804,,4fbbe69e.5cfc,125,,2012-05-22
21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001F"" from
archive",,,,,,,,,""
2012-05-22 21:20:26.374 CEST,,,23804,,4fbbe69e.5cfc,126,,2012-05-22
21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000020"" from
archive",,,,,,,,,""
2012-05-22 21:20:27.238 CEST,,,23804,,4fbbe69e.5cfc,127,,2012-05-22
21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000021"" from
archive",,,,,,,,,""
2012-05-22 21:20:27.443 CEST,,,23804,,4fbbe69e.5cfc,128,,2012-05-22
21:18:54
CEST,1/0,0,LOG,00000,"consistent recovery state reached at
1C9C/21318C20",,,,,,,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,129,,2012-05-22
21:18:54
CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=916)
[0]=3674726497 [1]=3674727041 [2]=3674727042 [3]=3674727128 ",,,,,"xlog
redo
running xacts: nextXid 3674728633 latestCompletedXid 3674728632
oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128
3674726497",,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22
21:18:54
CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633
latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts:
3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""



Reply With Quote
  #3  
Old   
Andres Freund
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-23-2012 , 01:49 AM



Hi Valentine,

On Tuesday, May 22, 2012 11:36:23 PM valgog (AT) gmail (DOT) com wrote:
Quote:
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,129,,2012-05-22
21:18:54 CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=916)
[0]=3674726497 [1]=3674727041 [2]=3674727042 [3]=3674727128 ",,,,,"xlog
redo running xacts: nextXid 3674728633 latestCompletedXid 3674728632
oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128
3674726497",,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22
21:18:54 CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633
latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts:
3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
Could you provide a log with log_min_messages=DEBUG4? This will be quite long
though...

Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply With Quote
  #4  
Old   
Valentine Gogichashvili
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-23-2012 , 05:30 AM



Quote:
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22
21:18:54 CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633
latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts:
3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22
21:18:54
CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22
21:18:54
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
Could you provide a log with log_min_messages=DEBUG4? This will be quite
long
though...

Andres

Hm... I have the data directory, but not the WAL files any more. Forgot to
back them up

Reply With Quote
  #5  
Old   
Andres Freund
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-23-2012 , 11:27 AM



On Wednesday, May 23, 2012 12:30:31 PM Valentine Gogichashvili wrote:

Quote:
CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22
21:18:54 CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
Could you provide a log with log_min_messages=DEBUG4? This will be quite
long though...
Hm... I have the data directory, but not the WAL files any more. Forgot to
back them up
Too bad, I think without more information we won't be able to fix this. I feel
bad for it but I actually you run into it again

Here is what I found strange, should somebody ever come back to the issue:
- the standby was in a consistent state
- according to the error annotation we were processing a XLOG_RUNNING_XACTS
- ProcArrayApplyRecoveryInfo with STANDBY_SNAPSHOT_READY only does
ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
StandbyReleaseOldLocks(running->xcnt, running->xids);
before short-circuiting. How comes we call KnownAssignedXidsAdd in that case?

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply With Quote
  #6  
Old   
Andres Freund
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-23-2012 , 11:33 AM



On Wednesday, May 23, 2012 06:27:48 PM Andres Freund wrote:
Quote:
On Wednesday, May 23, 2012 12:30:31 PM Valentine Gogichashvili wrote:
CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22
21:18:54 CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""

Could you provide a log with log_min_messages=DEBUG4? This will be
quite long though...

Hm... I have the data directory, but not the WAL files any more. Forgot
to back them up

Too bad, I think without more information we won't be able to fix this. I
feel bad for it but I actually you run into it again

Here is what I found strange, should somebody ever come back to the issue:
- the standby was in a consistent state
- according to the error annotation we were processing a XLOG_RUNNING_XACTS
- ProcArrayApplyRecoveryInfo with STANDBY_SNAPSHOT_READY only does
ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
StandbyReleaseOldLocks(running->xcnt, running->xids);
before short-circuiting. How comes we call KnownAssignedXidsAdd in that
case?
Looking over the code again reachedConsistency doesn't imply
STANDBY_SNAPSHOT_READY.

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply With Quote
  #7  
Old   
Andres Freund
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-23-2012 , 11:40 AM



On Wednesday, May 23, 2012 12:30:31 PM Valentine Gogichashvili wrote:
Quote:
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22
21:18:54 CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633
latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts:
3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22

21:18:54

CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22

21:18:54

CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""

Could you provide a log with log_min_messages=DEBUG4? This will be quite
long
though...

Andres

Hm... I have the data directory, but not the WAL files any more. Forgot to
back them up
Do you have both, longrunning transactions and transactions with many
subtransactions?

Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply With Quote
  #8  
Old   
Valentine Gogichashvili
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 05-23-2012 , 12:13 PM



Hello Andres,

Do you have both, longrunning transactions and transactions with many
Quote:
subtransactions?

Yes long running plane SQL queries as well as many short SQL queries
running one after another in one long spanning transaction (we are working
on eliminating this kind of behavior, but it is some of the analysis tools
opening a transaction and then fetching lots of data with small queries).

Also many relatively short running calls (up to 500ms) to quite complicated
read-only stored procedures is being done constantly on that machine.

Regards,

-- Valentine

Reply With Quote
  #9  
Old   
Valentine Gogichashvili
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 06-07-2012 , 05:44 AM



Hallo again,

I have the situation again, one of 3 slaves was slow to play all the WAL
files and being about 10GB late it crashed with the same error again.

I collected DEBUG4 output in this time:
https://docs.google.com/open?id=0B2N...jNDbU0xQ3lvWms

I hope it will be helpful,

Regards,

-- Valentine Gogichashvili


On Wed, May 23, 2012 at 10:51 PM, Valentine Gogichashvili
<valgog (AT) gmail (DOT) com>wrote:

Quote:
But no subtransactions with writes (plpgsql + EXCEPTION also counts if does
DML)?


On the master, yes. Practically all the stored procedures there are
catching exceptions in their own bodies and returning result codes back.

Regards,

-- Valentine

Reply With Quote
  #10  
Old   
Andres Freund
 
Posts: n/a

Default Re: [BUGS] BUG #6661: out-of-order XID insertion in KnownAssignedXids - 06-07-2012 , 08:58 AM



Hi,

On Thursday, June 07, 2012 12:44:08 PM Valentine Gogichashvili wrote:
Quote:
I have the situation again, one of 3 slaves was slow to play all the WAL
files and being about 10GB late it crashed with the same error again.

I collected DEBUG4 output in this time:
https://docs.google.com/open?id=0B2N...jNDbU0xQ3lvWms
Ok, I stared at this some time and I think I see what the problem is. Some log
excerpts that lead my reasoning:

2012-06-06 15:35:51.954 "recovery snapshot waiting for non-overflowed snapshot
or until oldest active xid on standby is at least 3730302193 (now
3730301766)",,,,,"xlog redo running xacts: nextXid 3730302194
latestCompletedXid 3730302179 oldestRunningXid 3730301766; 68 xacts: ...
subxid ovf",,,,""
So we found a overflowed snapshot after we already are in
STANDBY_SNAPSHOT_PENDING. This implies we have seen an overflowed snapshot
before.
In STANDBY_SNAPSHOT_PENDING we start recording known assigned xids to build
our snapshot incrementally.

2012-06-06 15:35:51.954 record known xact 3730301766 latestObservedXid
3730302197
2012-06-06 15:37:49.580 record known xact 3730316333 latestObservedXid
3730316332 3730316517
Ok, we started filling the KnownAssignedXid machinery. That means procArray-
Quote:
numKnownAssignedXids != 0.

2012-06-06 15:37:54.401 FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3730316518
latestCompletedXid 3730316517 oldestRunningXid 3730316333; 4 xacts: 3730316429
3730316508 3730316333 3730316430",,,,""
the "xlog redo running xacts" bits tells us we have got a snapshot that is
*not* overflowed.
That means in ProcArrayAppylRecoveryInfo in the following part:
if (standbyState == STANDBY_SNAPSHOT_PENDING)
{
/*
* If the snapshot isn't overflowed or if its empty we can
* reset our pending state and use this snapshot instead.
*/
if (!running->subxid_overflow || running->xcnt == 0)
{
standbyState = STANDBY_INITIALIZED;
}
else
{
....
}
}
we reset our state back to STANDBY_INITIALIZED. And fall back to computing our
state at once in one piece instead of incrementally as we started doing for
SNAPSHOT_PENDING.
The problem is is know that that code assumes we run without any previous
recorded xids. This is even formalized in an assert which we don't hit because
were running without Assert:
....
Assert(standbyState == STANDBY_INITIALIZED);
....
Assert(procArray->numKnownAssignedXids == 0);
....
after that we start adding all currently running xids from the snapshot to the
KnownAssigned machinery. They are already recorded though, so we fail in
KnownAssignedXidsAdd with the OPs error.

The simplest fix for that seems to be to simply reset the KnownAssignedXids
state in the above branch. Any arguments against that?

This seems to have been broken in commit
10b7c686e52a6d1bb10194ebf9331ef06f044d46

Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs (AT) postgresql (DOT) org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.