dbTalk Databases Forums  

[BUGS] PosgreSQL is crashing with a signal 11 - Bug?

mailing.database.pgsql-bugs mailing.database.pgsql-bugs


Discuss [BUGS] PosgreSQL is crashing with a signal 11 - Bug? in the mailing.database.pgsql-bugs forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Rafael Martinez Guerrero
 
Posts: n/a

Default [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-07-2004 , 12:30 PM






--=-zKSBTfyKK9ZZgU+4N8OZ
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Hello

We have a problem with one of our central databases and we need your
help to find a solution.

----------------------
* Description:=20
----------------------
"PosgreSQL is crashing with a signal 11 and we do not think is a
hardware problem"

Since last week, we are having a big problem with one of our postgreSQL
installations. The database is not so big but it is used intensely with
different jobs running parallel transactions.

The first time the database crashed, we had been running 7.3.5 for a
long time without problems. Because the signal 11, we thought it was a
problem with defective memory, we changed RAM in the server and restored
the database from last backup. The memory was defective and we thought
we found the problem to our signal 11.

After some hours the database crashed again with the samme error. We did
not take any chances and moved the database to a new server. This did
not help and we got the samme problem after some hours. We updated to
7.3.7 hoping for the best but it did not help either.

Today it crashed again but this time we have logged more information
from the crash and we hope you can help us to find a solution to this
problem.=20

Below, some relevant information about our system. Please do not
hesitate to ask for more information if you need it.

------------------------
* OS/Machine/Filesystem:=20
------------------------
- Red Hat Enterprise Linux WS release 3 (Taroon Update 3)
kernel 2.4.21-15.0.3.ELsmp

- Dell 2650:=20
2 x Intel(R)Xeon(TM)CPU 2.40GHz
2GB RAM
PERCRAID Mirror 2 x 73GB

- LVM - ext3

-----------------------------
* Version / compilator / libc
-----------------------------
- PostgreSQL 7.3.7

- Options given to 'configure' script when PostgreSQL was built:
'--prefix=3D/local/opt/postgresql' '--mandir=3D/local/share/man'
'--with-openssl=3D/local' '--with-perl' '--with-java' 'CC=3Dcc-wrapper'
'LDFLAGS=3D-L/local/lib'

- gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-42)
- glibc-2.3.2-95.27 (RHEL3)

----------------------
* Configuration / DBSize
----------------------

- Relevant parameters that have been changed by us:

postgresql.conf:
max_connections =3D 600
superuser_reserved_connections =3D 2

shared_buffers =3D 8192
max_fsm_relations =3D 1000
max_fsm_pages =3D 20000
wal_buffers =3D 64=20

sort_mem =3D 2048
vacuum_mem =3D 32768=20
fsync =3D true

effective_cache_size =3D 131072=20
random_page_cost =3D 2

stats_start_collector =3D true
stats_command_string =3D true
stats_row_level =3D true
stats_block_level =3D true

autocommit=3Dfalse


/etc/sysctl.conf:
kernel.shmall =3D 134217728
kernel.shmmax =3D 134217728


-bash-2.05b# du -h
4.2M ./base/1
3.6M ./base/16975
4.0K ./base/63684339/pgsql_tmp
2.3G ./base/63684339
2.3G ./base
168K ./global
129M ./pg_xlog
8.6M ./pg_clog
2.4G .

-----------------------------------------------------------
* Information from CORE dump we got without --enable-debug.
-----------------------------------------------------------

This GDB was configured as "i386-redhat-linux-gnu"...(no debugging
symbols found)...Using host libthread_db library
"/lib/tls/libthread_db.so.1".

Program terminated with signal 11, Segmentation fault.

[......]
(gdb) bt
#0 0xb734d07c in memcpy () from /lib/tls/libc.so.6
#1 0x0806bba8 in DataFill ()
#2 0x0806c3ee in heap_formtuple ()
#3 0x080d1af1 in ExecTargetList ()
#4 0x080d1cdb in ExecProject ()
#5 0x080d1d7d in ExecScan ()
#6 0x080d5b5e in ExecIndexScan ()
#7 0x080cfd91 in ExecProcNode ()
#8 0x082fcd08 in ?? ()
#9 0x00000000 in ?? ()
#10 0x082ff120 in ?? ()
#11 0x00000000 in ?? ()
#12 0x082f9a78 in ?? ()
#13 0xbfff8028 in ?? ()
#14 0x080d6e5a in ExecMergeJoin ()
Previous frame inner to this frame (corrupt stack?)
---------------------------------------------------------

-----------------------------------------------------------
* Information from CORE dump we got with --enable-debug. We have
compiled a new version of postgres and run it through gdb with the core
dump we had/got from postgres without --enable-debug.=20
-----------------------------------------------------------

#0 0xb734d07c in memcpy () from /lib/tls/libc.so.6

#1 0x0806bba8 in DataFill (data=3D0xb7489000 <Address 0xb7489000 out of
bounds>, tupleDesc=3D0x82fd554, value=3D0x82fd550, nulls=3D0xbfff7ec0 " n =
",
infomask=3D0x836e904c, bit=3D0x836e904f "\003\f") at heaptuple.c:139

#2 0x0806c3ee in heap_formtuple (tupleDescriptor=3D0x82fd620,
value=3D0x82fd550, nulls=3D0xbfff7ec0 " n ") at heaptuple.c:623

#3 0x080d1af1 in ExecTargetList (targetlist=3D0x82fa250, nodomains=3D5,
targettype=3D0x82fd620, values=3D0x82fd550, econtext=3D0x82fd4e0,
isDone=3D0xbfff7f68) at execQual.c:2230

#4 0x080d1cdb in ExecScan (node=3D0x82fd528, accessMtd=3D0xbfff7f68) at
execScan.c:49

#5 0x080d1d7d in ExecScan (node=3D0x82fa140, accessMtd=3D0x80d58d4
<IndexNext+24>) at execScan.c:146

#6 0x080d5b5e in ExecIndexReScan (node=3D0x82fa140, exprCtxt=3D0xb72117d,
parent=3D0x0) at nodeIndexscan.c:284

#7 0x080d58d4 in IndexNext (node=3D0x0) at nodeIndexscan.c:87
Previous frame inner to this frame (corrupt stack?)

--------------------------------
* Relevant errors in the logfile:
--------------------------------
2004-09-07 15:59:08 [8100] LOG: statement: FETCH 1 FROM
"PgSQL_08456DE4"
2004-09-07 15:59:39 [7370] DEBUG: reaping dead processes
2004-09-07 15:59:39 [7370] DEBUG: child process (pid 8100) was
terminated by signal 11
2004-09-07 15:59:39 [7370] LOG: server process (pid 8100) was
terminated by signal 11
2004-09-07 15:59:39 [7370] LOG: terminating any other active server
processes
2004-09-07 15:59:39 [7370] DEBUG: CleanupProc: sending SIGQUIT to
process 7539
2004-09-07 15:59:39 [7370] DEBUG: CleanupProc: sending SIGQUIT to
process 7468
2004-09-07 15:59:39 [7370] DEBUG: CleanupProc: sending SIGQUIT to
process 7423
2004-09-07 15:59:39 [7468] WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
2004-09-07 15:59:39 [7423] WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
2004-09-07 15:59:39 [7539] WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
2004-09-07 15:59:39 [7370] DEBUG: reaping dead processes
2004-09-07 15:59:39 [7370] DEBUG: child process (pid 7423) exited
with exit code 1
2004-09-07 15:59:39 [7370] DEBUG: reaping dead processes
2004-09-07 15:59:39 [7370] DEBUG: child process (pid 7468) exited
with exit code 1
2004-09-07 15:59:39 [7370] DEBUG: reaping dead processes
2004-09-07 15:59:39 [7370] DEBUG: child process (pid 7539) exited
with exit code 1
2004-09-07 15:59:39 [7370] LOG: all server processes terminated;
reinitializing shared memory and semaphores
2004-09-07 15:59:39 [7370] DEBUG: shmem_exit(0)
2004-09-07 15:59:39 [7370] DEBUG: invoking
IpcMemoryCreate(size=3D78536704)
------------------------------------------------------------------

-----------------------------------------------------------------------
* The child process (pid 8100) that crashed with signal 11 logged these
statements (We have a complete log for pid 8100, but it has a size of
2MB):
-----------------------------------------------------------------------

2004-09-07 15:58:50 [8100] LOG: statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100] LOG: statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100] LOG: statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100] LOG: statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100] LOG: statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100] LOG: statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100] LOG: statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100] LOG: statement: select version()
2004-09-07 15:58:50 [8100] LOG: statement: select version()
2004-09-07 15:58:50 [8100] LOG: statement: select version()
2004-09-07 15:58:50 [8100] LOG: statement: BEGIN WORK
2004-09-07 15:58:50 [8100] LOG: statement: BEGIN WORK
2004-09-07 15:58:50 [8100] LOG: statement: BEGIN WORK
2004-09-07 15:58:50 [8100] LOG: statement: BEGIN WORK
2004-09-07 15:58:50 [8100] LOG: statement: SET CLIENT_ENCODING TO
'ISO_8859_1'
2004-09-07 15:58:50 [8100] LOG: statement: SET CLIENT_ENCODING TO
'ISO_8859_1'
2004-09-07 15:58:50 [8100] LOG: statement: SET CLIENT_ENCODING TO
'ISO_8859_1'
2004-09-07 15:58:50 [8100] LOG: statement: COMMIT WORK
2004-09-07 15:58:50 [8100] LOG: statement: COMMIT WORK
2004-09-07 15:58:50 [8100] LOG: statement: COMMIT WORK
2004-09-07 15:58:50 [8100] LOG: statement: BEGIN WORK
2004-09-07 15:58:50 [8100] LOG: statement: BEGIN WORK
2004-09-07 15:58:50 [8100] LOG: statement: BEGIN WORK
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_user@ifi'
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_user@ifi'
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_user@ifi'
2004-09-07 15:58:50 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1700
2004-09-07 15:58:50 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1700
2004-09-07 15:58:50 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1700
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_fg@ifi'
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_fg@ifi'
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_fg@ifi'
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code , shell FROM posix_shell_code
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code , shell FROM posix_shell_code
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code , shell FROM posix_shell_code
2004-09-07 15:58:50 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1043
2004-09-07 15:58:50 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1043
2004-09-07 15:58:50 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1043
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT count ( ah . account_id ) , ah . spread , di . disk_id
, di . host_id , di . path FROM disk_info di LEFT JOIN account_home ah
ON di . disk_id =3D ah . disk_id AND ah . spread =3D 105 LEFT JOIN
account_info ai ON ah . account_id =3D ai . account_id GROUP BY di .
disk_id , di . host_id , di . path , ah . spread
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT count ( ah . account_id ) , ah . spread , di . disk_id
, di . host_id , di . path FROM disk_info di LEFT JOIN account_home ah
ON di . disk_id =3D ah . disk_id AND ah . spread =3D 105 LEFT JOIN
account_info ai ON ah . account_id =3D ai . account_id GROUP BY di .
disk_id , di . host_id , di . path , ah . spread
2004-09-07 15:58:50 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT count ( ah . account_id ) , ah . spread , di . disk_id
, di . host_id , di . path FROM disk_info di LEFT JOIN account_home ah
ON di . disk_id =3D ah . disk_id AND ah . spread =3D 105 LEFT JOIN
account_info ai ON ah . account_id =3D ai . account_id GROUP BY di .
disk_id , di . host_id , di . path , ah . spread
2004-09-07 15:58:53 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 20
2004-09-07 15:58:53 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 20
2004-09-07 15:58:53 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 20
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM value_domain_code WHERE code_str =3D
'account_names'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM value_domain_code WHERE code_str =3D
'account_names'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM value_domain_code WHERE code_str =3D
'account_names'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authentication_code WHERE code_str =3D
'crypt3-DES'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authentication_code WHERE code_str =3D
'crypt3-DES'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authentication_code WHERE code_str =3D
'crypt3-DES'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authoritative_system_code WHERE code_str =3D
'Cached'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authoritative_system_code WHERE code_str =3D
'Cached'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authoritative_system_code WHERE code_str =3D
'Cached'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM person_name_code WHERE code_str =3D 'FULL'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM person_name_code WHERE code_str =3D 'FULL'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM person_name_code WHERE code_str =3D 'FULL'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_08456DE4"
CURSOR FOR SELECT ai . account_id , posix_uid , shell , gecos ,
entity_name , aa . auth_data , pg . posix_gid , pn . name , eq .
quarantine_type , eq . start_date , eq . disable_until , eq . end_date ,
ah . home , ah . disk_id FROM posix_user pu LEFT JOIN entity_quarantine
eq ON pu . account_id =3D eq . entity_id JOIN entity_spread es ON pu .
account_id =3D es . entity_id AND ( es . spread =3D 105 ) JOIN account_home
ah ON es . entity_id =3D ah . account_id JOIN account_info ai ON ai .
account_id =3D pu . account_id LEFT JOIN person_name pn ON pn . person_id
=3D ai . owner_id AND pn . source_system =3D 92 AND pn . name_variant =3D 1=
17
JOIN posix_group pg ON pu . gid =3D pg . group_id LEFT JOIN
account_authentication aa ON aa . account_id =3D pu . account_id AND aa .
method =3D 12 JOIN entity_name en ON en . entity_id =3D pu . account_id AND
en . value_domain =3D 14 ORDER BY ai . account_id
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_08456DE4"
CURSOR FOR SELECT ai . account_id , posix_uid , shell , gecos ,
entity_name , aa . auth_data , pg . posix_gid , pn . name , eq .
quarantine_type , eq . start_date , eq . disable_until , eq . end_date ,
ah . home , ah . disk_id FROM posix_user pu LEFT JOIN entity_quarantine
eq ON pu . account_id =3D eq . entity_id JOIN entity_spread es ON pu .
account_id =3D es . entity_id AND ( es . spread =3D 105 ) JOIN account_home
ah ON es . entity_id =3D ah . account_id JOIN account_info ai ON ai .
account_id =3D pu . account_id LEFT JOIN person_name pn ON pn . person_id
=3D ai . owner_id AND pn . source_system =3D 92 AND pn . name_variant =3D 1=
17
JOIN posix_group pg ON pu . gid =3D pg . group_id LEFT JOIN
account_authentication aa ON aa . account_id =3D pu . account_id AND aa .
method =3D 12 JOIN entity_name en ON en . entity_id =3D pu . account_id AND
en . value_domain =3D 14 ORDER BY ai . account_id
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_08456DE4"
CURSOR FOR SELECT ai . account_id , posix_uid , shell , gecos ,
entity_name , aa . auth_data , pg . posix_gid , pn . name , eq .
quarantine_type , eq . start_date , eq . disable_until , eq . end_date ,
ah . home , ah . disk_id FROM posix_user pu LEFT JOIN entity_quarantine
eq ON pu . account_id =3D eq . entity_id JOIN entity_spread es ON pu .
account_id =3D es . entity_id AND ( es . spread =3D 105 ) JOIN account_home
ah ON es . entity_id =3D ah . account_id JOIN account_info ai ON ai .
account_id =3D pu . account_id LEFT JOIN person_name pn ON pn . person_id
=3D ai . owner_id AND pn . source_system =3D 92 AND pn . name_variant =3D 1=
17
JOIN posix_group pg ON pu . gid =3D pg . group_id LEFT JOIN
account_authentication aa ON aa . account_id =3D pu . account_id AND aa .
method =3D 12 JOIN entity_name en ON en . entity_id =3D pu . account_id AND
en . value_domain =3D 14 ORDER BY ai . account_id
2004-09-07 15:58:54 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1082
2004-09-07 15:58:54 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1082
2004-09-07 15:58:54 [8100] LOG: statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1082
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_brk'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_brk'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_brk'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'permisjon'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'permisjon'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'permisjon'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autointsomm'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autointsomm'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autointsomm'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_sh'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_sh'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_sh'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_nystuden'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_nystuden'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_nystuden'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autoekstern'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autoekstern'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autoekstern'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'autostud'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'autostud'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'autostud'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'system'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'system'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'system'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_ftpuser'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_ftpuser'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_ftpuser'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autopassord'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autopassord'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autopassord'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'teppe'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'teppe'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'teppe'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'generell'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'generell'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'generell'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'slutta'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'slutta'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'slutta'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_stengt'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_stengt'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_stengt'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'svakt_passord'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'svakt_passord'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'svakt_passord'
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'nologin'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'nologin'
2004-09-07 15:58:54 [8100] LOG: statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'nologin'
------------------------------------------------------

Thanks in advance, any help will be appreciated.

--=20
Rafael Martinez, <r.m.guerrero (AT) usit (DOT) uio.no>
Center for Information Technology Services
University of Oslo, Norway


--=-zKSBTfyKK9ZZgU+4N8OZ
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQBBPe9/BhuKQurGihQRAqDYAJ4gRgWeTUKx8IEEg5ntdKw1tQsH1ACcCJ Wr
BAP4VJoJxy8oqUNOmrEi3Bs=
=wkAN
-----END PGP SIGNATURE-----

--=-zKSBTfyKK9ZZgU+4N8OZ--


Reply With Quote
  #2  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-07-2004 , 12:59 PM






Rafael Martinez Guerrero <r.m.guerrero (AT) usit (DOT) uio.no> writes:
Quote:
* Information from CORE dump we got with --enable-debug. We have
compiled a new version of postgres and run it through gdb with the core
dump we had/got from postgres without --enable-debug.=20
Okay, theoretically that works, but it might be smarter to install the
debug build and get a fresh core dump that definitely corresponds to it.

Quote:
#0 0xb734d07c in memcpy () from /lib/tls/libc.so.6

#1 0x0806bba8 in DataFill (data=3D0xb7489000 <Address 0xb7489000 out of
bounds>, tupleDesc=3D0x82fd554, value=3D0x82fd550, nulls=3D0xbfff7ec0 " n =
",
infomask=3D0x836e904c, bit=3D0x836e904f "\003\f") at heaptuple.c:139
If accurate, that says it's crashing here:

/* fixed-length pass-by-reference */
Assert(att[i]->attlen > 0);
data_length = att[i]->attlen;
--> memcpy(data, DatumGetPointer(value[i]), data_length);

which suggests either that att[i]->attlen is corrupt, or that the
computed length for the preceding column was wacko (leading to the
data pointer being moved to a silly address), or that the provided
value[i] is wrong. In the context at hand none of these seem especially
likely, but one of them must be the case. Can you look with jdb to
see what the value of i is, and print out the contents of the *(att[i])
struct? Also look at "data" and "value[i]" to see if they are sensible
pointers or not.

How reproducible is the crash --- does it happen every time you execute
this particular FETCH?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html


Reply With Quote
  #3  
Old   
Rafael Martinez
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-07-2004 , 04:22 PM



On Tue, 2004-09-07 at 19:58, Tom Lane wrote:

Quote:
Rafael Martinez Guerrero <r.m.guerrero (AT) usit (DOT) uio.no> writes:
* Information from CORE dump we got with --enable-debug. We have
compiled a new version of postgres and run it through gdb with the core
dump we had/got from postgres without --enable-debug.=20

Okay, theoretically that works, but it might be smarter to install the
debug build and get a fresh core dump that definitely corresponds to it.

It is late in Norway and we need to sleep, we will try this tomorrow
morning.


Quote:
#0 0xb734d07c in memcpy () from /lib/tls/libc.so.6

#1 0x0806bba8 in DataFill (data=3D0xb7489000 <Address 0xb7489000 out of
bounds>, tupleDesc=3D0x82fd554, value=3D0x82fd550, nulls=3D0xbfff7ec0 " n =
",
infomask=3D0x836e904c, bit=3D0x836e904f "\003\f") at heaptuple.c:139

If accurate, that says it's crashing here:

/* fixed-length pass-by-reference */
Assert(att[i]->attlen > 0);
data_length = att[i]->attlen;
--> memcpy(data, DatumGetPointer(value[i]), data_length);

which suggests either that att[i]->attlen is corrupt, or that the
computed length for the preceding column was wacko (leading to the
data pointer being moved to a silly address), or that the provided
value[i] is wrong. In the context at hand none of these seem especially
likely, but one of them must be the case. Can you look with jdb to
see what the value of i is, and print out the contents of the *(att[i])
struct? Also look at "data" and "value[i]" to see if they are sensible
pointers or not.

I got this from one of our developers (from the core dump generated by
7.3.7 without --enable-debug):
--------------------------------------
(gdb) inspect i
$1 = 1

(gdb) inspect att[i]
$2 = 0x82fd6e8

(gdb) inspect *att[i]
$3 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
attnum = 2, attndims = 0, attcacheoff = -1, atttypmod = 393220, attbyval
= 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
attislocal = 1 '\001', attinhcount = 0}

(gdb) inspect data
$4 = 0xb7489000 <Address 0xb7489000 out of bounds>

(gdb) inspect value[i]
$5 = 3054556648


Quote:
How reproducible is the crash --- does it happen every time you execute
this particular FETCH?

We are not sure about this. We did not log as much as we should in the
beginning. One thing is sure, the last time, it happens after this
FETCH. We have full logging on now and we will be able to know more
about this if/when it crash again.



Quote:
regards, tom lane

Thanks for your help. I hope you/we will be able to find out this, right
now is a big crisis for us.

--
Rafael Martinez, <r.m.guerrero (AT) usit (DOT) uio.no>
Center for Information Technology Services
University of Oslo, Norway



---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match


Reply With Quote
  #4  
Old   
Rafael Martinez
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-07-2004 , 04:51 PM



On Tue, 2004-09-07 at 23:36, Tom Lane wrote:
Quote:
Rafael Martinez <r.m.guerrero (AT) usit (DOT) uio.no> writes:
I got this from one of our developers (from the core dump generated by
7.3.7 without --enable-debug):

(gdb) inspect *att[i]
$3 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
attnum = 2, attndims = 0, attcacheoff = -1, atttypmod = 393220, attbyval
= 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
attislocal = 1 '\001', attinhcount = 0}

That looks reasonable ...

(gdb) inspect data
$4 = 0xb7489000 <Address 0xb7489000 out of bounds

(gdb) inspect value[i]
$5 = 3054556648

Hmm, what do you get from "x/10 3054556648" ? Also, it'd be worth
looking at the contents of *att[0] to see if that's also sensible,
as well as value[0] and wherever that points (if it's a pointer).

regards, tom lane
(gdb) x/10 3054556648
0xb610d5e8: 0x2f00000c 0x00000002 0x30170000
0x020c6172
0xb610d5f8: 0x00000000 0x00000000 0x00ae0000
0x0006002b
0xb610d608: 0x2f1c0913 0x0404b70b

(gdb) inspect att[0]
$1 = 0x82fd660

(gdb) inspect *att[0]
$2 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
attnum = 1, attndims = 0, attcacheoff = 0, atttypmod = 786436, attbyval
= 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
attislocal = 1 '\001', attinhcount = 0}

(gdb) inspect value[0]
$3 = 3054556612

(gdb) inspect *value[0]
$4 = 12


--
Rafael Martinez, <r.m.guerrero (AT) usit (DOT) uio.no>
Center for Information Technology Services
University of Oslo, Norway



---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org


Reply With Quote
  #5  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-07-2004 , 05:32 PM



Rafael Martinez <r.m.guerrero (AT) usit (DOT) uio.no> writes:
Quote:
Hmm, what do you get from "x/10 3054556648" ?

(gdb) x/10 3054556648
0xb610d5e8: 0x2f00000c 0x00000002 0x30170000 0x020c6172
0xb610d5f8: 0x00000000 0x00000000 0x00ae0000 0x0006002b
0xb610d608: 0x2f1c0913 0x0404b70b
Well, that's certainly not a sensible first word for a numeric field;
the first word should be a length and this obviously isn't.

A reasonable theory at this point is that the data on disk for this
table have gotten corrupted, probably in the way of a bad length value
for whatever field(s) lie between the two that are being extracted here.
That could result in a miscomputed address for the next field, which
seems to be what we're looking at.

What I would suggest doing next is backtracking to find out which
physical tuple this is on which disk page, and then dumping that out
with pg_filedump (or your tool of choice) so that we can verify or
disprove the hypothesis of bad stored data. If it is bad data, we'll
want to examine the whole page anyway to see if we can see any pattern
of corruption.

You should be able to find out the physical tuple involved by looking at
the "ecxt_scantuple" field of ExecTargetList's econtext parameter. Its
"val" field should point to something like this:

(gdb) p *econtext->ecxt_scantuple->val
$3 = {t_len = 276, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 1},
t_tableOid = 863135, t_datamcxt = 0x0, t_data = 0xc2c0fc48}

t_tableOid is the source table OID, ip_blkid is the page number (divided
into high and low 16-bit halves for arcane reasons), and ip_posid is the
tuple number on that page. You can also look at *t_data for additional
confirmation of what you are dealing with:

(gdb) p *econtext->ecxt_scantuple->val->t_data
$4 = {t_choice = {t_heap = {t_xmin = 42833, t_cmin = 0, t_xmax = 863136,
t_field4 = {t_cmax = 0, t_xvac = 0}}, t_datum = {datum_len = 42833,
datum_typmod = 0, datum_typeid = 863136}}, t_ctid = {ip_blkid = {
bi_hi = 0, bi_lo = 0}, ip_posid = 1}, t_natts = 16, t_infomask = 2320,
t_hoff = 32 ' ', t_bits = ""}

I'm using CVS tip to prepare this example, so the field layout is not
the same as what you'll see in 7.4, but there will be a t_ctid field
and it will probably have the same contents as what you saw in the
scantuple struct.

Once you have the table OID, discover its file node number:

regression=# select relfilenode from pg_class where oid = 863135;
relfilenode
-------------
863135
(1 row)

(These will often be the same, but don't assume so without verifying.)
And look up your database OID:

regression=# select oid from pg_database where datname = 'mydb';

Now the file you want to look at is $PGDATA/base/dboid/relfilenode.

If you are using pg_filedump (see http://sources.redhat.com/rhdb/)
then I'd recommend a command along the lines of

pg_filedump -i -f -R pagenum $PGDATA/base/dboid/relfilenode

to dump the page in the most useful format.

We'll need to know the table schema ("\d tabname") also to interpret
what's in the dump.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo (AT) postgresql (DOT) org


Reply With Quote
  #6  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-07-2004 , 07:52 PM



Rafael Martinez <r.m.guerrero (AT) usit (DOT) uio.no> writes:
Quote:
I got this from one of our developers (from the core dump generated by
7.3.7 without --enable-debug):

(gdb) inspect *att[i]
$3 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
attnum = 2, attndims = 0, attcacheoff = -1, atttypmod = 393220, attbyval
= 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
attislocal = 1 '\001', attinhcount = 0}
That looks reasonable ...

Quote:
(gdb) inspect data
$4 = 0xb7489000 <Address 0xb7489000 out of bounds

(gdb) inspect value[i]
$5 = 3054556648
Hmm, what do you get from "x/10 3054556648" ? Also, it'd be worth
looking at the contents of *att[0] to see if that's also sensible,
as well as value[0] and wherever that points (if it's a pointer).

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org


Reply With Quote
  #7  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-08-2004 , 02:30 PM



Rafael Martinez <r.m.guerrero (AT) usit (DOT) uio.no> writes:
Quote:
I send you the information you ask for.
This is really interesting. Looking at tuple 44 in the pg_filedump
output:

Quote:
Item 44 -- Length: 88 Offset: 4016 (0x0fb0) Flags: USED
XID: min (34365810) CMIN|XMAX: 0 CMAX|XVAC: 0
Block Id: 174 linp Index: 44 Attributes: 6 Size: 28
infomask: 0x0913
(HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_IN VALID)
t_bits: [0]: 0x2f

0fb0: 72610c02 00000000 00000000 0000ae00 ra..............
0fc0: 2c000600 13091c2f 0cb70404 0c000000 ,....../........
0fd0: 04000000 00004869 0c000000 03000000 ......Hi........
0fe0: 00009681 0c000000 03000000 00001002 ................
0ff0: 0c000000 02000000 00001230 0c000000 ...........0....
1000: 02000000 00001730 .......0
whereas what we saw in the gdb output was:

Quote:
(gdb) x/10 3054556648
0xb610d5e8: 0x2f00000c 0x00000002 0x30170000 0x020c6172
0xb610d5f8: 0x00000000 0x00000000 0x00ae0000 0x0006002b
0xb610d608: 0x2f1c0913 0x0404b70b
This evidently corresponds to data starting at offset 0ffc on the disk
page (the last few bytes of the gdb output match the start of tuple 43,
which is in the next higher part of the page --- note that the bytes are
being printed in opposite orders by pg_filedump and gdb). So somehow,
the byte at page offset 0fff got changed from 00 to 2f in memory, though
not on disk. If you still have the core dump, would you look at the
area surrounding address 3054556648 and see if there are any other
discrepancies between that and what is on disk?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo (AT) postgresql (DOT) org


Reply With Quote
  #8  
Old   
Tom Lane
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-08-2004 , 03:55 PM



Kjetil Torgrim Homme <kjetilho (AT) ifi (DOT) uio.no> writes:
Quote:
[Tom Lane]:
So somehow, the byte at page offset 0fff got changed
from 00 to 2f in memory, though not on disk.

indeed interesting. IMHO 0x0fff sounds more like a write to -1
(relative to the next page) than a random memory error.
Note though that that offset is only special with respect to locations
on disk; it was not a special address in memory. I'm wondering about
transient flakiness in your disk drive controller causing it to
sometimes return bad data. Have you run any read/write disk tests?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo (AT) postgresql (DOT) org so that your
message can get through to the mailing list cleanly


Reply With Quote
  #9  
Old   
Kjetil Torgrim Homme
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-08-2004 , 05:32 PM



[Tom Lane]:
Quote:
Kjetil Torgrim Homme <kjetilho (AT) ifi (DOT) uio.no> writes:
indeed interesting. IMHO 0x0fff sounds more like a write to -1
(relative to the next page) than a random memory error.

Note though that that offset is only special with respect to
locations on disk; it was not a special address in memory. I'm
wondering about transient flakiness in your disk drive controller
causing it to sometimes return bad data. Have you run any
read/write disk tests?
well, the hardware RAID1 controller claims the disks are healthy, but
of course it would. I was unable to find any software to do such
testing, the closest I got was badblocks(8), but it only wants to work
on a block device.

I hacked good old bonnie to write blocks of a random byte, then read
them back and see if they match. I'll run a few such processes
concurrently, we'll see if it suffices to trigger (and discover) any
hardware bugs.
--
Kjetil T.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo (AT) postgresql (DOT) org so that your
message can get through to the mailing list cleanly


Reply With Quote
  #10  
Old   
Kjetil Torgrim Homme
 
Posts: n/a

Default Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug? - 09-10-2004 , 07:45 AM



we got a new coredump of 7.3.7 today. this instance was running on a
freshly installed computer, to eliminate(?) all hardware issues. it's
still the same brand and model, though. the old system has been
running hard disk tests 30+ hours with no errors yet.

the core dump happens at the same place in the code, and this time we
got a complete backtrace:

(gdb) bt
#0 0xb734d07c in memcpy () from /lib/tls/libc.so.6
#1 0x0806bba8 in DataFill (data=3D0xb7488fff "", tupleDesc=3D0x82899a0,=20
value=3D0x8289980, nulls=3D0xbfffd3c0 " n \"", infomask=3D0x8806b=
04c,=20
bit=3D0x8806b04f "=EF\001") at heaptuple.c:139
#2 0x0806c3ee in heap_formtuple (tupleDescriptor=3D0x8279ec0, value=3D0x82=
89980,=20
nulls=3D0xbfffd3c0 " n \"") at heaptuple.c:623
#3 0x080d1af1 in ExecTargetList (targetlist=3D0x8278298, nodomains=3D9,=20
targettype=3D0x8279ec0, values=3D0x8289980, econtext=3D0x8279a60,=20
isDone=3D0xbfffd468) at execQual.c:2230
#4 0x080d1cdb in ExecScan (node=3D0x827a208, accessMtd=3D0xbfffd468)
at execScan.c:49
#5 0x080d1d7d in ExecScan (node=3D0x8278c70, accessMtd=3D0x80d7c58 <SeqNex=
t+24>)
at execScan.c:146
#6 0x080d7cfb in InitScanRelation (node=3D0x82899a0, estate=3D0x8278c70,=
=20
scanstate=3D0xbfffd4c8) at nodeSeqscan.c:162
#7 0x080cfd86 in ExecProcNode (node=3D0x8289bf8, parent=3D0x0)
at execProcnode.c:315
#8 0x080cecf3 in ExecutePlan (estate=3D0x8279c90, plan=3D0x8278c70,=20
operation=3DCMD_SELECT, numberTuples=3D0, direction=3D136878496,=20
destfunc=3D0x82899c8) at execMain.c:964
#9 0x080ce392 in ExecutorEnd (queryDesc=3D0x82899a0, estate=3D0x0)
at execMain.c:223
#10 0x0811d069 in ProcessQuery (parsetree=3D0x82899c8, plan=3D0x8278c70,=20
dest=3DRemote, completionTag=3D0xbfffd610 "") at pquery.c:251
#11 0x0811b7ed in pg_exec_query_string (query_string=3D0xbfffd610, dest=3DR=
emote,=20
parse_context=3D0x823d610) at postgres.c:844
#12 0x0811c64d in PostgresMain (argc=3D4, argv=3D0xbfffd850,=20
username=3D0x8238c69 "cerebrum") at postgres.c:2018
#13 0x0810413d in DoBackend (port=3D0x8238b38) at postmaster.c:2304
#14 0x08103cb2 in BackendStartup (port=3D0x8238b38) at postmaster.c:1935
#15 0x08102dad in ServerLoop () at postmaster.c:1016
#16 0x081027ea in PostmasterMain (argc=3D1, argv=3D0x8220170) at postmaster=
..c:797
#17 0x080e1234 in main (argc=3D1, argv=3D0xbfffe204) at main.c:217



(gdb) print *att[i]
$20 =3D {attrelid =3D 0, attname =3D {
data =3D "pageunits_total", '\0' <repeats 48 times>,=20
alignmentDummy =3D 1701273968}, atttypid =3D 1700, attstattarget =3D -1=
,=20
attlen =3D -1, attnum =3D 9, attndims =3D 0, attcacheoff =3D -1, atttypmo=
d =3D 393220,=20
attbyval =3D 0 '\0', attstorage =3D 109 'm', attisset =3D 0 '\0',=20
attalign =3D 105 'i', attnotnull =3D 0 '\0', atthasdef =3D 0 '\0',=20
attisdropped =3D 0 '\0', attislocal =3D 1 '\001', attinhcount =3D 0}
(gdb) print i
$21 =3D 8
(gdb) x/10 value[i]
0xb7190928: 0x2f00000b 0x00000000 0x00200000 0x00000207
0xb7190938: 0x00000314 0x01bf913d 0x10120000 0x00090020
0xb7190948: 0xef201553 0x00000001


the relevant code again is:

if (att[i]->attbyval)
[...]
else if (att[i]->attlen =3D=3D -1)
[...]
else if (att[i]->attlen =3D=3D -2)
[...]
else
{
/* fixed-length pass-by-reference */
Assert(att[i]->attlen > 0);
data_length =3D att[i]->attlen;
=3D=3D=3D> memcpy(data, DatumGetPointer(value[i]), data_length);
}

(gdb) print data_length
$25 =3D 788529163
(gdb) print att[i]->attlen
$26 =3D -1

how can att[i]->attlen possibly change in the interim? but
data_length looks corrupted, too.

(gdb) print *att[i-1]
$27 =3D {attrelid =3D 0, attname =3D {
data =3D "pageunits_paid", '\0' <repeats 49 times>,=20
alignmentDummy =3D 1701273968}, atttypid =3D 1700, attstattarget =3D -1=
,=20
attlen =3D -1, attnum =3D 8, attndims =3D 0, attcacheoff =3D -1, atttypmo=
d =3D 393220,=20
attbyval =3D 0 '\0', attstorage =3D 109 'm', attisset =3D 0 '\0',=20
attalign =3D 105 'i', attnotnull =3D 0 '\0', atthasdef =3D 0 '\0',=20
attisdropped =3D 0 '\0', attislocal =3D 1 '\001', attinhcount =3D 0}

also:

(gdb) print data
$39 =3D 0xb7488fff ""

which doesn't seem very aligned for an integer.

(gdb) print data[1]
Cannot access memory at address 0xb7489000

thank you for any insights.
--=20
Kjetil T.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo (AT) postgresql (DOT) org so that your
message can get through to the mailing list cleanly

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.