Re: DB environment complete lockout: is there a way to investigate?
02-01-2007, 06:50 AM
Hi All,
I do not use the DB_THREAD option. Instead, by convention each thread
creates its own environment and database handles, so no two threads
ever use the same environment or database handle concurrently. That
costs some overhead in the number of BerkeleyDB resources involved, but
it is sufficient for my program, and on the other hand there is no need
to dynamically allocate memory for BerkeleyDB get() results.
I investigated further recently and found that, before the environment
locks out, the database gets locked first: a thread blocks in a 'get'
call while the environment can still be accessed. Stack trace of the
application blocked in the 'get' call:
IOT/Abort trap in __lock_promote at 0x10065314 ($t11)
0x10065314 (__lock_promote+0x80) 7e5a0214 add r18,r26,r0
(dbx) where
__lock_promote() at 0x10065314
__lock_put_internal() at 0x1006480c
__lock_put_nolock_20_3() at 0x10066fd8
__lock_put() at 0x10065944
__db_c_close() at 0x10074cbc
__db_c_cleanup() at 0x10072308
__db_c_get() at 0x10074230
__db_c_pget() at 0x10072bd4
__db_pget() at 0x1007909c
__db_pget_pp() at 0x10079284
get__5TableFPvUls(0x20004d48, 0x2c49a420, 0x8, 0x10001) at 0x100615e0
At that point, a new attempt to open the environment succeeded, but
db_open blocked. From the lock stats I could see that each time db_open
is called, "Total number of locks not immediately available due to
conflicts" increases by one. There were 0 deadlocks. While the
application is blocked in the get() call, I assume that having another
program open the BDB 4.4+ environment with the DB_REGISTER | DB_RECOVER
flags would not help to avoid the outage, because the application is
blocked while it already has the environment open. I may be wrong here,
though.
The complete environment lockout seems to occur later, when the
application process has to be killed because its threads are blocked in
get() and some more unsuccessful attempts are made to open the
locked-out database. When that happens, a program that cannot access
the environment has the following stack trace:
IOT/Abort trap in _global_lock_common at 0xd0051630 ($t1)
0xd0051630 (_global_lock_common+0x200) 80410014 lwz r2,0x14(r1)
(dbx) where
_global_lock_common(??, ??, ??) at 0xd0051630
pthread_mutex_lock(??) at 0xd004feb4
__db_pthread_mutex_lock() at 0x1000f1d8
__db_e_attach() at 0x1002c4d0
__dbenv_open() at 0x1002a7e0
attach__5TableFPCciUlT3N22(0x20000a38, 0x100bffb8, 0x1, 0x8, 0x8, 0x0, 0x0) at 0x100021a0
The lock stats look as follows (db_stat run with the -N option):
Default locking region information:
105 Last allocated locker ID
0x7fffffff Current maximum unused locker ID
5 Number of lock modes
5000 Maximum number of locks possible
5000 Maximum number of lockers possible
5000 Maximum number of lock objects possible
18 Number of current locks
21 Maximum number of locks at any one time
26 Number of current lockers
29 Maximum number of lockers at any one time
5 Number of current lock objects
6 Maximum number of lock objects at any one time
1798034 Total number of locks requested
1798015 Total number of locks released
0 Total number of lock requests failing because DB_LOCK_NOWAIT was set
2 Total number of locks not immediately available due to conflicts
0 Number of deadlocks
0 Lock timeout value
0 Number of locks that have timed out
0 Transaction timeout value
0 Number of transactions that have timed out
2MB 336KB The size of the lock region
60 The number of region locks that required waiting (0%)
....
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock REGINFO information:
Lock Region type
3 Region ID
__db.003 Region name
0x3004a000 Original region address
0x3004a000 Region address
0x3029df40 Region primary address
0 Region maximum allocation
0 Region allocated
REGION_JOIN_OK Region flags
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by object:
Locker Mode Count Status ----------------- Object ---------------
d READ 1 HELD (7 f20022 64280d65 eb0d0 0) handle 0
2b READ 1 HELD (7 f20022 64280d65 eb0d0 0) handle 0
31 READ 1 HELD (7 f20022 64280d65 eb0d0 0) handle 0
37 READ 1 HELD (7 f20022 64280d65 eb0d0 0) handle 0
3a READ 1 HELD (7 f20022 64280d65 eb0d0 0) handle 0
f READ 1 HELD (8 f20022 64287138 103770 0) handle 0
2d READ 1 HELD (8 f20022 64287138 103770 0) handle 0
33 READ 1 HELD (8 f20022 64287138 103770 0) handle 0
39 READ 1 HELD (8 f20022 64287138 103770 0) handle 0
3c READ 1 HELD (8 f20022 64287138 103770 0) handle 0
a READ 1 HELD (9 f20022 6a50cb81 17d0dc 0) handle 0
28 READ 1 HELD (9 f20022 6a50cb81 17d0dc 0) handle 0
2e READ 1 HELD (9 f20022 6a50cb81 17d0dc 0) handle 0
34 READ 1 HELD (9 f20022 6a50cb81 17d0dc 0) handle 0
2a READ 1 HELD (a f20022 6a517d89 19577c 0) handle 0
30 READ 1 HELD (a f20022 6a517d89 19577c 0) handle 0
36 READ 1 HELD (a f20022 6a517d89 19577c 0) handle 0
2f READ 3 HELD 0x10f108 len: 20 data:
0000000x09000xf200"jP0xcb0x81000x170xd00xdc0000000 0
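For anyone who wants to tally a listing like the one above, here is a
minimal sketch in Python 3. It assumes lock lines of the form shown
(locker, mode, count, status, parenthesized object); summarize and
LOCK_LINE are names made up for this illustration and are not part of
BerkeleyDB or db_stat. It groups lockers by object and reports any lock
whose status is something other than HELD. The line for locker 2f is
skipped on purpose, because its object is not printed in the
parenthesized form.

```python
import re
from collections import defaultdict

# Excerpt of the "Locks grouped by object" listing above, one lock per
# line, with the "handle 0" suffix on the same line as its lock.
SAMPLE = """\
d READ 1 HELD (7 f20022 64280d65 eb0d0 0) handle 0
2b READ 1 HELD (7 f20022 64280d65 eb0d0 0) handle 0
f READ 1 HELD (8 f20022 64287138 103770 0) handle 0
a READ 1 HELD (9 f20022 6a50cb81 17d0dc 0) handle 0
2a READ 1 HELD (a f20022 6a517d89 19577c 0) handle 0
"""

# Assumed line layout: hex locker ID, lock mode, count, status, then the
# object in parentheses. Lines that do not fit this layout are ignored.
LOCK_LINE = re.compile(
    r"^\s*(?P<locker>[0-9a-f]+)\s+(?P<mode>[A-Z_]+)\s+(?P<count>\d+)\s+"
    r"(?P<status>[A-Z]+)\s+(?P<object>\(.*?\))"
)


def summarize(text):
    """Return ({object: [locker, ...]}, [(locker, mode, status), ...]).

    The second element lists only locks whose status is not HELD.
    """
    by_object = defaultdict(list)
    not_held = []
    for line in text.splitlines():
        m = LOCK_LINE.match(line)
        if m is None:
            continue  # header, separator, or a line in another format
        by_object[m["object"]].append(m["locker"])
        if m["status"] != "HELD":
            not_held.append((m["locker"], m["mode"], m["status"]))
    return dict(by_object), not_held


if __name__ == "__main__":
    by_object, not_held = summarize(SAMPLE)
    for obj, lockers in by_object.items():
        print(f"{obj}: {len(lockers)} lock(s), lockers {', '.join(lockers)}")
    print("locks not in HELD state:", not_held or "none")
```

Run against the excerpt, it finds four distinct objects and no lock in a
state other than HELD, which matches what the listing itself shows.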
It would be interesting to find out what is wrong here. My
understanding, however, is that I first need to find out why the
application thread gets blocked in the get() call, since that seems to
be the first symptom of things going wrong.
Again, thanks a lot for your replies.
Regards,
Oleksandr