
Application lockup (weird locker id, 4.3.28)

comp.databases.berkeley-db





Mika Iisakkila
 

Application lockup (weird locker id, 4.3.28) - 01-23-2006, 08:40 AM






I'm trying to debug seemingly random application level lockups with
DB 4.3.28 (on HP-UX 11.23). Can anyone explain what is going on here:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Default locking region information:
118035 Last allocated locker ID
0x7fffffff Current maximum unused locker ID
9 Number of lock modes
50000 Maximum number of locks possible
50000 Maximum number of lockers possible
50000 Maximum number of lock objects possible
60 Number of current locks
82 Maximum number of locks at any one time
81 Number of current lockers
111 Maximum number of lockers at any one time

[...]

1cc12 READ 1 PENDING /mail/etc/mailboxes.db page 971
8008fd6c WRITE 1 WAIT /mail/etc/mailboxes.db page 971
1cc3c READ 1 WAIT /mail/etc/mailboxes.db page 971
1cc44 READ 1 WAIT /mail/etc/mailboxes.db page 971
1cc5a READ 1 WAIT /mail/etc/mailboxes.db page 971
1cc8e READ 1 WAIT /mail/etc/mailboxes.db page 971
1ccaa READ 1 WAIT /mail/etc/mailboxes.db page 971
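
To make the queue on page 971 easier to read, here is a minimal sketch that groups the entries above by object, mode and status. It assumes the columns are locker ID (hex), lock mode, count, status and object; the function name `summarize_locks` is just something made up for this illustration and is not part of Berkeley DB.

```python
from collections import defaultdict

# Sample lines copied from the db_stat output above.
LOCK_DUMP = """\
1cc12 READ 1 PENDING /mail/etc/mailboxes.db page 971
8008fd6c WRITE 1 WAIT /mail/etc/mailboxes.db page 971
1cc3c READ 1 WAIT /mail/etc/mailboxes.db page 971
1cc44 READ 1 WAIT /mail/etc/mailboxes.db page 971
1cc5a READ 1 WAIT /mail/etc/mailboxes.db page 971
1cc8e READ 1 WAIT /mail/etc/mailboxes.db page 971
1ccaa READ 1 WAIT /mail/etc/mailboxes.db page 971
"""


def summarize_locks(dump):
    """Group lock entries by object, then by (mode, status)."""
    summary = defaultdict(lambda: defaultdict(list))
    for line in dump.splitlines():
        # Assumed columns: locker, mode, count, status, object description.
        fields = line.split(None, 4)
        if len(fields) != 5:
            continue  # skip blank or unrecognized lines
        locker, mode, _count, status, obj = fields
        try:
            locker_id = int(locker, 16)  # locker IDs are printed in hex
        except ValueError:
            continue  # not a lock entry (e.g. a header line)
        summary[obj][(mode, status)].append(locker_id)
    return summary


if __name__ == "__main__":
    for obj, groups in summarize_locks(LOCK_DUMP).items():
        print(obj)
        for (mode, status), lockers in groups.items():
            ids = ", ".join(f"{locker:x}" for locker in lockers)
            print(f"  {mode:5} {status:8} {len(lockers)} locker(s): {ids}")
```

For the lines above this reports one PENDING reader, one waiting writer and five waiting readers, all on the same page.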

I believe this could be a bug in the application leading to a locking
problem, but isn't that locker ID waiting for a WRITE impossibly large?

The entire output of db_stat -E (after shutting down the application)
is available here:

http://users.tkk.fi/~iisakkil/temp/db_stat.txt

DB was compiled with gcc 3.4.4, with no other notable configuration
options than --with-mutex=HP/msem_init.
--
http://www.hut.fi/u/iisakkil/ --Foo.

ubell@sleepycat.com
 

Re: Application lockup (weird locker id, 4.3.28) - 01-23-2006, 01:40 PM






Mika,

Locker ids in the "top half" of the range belong to transactions while
those in the "bottom half" are for non-transactional cursor operations.
The fact that you have a PENDING lock means that the thread which is
using locker id 1cc12 has been granted the lock but has not been
scheduled yet. It should be difficult to see a lock in this state
unless the thread has exited or there is some problem with the thread
scheduler that is preventing it from running. Note that threads should
not handle interrupts while waiting on events inside the Berkeley DB
library unless they return from the interrupt without blocking or
making other Berkeley DB library calls.
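
As a rough illustration of the "top half" / "bottom half" split, the sketch below classifies the two kinds of locker ID from the db_stat output. The boundary of 0x80000000 is an assumption, taken as the value just above the 0x7fffffff "maximum unused locker ID" that db_stat reports; the names `TXN_ID_MINIMUM` and `locker_kind` are made up for this example.

```python
# Assumption: transactional locker IDs start at 0x80000000, i.e. just above
# the 0x7fffffff "Current maximum unused locker ID" shown by db_stat.
TXN_ID_MINIMUM = 0x80000000


def locker_kind(locker_id):
    """Classify a locker ID as transactional or non-transactional."""
    if locker_id >= TXN_ID_MINIMUM:
        return "transactional"
    return "non-transactional"


# db_stat prints locker IDs in hex, so parse them with base 16.
for text in ("1cc12", "8008fd6c"):
    print(text, int(text, 16), locker_kind(int(text, 16)))
```

Read this way, 8008fd6c is simply a transaction's locker, and 1cc12 is 117778 in decimal, which sits just below the "Last allocated locker ID" of 118035.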

Michael Ubell
Sleepycat Software.


Mika Iisakkila
 

Re: Application lockup (weird locker id, 4.3.28) - 01-26-2006, 04:25 AM



ubell (AT) sleepycat (DOT) com writes:
Quote:
Locker ids in the "top half" of the range belong to transactions while
those in the "bottom half" are for non-transactional cursor operations.

Thanks, it makes a lot more sense now.

Quote:
The fact that you have a PENDING lock means that the thread which is
using locker id 1cc12 has been granted the lock but has not been
scheduled yet. It should be difficult to see a lock in this state
unless the thread has exited or there is some problem with the thread
scheduler that is preventing it from running.

I've been debugging the situation (it can take a couple of days to
reproduce...) and I'm beginning to think it must be some kind of a
mutex problem in BDB or my build of it.

When the app gets stuck, the processes, all single-threaded, get stuck
waiting on a semaphore (msem_lock() keeps returning EAGAIN). Even
after I shut down all processes, the situation persists -- e.g. db_stat
hangs doing the same thing. Application recovery gets it running
again, but of course the database locks remain. I'm planning to try
again, this time with --with-mutex=HPPA/gcc-assembly, but I'm afraid
I'd only be masking a problem.

Using the pthread library for mutexes didn't work; the utilities spew
errors like "db_stat: unable to lock mutex: Invalid argument" and the
database gets corrupted during concurrent access. I recall db-3.3
worked with pthreads, but that was using HP's compiler; I have gcc now.

The application BTW is Cyrus IMAP 2.2.12, and the problem is in the
transactional mailbox database. I believe it has seen so much use that
I wouldn't be the only one experiencing a database handling bug, if
there is one. All processes seem to exit cleanly as designed and I
don't see anything interesting in the logs.

Quote:
Note that threads should
not handle interrupts while waiting on events inside the Berkeley DB
library unless they return from the interrupt without blocking or
making other Berkeley DB library calls.

Just for clarity, are we speaking of "threads" as in real multithreaded
applications, or "threads of control" as in several separate processes?
I'm reviewing the code, but I don't think it does that.
--
http://www.hut.fi/u/iisakkil/ --Foo.


ubell@sleepycat.com
 

Re: Application lockup (weird locker id, 4.3.28) - 01-31-2006, 11:18 AM




Mika Iisakkila wrote:
Quote:
When the app gets stuck, the processes, all single-threaded, get stuck
waiting on a semaphore (msem_lock() keeps returning EAGAIN).

msem_lock would indicate that you are using test-and-set mutexes,
not pthread mutexes.

Quote:
Just for clarity, are we speaking of "threads" as in real multithreaded
applications, or "threads of control" as in several separate processes?

This would be true of any thread of control. You cannot have signal
handlers that do recursive calls into the BDB library.

Michael Ubell
Sleepycat Software.



Mika Iisakkila
 

Re: Application lockup (weird locker id, 4.3.28) - 02-01-2006, 06:43 AM



ubell (AT) sleepycat (DOT) com writes:
Quote:
msem_lock would indicate that you are using test-and-set mutexes,
not pthread mutexes.

Yes, HP/msem_init. pthreads didn't work -- they compile fine, but
even on a freshly created database, the utilities spew errors like
"db_stat: unable to lock mutex: Invalid argument" and the database
got corrupted as soon as I got concurrent access running. The DB-4.3
binary packages for PA-RISC / HP-UX 11.23 available on
hpux.connect.org.uk had the same problem, which is why I took on
compiling the libraries myself in the first place.

I've been running the test suite for 20 hours now. If that doesn't
reveal anything, I'll probably try with another mutex implementation.
So far I've got "architecture does not support locks inside system
shared memory", but the intended application doesn't use
DB_ENV_SYSTEM_MEM so I guess that's not a problem.

HPPA/gcc-assembly apparently compiles into a single machine
instruction, so I suppose it should be fast, but what other
implications does the mutex selection have?
--
http://www.hut.fi/u/iisakkil/ --Foo.

