Designing for Recovery
03-23-2006, 01:49 PM
I have some questions about recovery in a multi-process environment.
I am trying to design my applications so that if a problem occurs,
resulting in a RUN_RECOVERY error, or a DB_LOCK_NOTGRANTED error (for
a suspiciously long time-out value), then the collection of processes
can recover successfully and continue processing.
What I'm planning to do is use the DB_REGISTER method, and accompany
that with the following behavior:
if ( process sees a RUN_RECOVERY error or an error because of a
     suspiciously long lock time-out )
{
    env->close();
    env->open( DB_RECOVER | DB_REGISTER );
}
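To make the intended control flow a bit more concrete, here is a minimal sketch of the retry loop I have in mind. It deliberately does not use the real library: FakeEnv, RecoveryNeeded and run_with_reconnect are hypothetical stand-ins I made up for illustration, and a real version would use DbEnv and test for the actual error returns. The sketch also assumes that a closed handle is thrown away and a fresh one created before reopening, rather than reusing the old handle.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for "this handle reported that recovery is needed".
// NOT a Berkeley DB type.
struct RecoveryNeeded : std::runtime_error {
    RecoveryNeeded() : std::runtime_error("environment needs recovery") {}
};

// Hypothetical toy environment handle, NOT the Berkeley DB API.
struct FakeEnv {
    bool opened = false;
    bool recovered = false;  // true if opened with the recover+register flags
    void open(bool recover_and_register) {
        opened = true;
        recovered = recover_and_register;
    }
    void close() { opened = false; }
};

// Runs op against the environment. If op reports that recovery is needed,
// the old handle is closed and discarded, a fresh handle is created and
// opened with recover+register, and op is retried. Gives up and returns
// false once op has failed max_attempts times. An error thrown by the
// reopen itself is not handled here and propagates to the caller.
template <typename EnvT, typename Op>
bool run_with_reconnect(std::unique_ptr<EnvT>& env, Op op, int max_attempts)
{
    assert(env != nullptr);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        try {
            op(*env);
            return true;
        } catch (const RecoveryNeeded&) {
            // Closing a broken environment may itself fail; ignore that.
            try { env->close(); } catch (...) { }
            env.reset(new EnvT());  // never reuse the old handle
            env->open(true);        // stands in for DB_RECOVER | DB_REGISTER
        }
    }
    return false;
}
```

Every process would wrap its database work in something like run_with_reconnect, so that whichever process notices the problem first reopens with recovery, and the others follow the same path when they next see the error.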
I've looked at some of the code in dbenv_register.c, and think I
understand how DB_REGISTER works, but I've gotten lazy and haven't
tried to read all the details about recovery. The documentation states
that if one process starts doing a recovery on an environment which is
currently in use by other processes, then it will cause all other
processes using the deleted environment to get an error. I've seen
this happen in my prototype once (by accident: I had a running process
and started another which used only DB_RECOVER without DB_REGISTER).
What I'm hoping will happen is that the process that starts the
recovery will in turn cause all other processes to likewise get the
RUN_RECOVERY error, and execute the above code, resulting in a call to
dbenv->open( DB_REGISTER | DB_RECOVER ). At this point, the processes
may serialize based on the exclusive lock that the recovery process
uses on the environment. Once the recovery is done, the other processes
will end up successfully registering and conclude that recovery is not
needed. The system is then stable once again.
My questions about this are:
1) Am I correct in my assumptions above about the effect of starting
recovery on an environment that is in use by other processes? (i.e.
the other processes start seeing RUN_RECOVERY errors?) I just don't
want them to see SEGVs or something equally nasty. Put another way, is
it necessary to make all other processes disconnect from the
environment before recovery can safely begin?
2) Is there any risk of a peer process never being signalled to
disconnect from the environment? In other words, if a process sleeps
right through the error and subsequent recovery, and later makes a call
on its old DbEnv* long after the recovery has completed, is it still
guaranteed to get the RUN_RECOVERY error so that it will know to
disconnect and reconnect?
Thanks. Any answers to these questions are appreciated.
- Bob