Asynchronous Replication - 02-18-2005 , 12:57 AM
I have been trying to understand how replication works in a database.
I want to use TCP to send updates to a client environment, but I do not
want to slow down the primary threads of control.
My reading of the replication documentation is that the transmission of
updates occurs synchronously within the thread doing the update to the
master. Is this correct?
Is it possible to start a whole other process, possibly incrementally,
that attaches to the replication-enabled environment and then uses TCP
to do all or part of the synchronization?
If not, would the best way to achieve this be to create a pipe to
another process, have the synchronous synchronization code write to
the pipe, and then have an outboard process read from the pipe and send
the TCP data to the client?
Or am I on the completely wrong track?
Re: Asynchronous Replication - 02-18-2005 , 07:14 AM
gerard.nicol (AT) tapetrack (DOT) com wrote:
It is true that your application's 'send' callback function is called
from somewhere within DbTxn::commit. So if your callback blocks then
yes, the thread writing to the database is blocked (and transaction
locks are held) until your callback returns.
But your callback is free to do whatever it likes. e.g. If you don't
care about receiving acknowledgements from replicas then your callback
can simply copy the message into a buffer and enqueue it for another
thread to send later. Your callback can then return immediately. So
you don't need a separate replication process to do what you want.
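To make that concrete, here is a minimal sketch of such a non-blocking 'send' callback: it copies the message into an owned buffer and hands it off for a sender thread to transmit later. The names (MessageQueue, rep_send_nonblocking) are illustrative, not part of the Berkeley DB API.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Thread-safe queue that a background sender thread drains.
class MessageQueue {
public:
    void push(std::string msg) {
        std::lock_guard<std::mutex> lk(mu_);
        q_.push(std::move(msg));
        cv_.notify_one();
    }
    // Blocks until a message is available; called by the sender thread.
    std::string pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        std::string msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
    size_t size() {
        std::lock_guard<std::mutex> lk(mu_);
        return q_.size();
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
};

// The callback copies the message and returns immediately, so the
// thread inside DbTxn::commit is never blocked on network IO.
int rep_send_nonblocking(MessageQueue& out, const void* data, size_t len) {
    out.push(std::string(static_cast<const char*>(data), len));
    return 0;  // report success to the library right away
}
```

The key point is that the copy is cheap compared to a network round trip; durability of the message is now the sender thread's problem, which is exactly the trade-off discussed below.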
However, if you don't wait for 'IS_PERM' acks (for the current LSN or
higher) from some number (usually >50%) of replicas, then your
transaction may not be durable if the current master fails.
The scheme I found to work well is to block the 'send' callback waiting
for >50% of IS_PERM acks (for the current LSN or higher), _and_ use
DB_TXN_NOSYNC on all sites to avoid waiting for the transaction log to
be flushed to disk. This gave high-availability with very good
distributed-durability characteristics, plus a significant performance
increase over a non-replicated version of the application (which did
not use DB_TXN_NOSYNC, because it wanted durability).
Sleepycat mentions this potential performance improvement in the
replication docs, and I found it to be quite real.
As it happens, my 'send' callback did buffer the messages and enqueue
them to be sent by a separate communications threadpool, but then it
chose to block on events (or a condition variable) that the
communications threads signalled when the IS_PERM ack (for the current
LSN or higher) came back. (The 'send' function then counted the events
and waited for >50% of currently-connected sites.) But this is purely
an implementation detail peculiar to my application architecture --
there's nothing wrong with having your 'send' function do network IO
directly -- although you almost certainly want a way to enforce a
timeout.
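The wait-for-majority part of that scheme can be sketched as follows: the communications threads call ack() as IS_PERM acks arrive, and the 'send' path blocks until strictly more than half of the connected sites have acknowledged, or a timeout expires. The AckWaiter class and its interface are my illustration, not anything from the Berkeley DB API.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Counts IS_PERM acks for one LSN and lets a waiter block until a
// majority of connected sites have acknowledged.
class AckWaiter {
public:
    explicit AckWaiter(int connected_sites)
        : sites_(connected_sites), acks_(0) {}

    // Called by a communications thread when an ack arrives.
    void ack() {
        std::lock_guard<std::mutex> lk(mu_);
        ++acks_;
        cv_.notify_all();
    }

    // Returns true if >50% of connected sites acknowledged within the
    // timeout, false otherwise (the "what now?" case discussed below).
    bool wait_for_majority(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(mu_);
        return cv_.wait_for(lk, timeout,
                            [this] { return acks_ * 2 > sites_; });
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    int sites_;
    int acks_;
};
```

Note the predicate is checked under the lock on every notification, so late or duplicate acks are harmless, and the timeout gives you the escape hatch the next paragraph worries about.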
The interesting question is, what do you do if you never receive
IS_PERM acks from enough sites? Do you block forever (as the
Sleepycat documents suggest)? That means your application won't ever
see a successful return from DbTxn::commit for a transaction that has
not been durably replicated, but that nice guarantee comes at a very
high cost; your system becomes completely unavailable for writes for
some indefinite period of time. And it becomes unavailable in a
particularly unpleasant way -- all writer threads block in commit (or
earlier, on a lock held by a blocked transaction), and of course those
locks will start to block reader threads too.
Alternatively you can time out your wait. If your 'send' callback then
returns false, the master will flush its transaction log to disk (if
you are using DB_TXN_NOSYNC), thus making the transaction locally
durable at least. However, that's not much consolation if the master
fails shortly afterwards -- your application might now have seen a
successful commit for a transaction that is now lost.
Yet another option on timeout is simply to kill the current Master
immediately -- i.e. call abort() -- and let an election happen. That's
fairly drastic, but can be the right choice in some circumstances; your
application never sees a successful commit that has not been durably
replicated, and at least threads aren't in danger of disappearing into
a black hole, as they do in the 'wait forever' option.
It is unfortunate that there is no direct way to pass the
'committed-but-not-durably-replicated' condition back from your 'send'
callback function through to the return code of DbTxn::commit. I
suggested this to Sleepycat but they couldn't find a way to implement
it. Of course, your 'send' callback could always set a flag in some
global application state, and a wrapper around the commit call could
check that state.
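A sketch of that flag-and-wrapper idea, assuming a single-threaded commit path for simplicity (a real multi-threaded application would need per-transaction state): the 'send' callback sets the flag on timeout, and a wrapper around the commit call inspects it afterwards. CommitResult and commit_checked are names I made up for illustration.

```cpp
#include <atomic>

// Set by the 'send' callback when it gives up waiting for acks.
std::atomic<bool> g_not_durably_replicated{false};

enum class CommitResult { Ok, LocalOnly };

// Wraps the commit call (e.g. txn->commit(0) in the real application)
// and reports whether the transaction was durably replicated.
template <typename CommitFn>
CommitResult commit_checked(CommitFn commit) {
    g_not_durably_replicated.store(false);
    commit();  // the 'send' callback runs somewhere inside here
    return g_not_durably_replicated.load() ? CommitResult::LocalOnly
                                           : CommitResult::Ok;
}
```

The caller then sees LocalOnly for exactly the 'committed-but-not-durably-replicated' case, which is the information DbTxn::commit itself cannot return.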
I think the _ideal_ way of handling such a timeout would be for the
'send' callback to return a failure code that tells the local site
(Master) to abort this transaction -- i.e. the DbTxn::commit call would
throw or return a failure condition and the transaction would be rolled
back. Unfortunately this is not possible, as the 'send' callback is
called after the Master has written the commit to the local transaction
log, and this cannot be undone. So if you want that level of
replication consistency guarantee, you need to build an
application-level replication system using full two-phase commit; i.e.
DbTxn::prepare and DbEnv::txn_recover, plus an independent
GlobalTransactionManager. That's not too hard to do, but each
transaction now incurs (and blocks on) significantly more network
messages, so there is a throughput/performance cost. As compensation,
such a system would give you multi-master replication -- all sites can
write. Of course, you then have to worry about distributed-deadlock
detection, as Sleepycat's deadlock detector only looks for cycles in
the local waiting-for-locks graph, not the overall system graph.
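The shape of that application-level two-phase commit can be sketched abstractly: the global transaction manager asks every site to prepare and commits only if all vote yes. The Participant struct is illustrative; in the real system prepare() would wrap DbTxn::prepare, and crash recovery would use DbEnv::txn_recover to find prepared-but-unresolved transactions.

```cpp
#include <functional>
#include <vector>

// One site participating in a global transaction.
struct Participant {
    std::function<bool()> prepare;  // phase 1: returns this site's vote
    std::function<void()> commit;   // phase 2a: decision was commit
    std::function<void()> abort;    // phase 2b: decision was abort
};

// Returns true iff the global transaction committed on all sites.
bool two_phase_commit(std::vector<Participant>& sites) {
    // Phase 1: collect votes; any 'no' (or failure) aborts everywhere.
    for (auto& s : sites)
        if (!s.prepare()) {
            for (auto& a : sites) a.abort();
            return false;
        }
    // Phase 2: all voted yes, so the decision is commit.
    for (auto& s : sites) s.commit();
    return true;
}
```

Even this toy version shows where the extra cost comes from: every global transaction blocks on a prepare round trip to every site before any site can commit.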
Rather than do anything complicated, I chose to use the
transaction-timeout feature to detect distributed deadlocks, as they
are relatively rare.
I hope this helps.
Re: Asynchronous Replication - 02-18-2005 , 01:32 PM
Thanks for taking the time to give such a detailed answer.
You have basically confirmed my understanding of the process.
few minutes to flush the file.
This seems a little backward, considering my understanding is that
these records are already written out to the log files.
So would an acceptable approach be to write a command line utility that
was run by cron every few minutes that read the environment log files
and synced off them. Then at the end did the equivalent of a