![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
It's about the transactional replication and the publisher's transaction log. The database that is being replicated is in 'simple' recovery model. The problem is that once in a while the transaction log grows 3, 4 times bigger than its normal size. Over the time it stabilized it self to 4 GB, but it happens to automatically grow to 12-15 GB, and last time it got to 62 GB when the replication failed and the database stopped to respond as well due to no more space on the hard disk; the log reader agent was timing out. I figure that was because it could not check the hole 62 GB file in 5 minutes which is the default time out for gathering the default ReadBatchSize of 500 transactions (this doesn't happen when it grows to 'only' 12 GB). The next day I fixed the problem by killing all sessions on that database and making some more room for the log file for the sessions to roll back. Then I changed the ReadBatchSize to 20 and the time out to 50000. I noticed that the log reader immediately returned the 957 transactions and 1914 commands outstanding in the transaction log file (the same number returned by sp_repltrans when ran earlier) but that it took about 1 hour to get the 'No replicated transactions are available' message. Now, the questions: 1. After MinLSN got past the 957 transaction that where holding the Checkpoint from truncating the log, why the log reader had to go trough all transactions to the end of the log file (I think it did so since it took 1 hour to report 'No replicated transactions are available'.) I thought that since those where the only transactions marked for replication and the rest of the transactions in the log file where either committed either rolled back the Checkpoint will truncate most of the log file and the log reader will jump and resume the checking from the new MinLSN. If so, it should have finished in a few seconds. I should mention that because the 'no more space' issue SQL Server logged a message that it stopped the automatic Checkpoint, however I manually ran the Checkpoint command several times after bringing the database back online and before clearing the replication. That should restart the automatic Checkpoint, isn't it? 2. Maybe this should have been the first question: what could be the reasons for the initial file growth (up to 12-15 GB)? There is no out of the ordinary activity on the database at the time, no processes failing and rolling back and no backup going on. Could this be also because of the log reader, not being able to keep up with the number of the transactions logged, and holding back the MinLSN? If it is, would the solution be to permanently set the ReadBatchSize to a lower value (like 50 maybe?) and the QueryTimeout to a higher value? |
#3
| |||
| |||
|
#4
| |||
| |||
|
|
Thanks for reply! Simple recovery model is not my decision. The "business" is happy with a possible 24 hours data lost. However, for the replication this is a good thing since this keeps the active transaction log small and so the log reader is faster. I guess since I don't keep the log I can't never say what happened initially. Usually this happens over night, and by the time I come to work the log is clear (just a big, almost empty file) and I can shrink it back. So this kind of answers my second question. The only exception was the last time when it grew to 62GB and when it was the un-replicated transactions that was holding back the MinLSN. Setting a smaller value for the readbatch size did solve the problem. With my first question I'm trying to find out how the log reader advances trough the log. I know that the only transactions needed to be replicated where at the beginning of the file, but I could not shrink the log until after an hour when the log reader finished scanning the entire log (without finding any other transaction for repl.) Why didn't the Checkpoint just discard the log since there where no more open or marked for replication transactions? Then the log reader would not have to scan the entire 62GB file. I think I'm slowly answering my self: Checkpoint needs the log reader to advance in order to be shore that those transactions are not needed anymore. If you have any insights about how log reader progresses trough the log and how Checkpoint gets mixed into this cocktail please post them. Thank you. |
#5
| |||
| |||
|
#6
| |||
| |||
|
|
I thought that ReadBatchSize = 500 means 500 commands (rows) that the log reader needs to read. But you say that it means 500 transactions with who knows how man rows, it could be thousands or even hundreds of thousands of rows. If this is so than I can't really have a tight control over the amount of data the log reader has to read in one shot. It depends on how big the transactions are; I can keep asking the developers to keep the transactions small, but I could never be sure. Looking at the replication statistics I can see an average of 5000 commands per transaction for the log reader and an average of 8000 commands per transaction for the distribution agent. Does this mean that the log reader had to read 500x5000=2.5 millions rows in one shot? Isn't this a big number? By the way, why is it that the statistics for the log reader and distribution agents are never the same? For instance, when I had the replication failure only the log reader agent failed, the distribution agent was reporting that there are no more transactions to distribute. After I finally managed to read the log file the log reader agent showed 20 transactions and 40 commands but the distribution agent showed 957 transactions (the same number that sp_repltrans returned) and 1914 commands!?!? Thank you very much for the detailed description of the log reader methodology. I know now why I always see at least one record for each publication in the MSrepl_transactios table. I have 114 vlfs for a 4 GB transaction log file. Do you think I should chop it to, say, 2 MB to clear most of the vlfs and then increase the size manually in one shot to 4 GB and then SQL will only create 10-20 big vlfs? |
![]() |
| Thread Tools | |
| Display Modes | |
| |