![]() | |
#201
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
#202
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
#203
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
#204
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
#205
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
#206
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
#207
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
#208
| |||
| |||
|
|
On Apr 28, 9:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:ae011bd0-63c3-4373-ba35-2fe9fe6be331 (AT) 26g2000hsk (DOT) googlegroups.com... On Apr 28, 12:25 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:289d3678-7a00-463e-aa99-33c7944ce68b (AT) 27g2000hsf (DOT) googlegroups.com... On Apr 24, 10:11 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:fd63f466-18f7-4986-b378-f5e9f512bbd8 (AT) r66g2000hsg (DOT) googlegroups.com... On Apr 22, 11:39 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4 (AT) m3g2000hsc (DOT) googlegroups.com... On Apr 22, 6:58 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com wrote: "Christoph Rupp" <cruppst... (AT) gmail (DOT) com> wrote in message news:91140c56-1f05-4b5d-b45f-b34920db2051 (AT) x41g2000hsb (DOT) googlegroups.com... Brian, On Apr 21, 11:00 pm, "Brian Selzer" br... (AT) selzer-software (DOT) com wrote: Why not go with #4: 4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images. For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations. If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted. Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged. Your suggestion brings far more onerous constraints on the disk writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk. This allows the recovery scan to determine where to start replaying the log (in the REDO pass). |
|
In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery. Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written. Is this for MVCC? |
![]() |
| Thread Tools | |
| Display Modes | |
| |