![]() | |
#171
| ||||
| ||||
|
|
"David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:d7f538e0-f0e3-4b00-9735-a1ba3cdba21c (AT) b38g2000prf (DOT) googlegroups.com... On Aug 30, 9:14 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:6463ace9-eddf-4889-8e65-17d070220a94 (AT) t1g2000pra (DOT) googlegroups.com.... On Aug 29, 7:47 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:f6ef3678-c7e9-4cd2-acaf-13cac28819d6 (AT) a1g2000hsb (DOT) googlegroups.com... [snip] You're oversimplifying. With a stripe size of 64K, it is highly unlikely that a leaf node will span more than one stripe; therefore, it is highly unlikely for every drive to contribute to the reading of every leaf node. I don't see how I'm oversimplifying. My point is that stripes need to be at least as coarse as the DBMS block size. Do you agree? Yes, I think the stripe size should be a multiple of the DBMS block size. The choice of DBMS block size is another question entirely. Also, you appear to be discounting concurrency, and environments where concurrency is important such as typical OLTP environments are where technologies such as elevator seeking are most effective. Concurrency has nothing to do with the fact that if the stripe size is too small the seeking of the drives won't be independent. That wasn't why I brought up concurrency. You dismissed elevator seeking as an optimization mechanism with respect to the number of seeks requiredby focusing on what appeared to be a single query. Ultimately the same number of seeks will be required for a particular query, but when combined with ninety-nine other queries, some of those seeks can be shared with other queries, thus reducing the total number of seeks required to satisfy the hundred. I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. |
|
By the way, Oracle documentation states that an 8K block size is optimal for most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 = 64K. Interestingly, Sql Server uses 8K pages organized into 64K extents, which happens to be the unit of physical storage allocation. Do you know something they don't? Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql Server 7.0 released in 1998. Do you expect that 64k extents are still optimal a decade later given that the product of transfer rate and seek time has been steadily increasing? Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle documentation that cited an 8K block size as being optimal was for their latest version, 11g. I think that 64K extents are still optimal because the technology employed for serializing updates is still locking, and in a concurrent environment with an escalating locking heirachy, a page size or extent size that is too large will cause transactions to block more often. Do you know of a sound and practicable alternative to an escalating locking heirarchy for serializing updates? I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. |
#172
| ||||
| ||||
|
|
"David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:d7f538e0-f0e3-4b00-9735-a1ba3cdba21c (AT) b38g2000prf (DOT) googlegroups.com... On Aug 30, 9:14 am, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:6463ace9-eddf-4889-8e65-17d070220a94 (AT) t1g2000pra (DOT) googlegroups.com.... On Aug 29, 7:47 pm, "Brian Selzer" <br... (AT) selzer-software (DOT) com> wrote: "David BL" <davi... (AT) iinet (DOT) net.au> wrote in message news:f6ef3678-c7e9-4cd2-acaf-13cac28819d6 (AT) a1g2000hsb (DOT) googlegroups.com... [snip] You're oversimplifying. With a stripe size of 64K, it is highly unlikely that a leaf node will span more than one stripe; therefore, it is highly unlikely for every drive to contribute to the reading of every leaf node. I don't see how I'm oversimplifying. My point is that stripes need to be at least as coarse as the DBMS block size. Do you agree? Yes, I think the stripe size should be a multiple of the DBMS block size. The choice of DBMS block size is another question entirely. Also, you appear to be discounting concurrency, and environments where concurrency is important such as typical OLTP environments are where technologies such as elevator seeking are most effective. Concurrency has nothing to do with the fact that if the stripe size is too small the seeking of the drives won't be independent. That wasn't why I brought up concurrency. You dismissed elevator seeking as an optimization mechanism with respect to the number of seeks requiredby focusing on what appeared to be a single query. Ultimately the same number of seeks will be required for a particular query, but when combined with ninety-nine other queries, some of those seeks can be shared with other queries, thus reducing the total number of seeks required to satisfy the hundred. I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. |
|
By the way, Oracle documentation states that an 8K block size is optimal for most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 = 64K. Interestingly, Sql Server uses 8K pages organized into 64K extents, which happens to be the unit of physical storage allocation. Do you know something they don't? Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql Server 7.0 released in 1998. Do you expect that 64k extents are still optimal a decade later given that the product of transfer rate and seek time has been steadily increasing? Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle documentation that cited an 8K block size as being optimal was for their latest version, 11g. I think that 64K extents are still optimal because the technology employed for serializing updates is still locking, and in a concurrent environment with an escalating locking heirachy, a page size or extent size that is too large will cause transactions to block more often. Do you know of a sound and practicable alternative to an escalating locking heirarchy for serializing updates? I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. |
#173
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
#174
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
#175
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
#176
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
#177
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
#178
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
#179
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
#180
| ||||||
| ||||||
|
|
I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect? I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database. |
|
I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks? |
|
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable. I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks. Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks. With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer. |
|
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations. |
|
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head? |
|
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary. |
![]() |
| Thread Tools | |
| Display Modes | |
| |