![]() | |
#11
| |||
| |||
|
|
Alex Fraser wrote: "CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message news:41B5A6AC.FCE24212 (AT) yahoo (DOT) com... [snip] [A hash table] Scales just fine until the data base no longer fits in virtual memory. Then the B-tree is far superior. Why does a B-tree become far superior in that case? I'll butt in - If all the data no longer fits into memory, then, with a hash, you have to go and read it from disk. Consider overflow blocks in the hash. You will likely have to read at least 1.5 disk records (with a GREAT hash algorithm), to get the record you want. It is likely, with modern machines, that the whole B-tree can be kept in memory, you look it (the index) up in the B-tree, and then go and read it from disk directly. A single I/O. |
#12
| |||
| |||
|
|
Alex Fraser wrote: "CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message [snip] [A hash table] scales just fine until the data base no longer fits in virtual memory. Then the B-tree is far superior. Why does a B-tree become far superior in that case? Because it does not depend on pointers, which in turn will depend on loading order, and other outside influences. |
|
Sorting is perfectly possible, with virtually no additional space required, with a hash system. See the examples I provide with hashlib. |
|
Also, the hashlib system wastes little space - the fundamental table of pointers is maintained between roughly 45 and 90% full, and non-existant records take no space whatsoever. |
#13
| |||
| |||
|
|
"Nick Landsberg" <SPAMhukolauTRAP (AT) SPAMworldnetTRAP (DOT) att.net> wrote in message news:xcltd.1048486$Gx4.219589 (AT) bgtnsc04-news (DOT) ops.worldnet.att.net... Alex Fraser wrote: "CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message news:41B5A6AC.FCE24212 (AT) yahoo (DOT) com... [snip] [A hash table] Scales just fine until the data base no longer fits in virtual memory. Then the B-tree is far superior. Why does a B-tree become far superior in that case? I'll butt in - If all the data no longer fits into memory, then, with a hash, you have to go and read it from disk. Consider overflow blocks in the hash. You will likely have to read at least 1.5 disk records (with a GREAT hash algorithm), to get the record you want. It is likely, with modern machines, that the whole B-tree can be kept in memory, you look it (the index) up in the B-tree, and then go and read it from disk directly. A single I/O. If you store the hash value and record number in an in-memory hash table, the chances of needing to read in anything other than the desired record should be very low. Or am I missing your point? |
|
Move that hash table to disc, and use sequential probing, and most of the time you'll need two reads. The biggest problem with this is delete operations, or rather, minimising the expense of rebuilding the hash table when there have been many delete operations. |
|
Alex |
#14
| |||
| |||
|
|
.... snip ... This depends upon your hash function and how disk is accessed. If you use a minimalist approach, then your hash function points to a logical file sector on disk where the data record can be found, or where pointers to overflow sectors can be found. I have never considered rebuilding the hash table on the fly when the number of records changed substantilly, but will give it some thought. I wonder what the tradeoffs are. Would you have to relocate records on the disk? This would seem to be counter-productive. Don't get me wrong, please. I a proponent of hash tables, but IFF you know what the heck you're doing ahead of time, e.g. expected number of records to set the table size. |
#15
| |||
| |||
|
|
Nick Landsberg wrote: ... snip ... This depends upon your hash function and how disk is accessed. If you use a minimalist approach, then your hash function points to a logical file sector on disk where the data record can be found, or where pointers to overflow sectors can be found. I have never considered rebuilding the hash table on the fly when the number of records changed substantilly, but will give it some thought. I wonder what the tradeoffs are. Would you have to relocate records on the disk? This would seem to be counter-productive. Don't get me wrong, please. I a proponent of hash tables, but IFF you know what the heck you're doing ahead of time, e.g. expected number of records to set the table size. Try out my hashlib, on my site below, download section. It is instrumented so you can tell the average probes per search, etc, and handles automatic expansion. |
#16
| |||
| |||
|
|
Alex Fraser wrote: [snip] If you store the hash value and record number in an in-memory hash table, the chances of needing to read in anything other than the desired record should be very low. Or am I missing your point? From the heuristics which I remember from many years ago, the average "probes per search" for a good generic hash function was about 1.5 . |
|
Which translates to about 1.5 disk reads per hash, if the hash table is in memory and the actual data is on disk. |
|
I have never considered rebuilding the hash table on the fly when the number of records changed substantilly, but will give it some thought. I wonder what the tradeoffs are. Would you have to relocate records on the disk? This would seem to be counter-productive. |
#17
| |||
| |||
|
|
.... snip ... I read somewhere on the web about a hash table implementation which sought to reduce the pain of rebuilding. Unfortunately, I don't remember how it worked and I can't find it now, but I wonder if it could be applied to this situation. |
#18
| |||
| |||
|
|
Alex Fraser wrote: ... snip ... I read somewhere on the web about a hash table implementation which sought to reduce the pain of rebuilding. Unfortunately, I don't remember how it worked and I can't find it now, but I wonder if it could be applied to this situation. One such is my hashlib, available on the download section of my page. It doesn't reduce, it eliminates. |
#19
| |||
| |||
|
|
"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message Alex Fraser wrote: ... snip ... I read somewhere on the web about a hash table implementation which sought to reduce the pain of rebuilding. Unfortunately, I don't remember how it worked and I can't find it now, but I wonder if it could be applied to this situation. One such is my hashlib, available on the download section of my page. It doesn't reduce, it eliminates. The pain I referred to was time complexity: rebuilding the table by adding each entry to a new hash table then discarding the old table (the obvious algorithm, as used in hashlib) is O(n). |
|
If there are a large number of inserts and deletes (balanced, so the number of entries is constant in the long term) the table will typically become cluttered with deleted entries, and performance will suffer unless the table is (frequently) rebuilt. |
|
If the number of entries is also large, this can result in a significant proportion of the time being used for rebuilding. Even if the average performance of operations is excellent, ie O(1), rebuilding will give some probably undesirable "bumps" in performance. These are the problems addressed by what I mentioned I read. |
![]() |
| Thread Tools | |
| Display Modes | |
| |