dbTalk Databases Forums  

Hash-table versus B-Tree

comp.database comp.database


Discuss Hash-table versus B-Tree in the comp.database forum.



Reply
 
Thread Tools Display Modes
  #11  
Old   
Alex Fraser
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 04:16 PM






"Nick Landsberg" <SPAMhukolauTRAP (AT) SPAMworldnetTRAP (DOT) att.net> wrote

Quote:
Alex Fraser wrote:
"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message
news:41B5A6AC.FCE24212 (AT) yahoo (DOT) com...
[snip]
[A hash table] Scales just fine until the data base no longer fits in
virtual memory. Then the B-tree is far superior.

Why does a B-tree become far superior in that case?

I'll butt in -

If all the data no longer fits into memory, then, with
a hash, you have to go and read it from disk. Consider
overflow blocks in the hash. You will likely have to
read at least 1.5 disk records (with a GREAT hash algorithm),
to get the record you want.

It is likely, with modern machines, that
the whole B-tree can be kept in memory, you look it
(the index) up in the B-tree, and then go and read it
from disk directly. A single I/O.
If you store the hash value and record number in an in-memory hash table,
the chances of needing to read in anything other than the desired record
should be very low. Or am I missing your point?

Move that hash table to disc, and use sequential probing, and most of the
time you'll need two reads. The biggest problem with this is delete
operations, or rather, minimising the expense of rebuilding the hash table
when there have been many delete operations.

Alex




Reply With Quote
  #12  
Old   
Alex Fraser
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 04:16 PM






"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote

Quote:
Alex Fraser wrote:
"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message
[snip]
[A hash table] scales just fine until the data base no longer fits in
virtual memory. Then the B-tree is far superior.

Why does a B-tree become far superior in that case?

Because it does not depend on pointers, which in turn will depend
on loading order, and other outside influences.
Why do you need pointers?

Maybe we have different things in mind. See my other post for what I was
thinking of.

[snip]
Quote:
Sorting is perfectly possible, with virtually no additional space
required, with a hash system. See the examples I provide with
hashlib.
Which example(s)?

Quote:
Also, the hashlib system wastes little space - the fundamental
table of pointers is maintained between roughly 45 and 90% full,
and non-existant records take no space whatsoever.
Well, space for the pointer only.

Alex




Reply With Quote
  #13  
Old   
Nick Landsberg
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 10:13 PM



Alex Fraser wrote:
Quote:
"Nick Landsberg" <SPAMhukolauTRAP (AT) SPAMworldnetTRAP (DOT) att.net> wrote in message
news:xcltd.1048486$Gx4.219589 (AT) bgtnsc04-news (DOT) ops.worldnet.att.net...

Alex Fraser wrote:

"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message
news:41B5A6AC.FCE24212 (AT) yahoo (DOT) com...

[snip]

[A hash table] Scales just fine until the data base no longer fits in
virtual memory. Then the B-tree is far superior.

Why does a B-tree become far superior in that case?

I'll butt in -

If all the data no longer fits into memory, then, with
a hash, you have to go and read it from disk. Consider
overflow blocks in the hash. You will likely have to
read at least 1.5 disk records (with a GREAT hash algorithm),
to get the record you want.

It is likely, with modern machines, that
the whole B-tree can be kept in memory, you look it
(the index) up in the B-tree, and then go and read it
from disk directly. A single I/O.


If you store the hash value and record number in an in-memory hash table,
the chances of needing to read in anything other than the desired record
should be very low. Or am I missing your point?
From the heuristics which I remember from many years ago,
the average "probes per search" for a good generic hash
function was about 1.5 . Which translates to about 1.5
disk reads per hash, if the hash table is in memory
and the actual data is on disk.

Quote:
Move that hash table to disc, and use sequential probing, and most of the
time you'll need two reads. The biggest problem with this is delete
operations, or rather, minimising the expense of rebuilding the hash table
when there have been many delete operations.
This depends upon your hash function and how disk is
accessed. If you use a minimalist approach, then
your hash function points to a logical file sector
on disk where the data record can be found, or
where pointers to overflow sectors can be found.
I have never considered rebuilding the hash table
on the fly when the number of records changed
substantilly, but will give it some thought.
I wonder what the tradeoffs are. Would you have
to relocate records on the disk? This would
seem to be counter-productive.

Don't get me wrong, please. I a proponent of hash
tables, but IFF you know what the heck you're doing
ahead of time, e.g. expected number of records
to set the table size.



Quote:
Alex



--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch


Reply With Quote
  #14  
Old   
CBFalconer
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 11:36 PM



Nick Landsberg wrote:
Quote:
.... snip ...

This depends upon your hash function and how disk is
accessed. If you use a minimalist approach, then
your hash function points to a logical file sector
on disk where the data record can be found, or
where pointers to overflow sectors can be found.
I have never considered rebuilding the hash table
on the fly when the number of records changed
substantilly, but will give it some thought.
I wonder what the tradeoffs are. Would you have
to relocate records on the disk? This would
seem to be counter-productive.

Don't get me wrong, please. I a proponent of hash
tables, but IFF you know what the heck you're doing
ahead of time, e.g. expected number of records
to set the table size.
Try out my hashlib, on my site below, download section. It is
instrumented so you can tell the average probes per search, etc,
and handles automatic expansion.

--
Chuck F (cbfalconer (AT) yahoo (DOT) com) (cbfalconer (AT) worldnet (DOT) att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!



Reply With Quote
  #15  
Old   
Nick Landsberg
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-08-2004 , 08:05 PM



CBFalconer wrote:

Quote:
Nick Landsberg wrote:

... snip ...

This depends upon your hash function and how disk is
accessed. If you use a minimalist approach, then
your hash function points to a logical file sector
on disk where the data record can be found, or
where pointers to overflow sectors can be found.
I have never considered rebuilding the hash table
on the fly when the number of records changed
substantilly, but will give it some thought.
I wonder what the tradeoffs are. Would you have
to relocate records on the disk? This would
seem to be counter-productive.

Don't get me wrong, please. I a proponent of hash
tables, but IFF you know what the heck you're doing
ahead of time, e.g. expected number of records
to set the table size.


Try out my hashlib, on my site below, download section. It is
instrumented so you can tell the average probes per search, etc,
and handles automatic expansion.

I've downloaded it Chuck, and looked at the code, but
work time prohibits me from doing a thorough test.
From what I can see of the code, it's first-class
stuff.

NPL


--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch


Reply With Quote
  #16  
Old   
Alex Fraser
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-10-2004 , 01:47 PM



"Nick Landsberg" <SPAMhukolauTRAP (AT) SPAMworldnetTRAP (DOT) att.net> wrote

Quote:
Alex Fraser wrote:
[snip]
If you store the hash value and record number in an in-memory hash
table, the chances of needing to read in anything other than the
desired record should be very low. Or am I missing your point?

From the heuristics which I remember from many years ago,
the average "probes per search" for a good generic hash
function was about 1.5 .
Obviously, this depends on the hash table loading. I think 1.5 is right for
an optimal hash function and 50% load (using a "closed" table, ie no
chaining).

Quote:
Which translates to about 1.5 disk reads per hash, if the hash table is
in memory and the actual data is on disk.
Not with what I attempted to describe. Let me try to explain more fully.

Suppose you have a (good) hash function which produces 32- or 64-bit hash
values. After computing the hash value for the key of the record you are
(say) searching for, you must somehow reduce it to the size of the table (eg
using the modulus operator), and that is where you start your probe
sequence. But, and this is the main point, you don't throw away the original
hash value. Instead, you compare it to the hash values you cunningly stored
in the table (along with the record number). If the range of hash values is
much larger than the size of the table, then quite often you won't need to
read the records when keys don't match: even if they hash to the same slot,
the hash values will probably differ.

You can alternatively consider the idea of storing the hash value in the
table as as a cache. The property relied upon is that if the hash values of
two keys differ, then the keys must differ. By caching the hash value, you
have a cheaper way to prove inequality (and continue the probe sequence,
without examining the record).

[snip]
Quote:
I have never considered rebuilding the hash table
on the fly when the number of records changed
substantilly, but will give it some thought.
I wonder what the tradeoffs are. Would you have
to relocate records on the disk? This would
seem to be counter-productive.
You can avoid the need to relocate records by using indirection (this
applies if you use a B-tree, too). Basically, you seperate the index (hash
table or B-tree) from the records themselves.

I read somewhere on the web about a hash table implementation which sought
to reduce the pain of rebuilding. Unfortunately, I don't remember how it
worked and I can't find it now, but I wonder if it could be applied to this
situation.

Alex




Reply With Quote
  #17  
Old   
CBFalconer
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-10-2004 , 05:11 PM



Alex Fraser wrote:
Quote:
.... snip ...

I read somewhere on the web about a hash table implementation which
sought to reduce the pain of rebuilding. Unfortunately, I don't
remember how it worked and I can't find it now, but I wonder if it
could be applied to this situation.
One such is my hashlib, available on the download section of my
page. It doesn't reduce, it eliminates.

--
Chuck F (cbfalconer (AT) yahoo (DOT) com) (cbfalconer (AT) worldnet (DOT) att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!




Reply With Quote
  #18  
Old   
Alex Fraser
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-11-2004 , 01:53 AM



"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote

Quote:
Alex Fraser wrote:
... snip ...

I read somewhere on the web about a hash table implementation which
sought to reduce the pain of rebuilding. Unfortunately, I don't
remember how it worked and I can't find it now, but I wonder if it
could be applied to this situation.

One such is my hashlib, available on the download section of my
page. It doesn't reduce, it eliminates.
The pain I referred to was time complexity: rebuilding the table by adding
each entry to a new hash table then discarding the old table (the obvious
algorithm, as used in hashlib) is O(n).

If there are a large number of inserts and deletes (balanced, so the number
of entries is constant in the long term) the table will typically become
cluttered with deleted entries, and performance will suffer unless the table
is (frequently) rebuilt.

If the number of entries is also large, this can result in a significant
proportion of the time being used for rebuilding. Even if the average
performance of operations is excellent, ie O(1), rebuilding will give some
probably undesirable "bumps" in performance. These are the problems
addressed by what I mentioned I read.

Alex




Reply With Quote
  #19  
Old   
CBFalconer
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-11-2004 , 04:02 AM



Alex Fraser wrote:
Quote:
"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message
Alex Fraser wrote:

... snip ...

I read somewhere on the web about a hash table implementation which
sought to reduce the pain of rebuilding. Unfortunately, I don't
remember how it worked and I can't find it now, but I wonder if it
could be applied to this situation.

One such is my hashlib, available on the download section of my
page. It doesn't reduce, it eliminates.

The pain I referred to was time complexity: rebuilding the table by
adding each entry to a new hash table then discarding the old table
(the obvious algorithm, as used in hashlib) is O(n).
However there are savings available in rebuilding versus inserting.

Quote:
If there are a large number of inserts and deletes (balanced, so
the number of entries is constant in the long term) the table will
typically become cluttered with deleted entries, and performance
will suffer unless the table is (frequently) rebuilt.
Hashlib uses the same criterion, approximately 90% full, to trigger
rebuilding. However if deletions can be removed and reach a
different criterion, the table does not expand.

Quote:
If the number of entries is also large, this can result in a
significant proportion of the time being used for rebuilding. Even
if the average performance of operations is excellent, ie O(1),
rebuilding will give some probably undesirable "bumps" in
performance. These are the problems addressed by what I mentioned
I read.
You would think so at first glance. While there are 'bumps' in the
access times, they are much smaller than one would expect
a-priori. At any rate, the system has built in instrumentation, so
you can measure all these things out of the box.

--
Chuck F (cbfalconer (AT) yahoo (DOT) com) (cbfalconer (AT) worldnet (DOT) att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!




Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.