dbTalk Databases Forums  

Hash-table versus B-Tree

comp.database comp.database


Discuss Hash-table versus B-Tree in the comp.database forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
KP Bhat
 
Posts: n/a

Default Hash-table versus B-Tree - 12-07-2004 , 01:17 AM






I am trying to compare and contrast the use of Hash-table and B-Trees
as indexing mechanisms for record-based files. Here's what I have
got so far. Kindly suggest additions/modifications.

Hash-table
1. Simple indexing scheme
2. Insert and delete operations very efficient
3. Retrieve operation very efficient
4. Need to provide a hashing function that hashes record keys into some
number
5. Supporting GetNext and GetBulk operations quite expensive
6. Does not scale that well as the size of the data increases

B-Tree
1. Complex indexing scheme
2. Insert and delete operations slightly expensive due to need to
retain the B-Tree property (i.e. splitting/joining nodes)
3. Retrieve operation very efficient
4. No need to provide a hashing function
5. Supporting the GetNext and GetBulk operations very efficient
6. Scales extremely well as the size of the data increases

Regards,
KP Bhat


[My apologies in advance if this query is OT for any of the posted
groups]


Reply With Quote
  #2  
Old   
kar1107@gmail.com
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 01:40 AM






I assume by B-Tree we also include balanced trees like AVL.
Hash collision is handled thru a linked list.

KP Bhat wrote:
Quote:
I am trying to compare and contrast the use of Hash-table and B-Trees
as indexing mechanisms for record-based files. Here's what I have
got so far. Kindly suggest additions/modifications.

Hash-table
1. Simple indexing scheme
2. Insert and delete operations very efficient
Delete is not efficient. A potentially long linked list need to be
traversed in case of collisions.

Quote:
3. Retrieve operation very efficient
4. Need to provide a hashing function that hashes record keys into
some
number
5. Supporting GetNext and GetBulk operations quite expensive
6. Does not scale that well as the size of the data increases
7. Collision handling can lead to poor searches ("retrieve operation").
The complexity is O(n) for search and delete.

Quote:
B-Tree
1. Complex indexing scheme
2. Insert and delete operations slightly expensive due to need to
retain the B-Tree property (i.e. splitting/joining nodes)
3. Retrieve operation very efficient
4. No need to provide a hashing function
5. Supporting the GetNext and GetBulk operations very efficient
6. Scales extremely well as the size of the data increases

7. search, insert, delete always O(log n) (thus supports #6 above)

Once a solid library is implemented for the tree, its good to go
with tree. You never know you may have to handle large data in future.

Karthik

Quote:
Regards,
KP Bhat


[My apologies in advance if this query is OT for any of the posted
groups]


Reply With Quote
  #3  
Old   
CBFalconer
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 08:00 AM



"kar1107 (AT) gmail (DOT) com" wrote:
Quote:
KP Bhat wrote:

I assume by B-Tree we also include balanced trees like AVL.
Hash collision is handled thru a linked list.

I am trying to compare and contrast the use of Hash-table and
B-Trees as indexing mechanisms for record-based files. Here's
what I have got so far. Kindly suggest additions/modifications.

Hash-table
1. Simple indexing scheme
2. Insert and delete operations very efficient

Delete is not efficient. A potentially long linked list need to
be traversed in case of collisions.
Not necessarily so. See hashlib, reference later. No linked list
is involved, and the operation is O(1).

Quote:
3. Retrieve operation very efficient
4. Need to provide a hashing function that hashes record keys
into some number
This is usually a trivial operation. The trick is to supply a hash
function suitable for the data involved, and good systems can be
quite forgiving. The penalty for a bad function is poorer
performance.

Quote:
5. Supporting GetNext and GetBulk operations quite expensive
Depends on your definition of these. I have no idea what you have
in mind.

Quote:
6. Does not scale that well as the size of the data increases
Scales just fine until the data base no longer fits in virtual
memory. Then the B-tree is far superior.

Quote:
7. Collision handling can lead to poor searches ("retrieve
operation"). The complexity is O(n) for search and delete.
Not so. Again, see the hashlib reference at end. Search and
delete times are O(1).

Quote:

B-Tree
1. Complex indexing scheme
2. Insert and delete operations slightly expensive due to need
to retain the B-Tree property (i.e. splitting/joining nodes)
3. Retrieve operation very efficient
4. No need to provide a hashing function
5. Supporting the GetNext and GetBulk operations very efficient
6. Scales extremely well as the size of the data increases

7. search, insert, delete always O(log n) (thus supports #6 above)

Once a solid library is implemented for the tree, its good to go
with tree. You never know you may have to handle large data in
future.
You can conveniently examine the characteristics of hash systems
via hashlib, which is available under GPL at:

<http://cbfalconer.home.att.net/download/hashlib.zip>

--
Chuck F (cbfalconer (AT) yahoo (DOT) com) (cbfalconer (AT) worldnet (DOT) att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!




Reply With Quote
  #4  
Old   
Alex Fraser
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 10:42 AM



"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote

Quote:
"kar1107 (AT) gmail (DOT) com" wrote:
KP Bhat wrote:

I assume by B-Tree we also include balanced trees like AVL.
Hash collision is handled thru a linked list.

I am trying to compare and contrast the use of Hash-table and
B-Trees as indexing mechanisms for record-based files. Here's
what I have got so far. Kindly suggest additions/modifications.

Hash-table
[snip]
6. Does not scale that well as the size of the data increases

Scales just fine until the data base no longer fits in virtual
memory. Then the B-tree is far superior.
Why does a B-tree become far superior in that case?

There are only two fundamental advantages to a B-tree that I can think of:
records are effectively sorted, and there's no wasted space. If neither of
these are important then AFAICS a hash table will offer greater performance,
greater scalability (ignoring the space issue), and probably a simpler
implementation. Conversely, if the fact that records are sorted _is_
important - and I think it often is - then a hash table is simply not
practical.

Alex




Reply With Quote
  #5  
Old   
Nick Landsberg
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 11:51 AM



Alex Fraser wrote:

Quote:
"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message
news:41B5A6AC.FCE24212 (AT) yahoo (DOT) com...

"kar1107 (AT) gmail (DOT) com" wrote:

KP Bhat wrote:

I assume by B-Tree we also include balanced trees like AVL.
Hash collision is handled thru a linked list.


I am trying to compare and contrast the use of Hash-table and
B-Trees as indexing mechanisms for record-based files. Here's
what I have got so far. Kindly suggest additions/modifications.

Hash-table

[snip]

6. Does not scale that well as the size of the data increases

Scales just fine until the data base no longer fits in virtual
memory. Then the B-tree is far superior.


Why does a B-tree become far superior in that case?
I'll butt in -

If all the data no longer fits into memory, then, with
a hash, you have to go and read it from disk. Consider
overflow blocks in the hash. You will likely have to
read at least 1.5 disk records (with a GREAT hash algorithm),
to get the record you want.

It is likely, with modern machines, that
the whole B-tree can be kept in memory, you look it
(the index) up in the B-tree, and then go and read it
from disk directly. A single I/O.


Quote:
There are only two fundamental advantages to a B-tree that I can think of:
records are effectively sorted, and there's no wasted space. If neither of
these are important then AFAICS a hash table will offer greater performance,
greater scalability (ignoring the space issue), and probably a simpler
implementation. Conversely, if the fact that records are sorted _is_
important - and I think it often is - then a hash table is simply not
practical.

Alex


I've done timing tests on an in-memory database (name withheld).
Hashing gives about a 10-15% performance improvement over their
tree indices. As you said, if the sorting of record is
important (e.g. "get next" is a common function), the
trees are the way to go. If not, pay your money and take
your choice.

NPL

--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch


Reply With Quote
  #6  
Old   
CBFalconer
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 01:23 PM



Alex Fraser wrote:
Quote:
"CBFalconer" <cbfalconer (AT) yahoo (DOT) com> wrote in message
"kar1107 (AT) gmail (DOT) com" wrote:
KP Bhat wrote:

I assume by B-Tree we also include balanced trees like AVL.
Hash collision is handled thru a linked list.

I am trying to compare and contrast the use of Hash-table and
B-Trees as indexing mechanisms for record-based files. Here's
what I have got so far. Kindly suggest additions/modifications.

Hash-table
[snip]
6. Does not scale that well as the size of the data increases

Scales just fine until the data base no longer fits in virtual
memory. Then the B-tree is far superior.

Why does a B-tree become far superior in that case?
Because it does not depend on pointers, which in turn will depend
on loading order, and other outside influences. It is designed to
operate efficiently with only one record per level in memory.

Quote:
There are only two fundamental advantages to a B-tree that I can
think of: records are effectively sorted, and there's no wasted
space. If neither of these are important then AFAICS a hash table
will offer greater performance, greater scalability (ignoring the
space issue), and probably a simpler implementation. Conversely,
if the fact that records are sorted _is_ important - and I think
it often is - then a hash table is simply not practical.
Sorting is perfectly possible, with virtually no additional space
required, with a hash system. See the examples I provide with
hashlib. A b-tree is intrinsically sorted (more or less) only on
the key. The hashlib system allows you to sort on any field.
Also, the hashlib system wastes little space - the fundamental
table of pointers is maintained between roughly 45 and 90% full,
and non-existant records take no space whatsoever.

It is perfectly possible to maintain a continuously ordered binary
tree within a hash system. This is not the same thing as a sorted
list, however, and will change insertion from O(1) to O(logN).

Access times in a b-tree are O(logN). Most hash table operations
are, or can be, O(1). You have to consider the operations your
system really needs when selecting a method.

--
Chuck F (cbfalconer (AT) yahoo (DOT) com) (cbfalconer (AT) worldnet (DOT) att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!



Reply With Quote
  #7  
Old   
kar1107@gmail.com
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 01:48 PM



I looked in the webpage for hashlib. I am looking for description of
the algorithm (I see mostly implementation/usage docs in the .zip
file).

Can you describe your algorithm?

By O(1) you mean best/common case, right? AFAIK there isn't any
datastructure that can do better than O(log n) search (ie theoretically
you need atleast O(log n) comparisons to zero-in on the item being
searched).

Karthik


Reply With Quote
  #8  
Old   
kar1107@gmail.com
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 01:49 PM



I looked in the webpage for hashlib. I am looking for description of
the algorithm (I see mostly implementation/usage docs in the .zip
file).

Can you describe your algorithm?

By O(1) you mean best/common case, right? AFAIK there isn't any
datastructure that can do better than O(log n) search (ie theoretically
you need atleast O(log n) comparisons to zero-in on the item being
searched).

Karthik


Reply With Quote
  #9  
Old   
kar1107@gmail.com
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 01:49 PM



I looked in the webpage for hashlib. I am looking for description of
the algorithm (I see mostly implementation/usage docs in the .zip
file).

Can you describe your algorithm?

By O(1) you mean best/common case, right? AFAIK there isn't any
datastructure that can do better than O(log n) search (ie theoretically
you need atleast O(log n) comparisons to zero-in on the item being
searched).

Karthik


Reply With Quote
  #10  
Old   
CBFalconer
 
Posts: n/a

Default Re: Hash-table versus B-Tree - 12-07-2004 , 03:33 PM



"kar1107 (AT) gmail (DOT) com" wrote:
Quote:
I looked in the webpage for hashlib. I am looking for description
of the algorithm (I see mostly implementation/usage docs in the
.zip file).

Can you describe your algorithm?

By O(1) you mean best/common case, right? AFAIK there isn't any
datastructure that can do better than O(log n) search (ie
theoretically you need atleast O(log n) comparisons to zero-in
on the item being searched).
One copy of your article will suffice!

It is a completely ordinary rehash method, of which there are many
descriptions in the literature. The expected performance is O(1),
while worst case performance (with poor hash functions) can be as
bad as O(n). One of the test cases checks that the system survives
such mistreatment.

What makes my system unique, IMO, is that I have combined it with
self adjustment for table size, convenient ways to scan the
complete table content, and the ability to delete entries. In
addition the code is re-entrant so that many tables may be
maintained by the same code. It was an outgrowth of a test bed for
hash functions. It showed up the common fault in malloc packages
where free is O(n) in the number of extant blocks, which in turn
triggered my development of nmalloc for DJGPP, without this fault.

--
Chuck F (cbfalconer (AT) yahoo (DOT) com) (cbfalconer (AT) worldnet (DOT) att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!



Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.