dbTalk Databases Forums  

indexes vs sorted files in range and equality search

comp.databases.theory comp.databases.theory


Discuss indexes vs sorted files in range and equality search in the comp.databases.theory forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
xareon@gmail.com
 
Posts: n/a

Default indexes vs sorted files in range and equality search - 12-06-2007 , 09:47 AM






hi all, i got a question about the use of indexes and sorted files in
range and equality searches. like you'll sure know

B: number of data pages
R: number of records per page
D: (average) time to read or write disk page

and let's suppose C (data matching time) and H (hashing time) as
unimportant.


theory shows us those simple values (let's take F>2):


sorted files: equality search: Dlog2(B) / range search: Dlog2B +
matched pages

unclustered tree index: equality search: D(1+logF(1.5B)) / range
search: DlogF(1.5B) + matched *records*

clustered tree index: equality search: DogF(1.5B) / range search:
DlogF(1.5B) + matched pages

hash index: equality search: 2D / range search: BD

so, i'd say, about a range search: clustered tree indexes are the
best, then we have sorted files, then unclustered tree indexes
(although they got a faster search in the tree because of the logF
performance over the sorted log2 one, they require many I/O operations
on different pages, whose time cost overwhelms the such logF
benefits). the worst are hash indexes, totally useless for ranged
searches.

about an equality search, hash indexes are the best. then, we have
clustered tree indexes (*STRICTLY* better than sorted files *only*
because the logF trend over log2 one?), sorted files and unclustered
tree indexes, being the worst because if an equality search has many
relevant records, then its analysis becomes similar to the range
search one, which shows that many I/O operations (one for record) for
such index are required, since records could never be on the same
page.


is my analisys basically cannot-deny-that-so-oh-my-god-accomplished-
correct?

Reply With Quote
  #2  
Old   
xareon@gmail.com
 
Posts: n/a

Default Re: indexes vs sorted files in range and equality search - 12-06-2007 , 10:45 AM






On 6 Dic, 17:21, paul c <toledobythe... (AT) ooyah (DOT) ac> wrote:
Quote:
xar... (AT) gmail (DOT) com wrote:

...

is my analisys basically cannot-deny-that-so-oh-my-god-accomplished-
correct?

For those conclusions, it depends on the data, I'd say you need more
assumptions, eg., whether the result sets are singletons, whether joins
are not involved, whether some columns are very few-valued, also maybe
why a zero-time hash is possible for the data. Might be better to try
to characterize the cost for each returned "record", then it might be
clearer what the total cost of the query is.
thank you for answering. so, you're saying that, for instance, the
effective choice of sorted files over unclustered indexes in range
search operations depends on wheter or not the result sets will be
singleton. if so, than there'll be no need to choose sorted files,
otherwise, i'll avoid unclustered indexes because of the possibly
massive I/O operations number. well i could imagine that in a real
life db scenarios, no 100% sure conclusion can be taken, since terms
in the mathematical analysis can change greatly, depending on the
data. so my question could shift into another one: do in common
scenarios equality searches on a field usually return resultset so
large that sorted files are better than unclustered indexed ones?
well, to be honest, i don't expect anyone answers me this statistical-
oriented question in this .theory group, however let me thank you to
confirm me my initial (although not shown in the first post) thought:
theorical analysis can seldom be useful a priori, without pointing out
the data instance the question refers to.

anyone has other ideas about?


Reply With Quote
  #3  
Old   
Jonathan Leffler
 
Posts: n/a

Default Re: indexes vs sorted files in range and equality search - 12-06-2007 , 05:20 PM



xareon (AT) gmail (DOT) com wrote:
Quote:
hi all, i got a question about the use of indexes and sorted files in
range and equality searches. like you'll sure know

B: number of data pages
R: number of records per page
D: (average) time to read or write disk page

and let's suppose C (data matching time) and H (hashing time) as
unimportant.


theory shows us those simple values (let's take F>2):
At this point, F is undefined, so it is hard to know what the
significance of F>2 is.

Quote:
sorted files: equality search: Dlog2(B) / range search: Dlog2B +
matched pages
I'm not clear what your notation means. I think you've compressed two
expressions onto one line:

Sorted files:
Equality search time: D.log2(B)
Range search time: D.log2(B) + matched pages

The slash is just a separator, not a division operator?

You seem to be hypothesizing random access by pages, and using a binary
search algorithm? And what do you do with a data set that consists
of one record with key 1, 2 million records with key 2, and one record
with key 3, (with key values key 1 < key 2 < key 3) and you do an
equality search for key 3? Are you looking for all records or just a
representative record?

How is your range defined? If it is defined as KL (low key) to KH (high
key), you could do a (binary) search for KL and then continue reading
sequentially until you reach a value beyond KH (or reach KH if your
range excluded the upper bound).

Quote:
unclustered tree index: equality search: D(1+logF(1.5B)) / range
search: DlogF(1.5B) + matched *records*
We've got an undefined term in here - so the result is undefined, is it not?

In your terminology, what constitutes an unclustered tree index? One
where the index is ordered (of course) but the data rows corresponding
to the indexed values are not necessarily in the order of the index?

If so, then (give or take the caveats about duplicate versus unique keys
mentioned before), the cost of the index searching depends on the depth
of the index (number of index pages that must be read to find the leaf
pages) plus the cost to read however many rows are matched by the
equality or range. Depending on how many rows there are to read and on
caching (or not) of the data pages, your cost is difficult to calculate;
you need to give a lot more assumptions. An upper bound is one disk
page read per row retrieved - but most systems would hope to reduce that
cost by caching etc.

Quote:
clustered tree index: equality search: DogF(1.5B) / range search:
DlogF(1.5B) + matched pages
Is your clustered index one where the rows corresponding to adjacent
keys in index order are themselves on adjacent (or even the same) page?

Quote:
hash index: equality search: 2D / range search: BD
How is your range search defined again? If you are searching for values
between KL and KH (which are distinct), then you have to do N = (KH - KL
+ 1) hash computations, one for each value that exists between KL and
KH. For integers, that might not be too bad; for strings, it might be
catastrophic. If, on the other hand, your 'range search' is for 'all
values matching the same single key value', then you have a single
access to get to the head of the list, plus a scan of some sort for the
values that hash to the same value (not all of which need be the value
you are looking for, of course -- hashes have collisions).

Quote:
so, i'd say, about a range search: clustered tree indexes are the
best, then we have sorted files, then unclustered tree indexes
(although they got a faster search in the tree because of the logF
performance over the sorted log2 one, they require many I/O operations
on different pages, whose time cost overwhelms the such logF
benefits). the worst are hash indexes, totally useless for ranged
searches.

about an equality search, hash indexes are the best. then, we have
clustered tree indexes (*STRICTLY* better than sorted files *only*
because the logF trend over log2 one?), sorted files and unclustered
tree indexes, being the worst because if an equality search has many
relevant records, then its analysis becomes similar to the range
search one, which shows that many I/O operations (one for record) for
such index are required, since records could never be on the same
page.


is my analisys basically cannot-deny-that-so-oh-my-god-accomplished-
correct?
No. Your analysis between sorted files and unclustered indexes has not
taken into account size of record compared with size of key. Suppose
you have a key size of 8 bytes and a record size of 256 bytes, and a
page size of 1024 bytes. Suppose that pages are identified by an 8 byte
quantity. And suppose the index is a unique index. You can store 64
key+value quantities on an index page, so even with an equality search,
you'll read many fewer index pages with an unclustered index (and one
random data page) compared with just 4 records per page in the sorted
file. If your data set is any size at all, you'll need to read many
more data pages in your sorted file to get to the relevant page than you
will with an un-clustered index. Suppose there are 4096 records, and
the index is perfectly constructed. Then you'll need 2 index page reads
plus 1 data page read to get any row back. By contrast, on average,
you'll need to read an average of 10 data pages in a sorted file. If
I'd increased the number of records, the comparison would be more
skewed; a quarter of a million records would require 3 or maybe 4 page
reads via an index, compared with 16 reads in a sorted file. There are
some sweeping approximations in that - like ignoring page overhead and
identifying leaf nodes from interior nodes in the index and so on. But
the calculation is indicative of the difference. And the difference
between a clustered index and an unclustered index - under the
assumption that the difference in definition is whether the rows are in
the same order as the keys - is not that huge (especially when you
consider the cost of updating a clustered index if a new value has to be
inserted into the middle of the sorted order - or, worse, at the
beginning of the sorted order; then you might have to rewrite a lot of
data to keep everything in clustered - sorted - order!).

Also, with multi-level indexes and any caching, the upper levels of the
tree are in cache (unless your cache is absurdly small). This gives
indexes a bigger advantage.

If I had to characterize them, I'd say hash indexes are great for
equality searches and lousy for anything else. Tree indexes are not bad
for equality searches (though not as fast as a hash), but much better
than hash for range searching. That's why B+Trees are so widely used in
commercial DBMS.

--
Jonathan Leffler #include <disclaimer.h>
Email: jleffler (AT) earthlink (DOT) net, jleffler (AT) us (DOT) ibm.com
Guardian of DBD::Informix v2007.0914 -- http://dbi.perl.org/

publictimestamp.org/ptb/PTB-1963 ripemd128 2007-12-06 21:00:03
A7304ADE234A191551AE2987CF841EFD


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.