dbTalk Databases Forums  

Btree insertion performance

comp.databases.berkeley-db comp.databases.berkeley-db


Discuss Btree insertion performance in the comp.databases.berkeley-db forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Sergey Maslyakov
 
Posts: n/a

Default Btree insertion performance - 05-17-2006 , 03:00 PM






Hi!

My question is a bit more generic but it originates from my stress
testing of BDB 4.4.20 with DB_BTREE tables.

Is the following statement correct? The insertion time of a new unique
and random element into a Btree is in a linear depenedency to the
number of elements in the tree. For instance, it takes 5 seconds to
insert a new unique random element in a tree that has 10 million
elements. Would it be adequate to say that for a tree of 20 million
elements, the same operation will take about 10 seconds?

I ran some tests of the Btree algorithm alone and I'm leaning towards
accepting that as true. Please correct me if I'm wrong!

But if I am correct, I will need to think about splitting one huge
table (30 million rows) into several smaller tables and providing a
load balancing mechanism of some sort.

I know that, theoretically, a lookup operation has logarithmic
computation complexity but I'm trying to estimate the complexity of an
insertion...


Thanks!
/Sergey


Reply With Quote
  #2  
Old   
AT
 
Posts: n/a

Default Re: Btree insertion performance - 05-18-2006 , 04:16 PM






Sergey,

The time to insert a record in to a btree depends on a number of
things. Primarily it will depend on the amount of I/O that needs to be
done to accomplish the insert. If you are using a transactional system
with durable commits, one element will be the time to flush the log to
the disk. The amount of other I/O will depend on the size of the
database relative to the size of the database cache which you have
configured. This will influence the number of levels of the btree
which may be in memory when you do the insert. There is also the hight
of the btree which will effect the number of I/Os necessary. The hight
of the tree does not increase linearly with the number of records. I
would be surprised if it took 5 seconds to insert 1 record. I am also
surprised that your experiments show that there is a linear
relationship between the time to insert and the number of records
already in the tree.

Michael Ubell
Sleepycat Software.


Reply With Quote
  #3  
Old   
Sergey Maslyakov
 
Posts: n/a

Default Re: Btree insertion performance - 05-18-2006 , 06:27 PM



Michael,

I ran the test for Btree algorithm alone, not as a part of BDB index.
That is where I observed the linearity in the growth of the insertion
time. I also used "5 seconds" and "10 seconds" just for illustrational
purposes. These numbers have nothing in common with the real numbers
I'm getting. I apologize for not being clear about that...


Thanks for the explantion!
/Sergey


Reply With Quote
  #4  
Old   
Mike
 
Posts: n/a

Default Re: Btree insertion performance - 05-18-2006 , 07:22 PM



<sergey.maslyakov (AT) exgate (DOT) tek.com> wrote:

Michael,

Does the problem that I describe below look like a double-buffering
issue?

I'm running a test that inserts 1 million rows into a table with a
pseudo-random 8-byte key. Actually, the random part of the key is just
the lower 4 bytes but I use a custom comparison function for
little-endians. I run it on a Fedora Core 5 system with 1Gb of RAM and
a
rather modern SATA harddrive with EXT3 filesystem on top of it.

What I observe is a sudden decrease in performance in the test. The
most
confsing thing is that in two identical runs the problem starts
happening at different time! I remove database file before starting the
application every time. Intermediate sync() calls do not seem to help
much as they take longer themselves every time. I could possibly
increase the amount of cache for the database but I'm trying to keep it
small to mimic the ratio between the size of the dataset and affordable
cache size. But it still does not explain why the performance degrades
after a different number of records have been inserted. With a bigger
cache it just happens later.

<snip>

Sergey,

It is hard to know exactly what is happening. I does sound like some
sort of interaction with the Linux memory system. I am not real
familue with how large ammounts of I/O are processed but a t some point
I would assume that Linux would decide that there is too much dirty
data in memory and start writing it to disk. If you are using
transactions then you will have to write the log to commit the
transaction, having a long queue of database data pages to be written
out will most likely slow down the commit time. Typically high
througput database configurations include a separate disk (often on a
separate cannel) for the log files.

Michael Ubell
Sleepycat Software.


Reply With Quote
  #5  
Old   
yaakov
 
Posts: n/a

Default Re: Btree insertion performance - 05-19-2006 , 02:21 AM



A few month ago, I studied the insertion performance of a BerkeleyDB
BTree table (without transactions): Our goal was to build a large data
base rather quickly for later read lookups.

The results were rather surprizing --- but fit the theory once I
understood the work that the hard drive has to do when writing data.
The bottom line: Your performance depends on how much you cooperate
with the disk. BerkeleyDB is doing its best to help you.

I got a clear answer whether it is advisable to split one large BTree
into many separate trees.

**** Slow down of random inserts ****

A tiny BTree grows by about 20000 entries per second (it all depends on
the size of the entries, the used system etc. The numbers are only to
illustrate the slowdowns later in the document). This speed decreases
slowly.

At some point, I observed a step down to about 10000, while the
continuous slowdown continues. We observe several such steps.

The performance decrease feels rather linear (double table size ~
double access time) than logarithmic. After a few hours of writing
data, we get as low as 200 entries per second.

Of course, we wanted to know how this fits with the theory --- and how
we could build the table much quicker than that. Read on...

**** Localized inserts ****

Next, I wrote data with ordered keys and got a sustained rate of about
7000 inserts per second.

For the next ---most interesting--- experiment, I divided the key into
two parts: a prefix and a suffix. In an application, the prefix could
be the name of a provider and the suffix could identify one of the many
products of that provider. Because of lexicographical ordering, all
the records of one provider (same prefix) are close to each other in
the BTree.

If I chose for each insert a random prefix and a random suffix, I get
the same results as described above down to 200 inserts per second.
With such a large table, I changed the access pattern: I insert
batches of 10000 entries with fixed prefix. In practical terms, the
program would process one provider and insert abuot 10000 products of
this provider. After finishing with one provider, it would switch over
to another ramdomly chosen provider and insert 10000 products for him.
And so on.

In the same table that allows only 200 random inserts per second, I got
a sustained rate of 7000 ***localized*** inserts per second. I was
able to build a huge data base in this fashion.

**** The theory that explains this ****

Qualitatively, these effects can be explained by the three levels that
slow down data base insters:

* First level: memory access.

As long as your data fits into memory, you can write 20000 entries per
second. For us, this number and the details are not relevant because
we want to write data to the disk in any case.

* Second level: disk bandwidth.

Once the system has to write data to the disk, it is limited by how
many data can be written to the disk. Under good conditions, I get
7000 entries per second.

This explains a sudden decrease in performance one the data base
exceeds the memory cache size. One should not be concerned about this
step down: Any numbers before were not real (for the task of creating
a data base on the disk).

This also explains some later jumps in performance and a rather linear
than logarithmic performance decrease for some time:

In a perfect implementation of a BTree, (for random inserts) higher
levels of the tree would stay in memory and lower levels and leaves
would be written to disk. As the data base grows while the cache size
stays constant, the proportion of the tree ondisk grows --- and with it
the proportion of slow accesses. The number of access operations grows
logarithmically. But the proportion of slow disk operations versus
fast RAM operation grows almost linearly. Also, as the tree starts to
add another level of nodes on the disk, this proportion jumps.

(Mathematically, this is true only while a substantial proportion of
the higher tree levels is in the cache. After that, the access time
will grow rather logarithmically. But there is a third level...)

* Third level: Disk seek time

There is a third level of drastic performance loss in the disk: As the
data base grows, it occupies more space on the disk. Once the write
head has to move large distances, the maximal disk bandwitdth becomes
irrelevant. Instead, the number of jumps from one track to another and
the distance of these jumps determine the performance of the disk
drive. At this level, the average distance of a jump is proportional
to the number of tracks covered by the data base ---- which is again
proportional to the size of the data base. Hence, performance
decreases again linearly.

*** Bottom line for random writes ***

The above mentioned performance reductions are a result of the bare
hardware performance of disk drives. A perfect data base
implementation will show the bare disk performance.

*** Localized writes ***

The picture is as bad only for random writes. As soon as you focus on
some smaller set of data (say, a few thousand very popular entries
within a huge data base or a few thousand products that fall all into
the same sub-tree), a good data base implementation will keep all
relevant information in memory and give you the performance for this
small data set (ignoring the huge part of the data base that is not
accessed right now).

This is happening with BerkeleyDB: If you fix a prefix of the tree,
you effectively work on a small sub-data base.


*** Separating a large data base into small pieces ***

According to this experiments, should I split a huge data base into
10000 smaller data bases to get the performance of smaller data bases?

The answer is: No. Instead think about dividing your work into 10000
smaller pieces and focus on one piece at the time. Make sure that the
prefix of your BTree keys corresponds to the batch (slowly changing),
while the suffix is doing the work (quickly changing).

As a result, BerkeleyDB will arrange it's work at least as efficiently
as it would with 10000 smaller data bases.

This is true only for BTree storage. Hashes don't show a performance
gain by focussing on small batches.

If you really need random access to the whole huge data base, you are
limited by the disk performance ---- again, BerkeleyDB seems to provide
this available disk performance.

*** OS/FS ***

In our experiments, we saw noticable differences between ReiserFS on
Linux and a FreeBSD server.

*** Transactions ***

If you enable transactions and logging for durability, you need to
switch off the disk cache for the log file partition. My first test
started at about 100 transactions per second, even for tiny data bases,
without noticable differences between the Linux and the FreeBSD server.

Bottom line: Don't assume that the above analysis applies to a
transactional data base with logging. I haven't yet done sufficient
testing for this setting to get a feeling what's important there.


Reply With Quote
  #6  
Old   
Sergey Maslyakov
 
Posts: n/a

Default Re: Btree insertion performance - 05-19-2006 , 05:19 PM



Yaakov,

Thanks a bunch for sharing your extensive test results! It is quite a
bit of a food for thought.


/Sergey


Reply With Quote
  #7  
Old   
Sergey Maslyakov
 
Posts: n/a

Default Re: Btree insertion performance - 05-19-2006 , 05:29 PM



Michael,

Thanks for your comments! I'll continue my research to find a way
around the problem, I'm experiencing.


/Sergey


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.