Re: Btree insertion performance -
05-19-2006
, 02:21 AM
A few month ago, I studied the insertion performance of a BerkeleyDB
BTree table (without transactions): Our goal was to build a large data
base rather quickly for later read lookups.
The results were rather surprizing --- but fit the theory once I
understood the work that the hard drive has to do when writing data.
The bottom line: Your performance depends on how much you cooperate
with the disk. BerkeleyDB is doing its best to help you.
I got a clear answer whether it is advisable to split one large BTree
into many separate trees.
**** Slow down of random inserts ****
A tiny BTree grows by about 20000 entries per second (it all depends on
the size of the entries, the used system etc. The numbers are only to
illustrate the slowdowns later in the document). This speed decreases
slowly.
At some point, I observed a step down to about 10000, while the
continuous slowdown continues. We observe several such steps.
The performance decrease feels rather linear (double table size ~
double access time) than logarithmic. After a few hours of writing
data, we get as low as 200 entries per second.
Of course, we wanted to know how this fits with the theory --- and how
we could build the table much quicker than that. Read on...
**** Localized inserts ****
Next, I wrote data with ordered keys and got a sustained rate of about
7000 inserts per second.
For the next ---most interesting--- experiment, I divided the key into
two parts: a prefix and a suffix. In an application, the prefix could
be the name of a provider and the suffix could identify one of the many
products of that provider. Because of lexicographical ordering, all
the records of one provider (same prefix) are close to each other in
the BTree.
If I chose for each insert a random prefix and a random suffix, I get
the same results as described above down to 200 inserts per second.
With such a large table, I changed the access pattern: I insert
batches of 10000 entries with fixed prefix. In practical terms, the
program would process one provider and insert abuot 10000 products of
this provider. After finishing with one provider, it would switch over
to another ramdomly chosen provider and insert 10000 products for him.
And so on.
In the same table that allows only 200 random inserts per second, I got
a sustained rate of 7000 ***localized*** inserts per second. I was
able to build a huge data base in this fashion.
**** The theory that explains this ****
Qualitatively, these effects can be explained by the three levels that
slow down data base insters:
* First level: memory access.
As long as your data fits into memory, you can write 20000 entries per
second. For us, this number and the details are not relevant because
we want to write data to the disk in any case.
* Second level: disk bandwidth.
Once the system has to write data to the disk, it is limited by how
many data can be written to the disk. Under good conditions, I get
7000 entries per second.
This explains a sudden decrease in performance one the data base
exceeds the memory cache size. One should not be concerned about this
step down: Any numbers before were not real (for the task of creating
a data base on the disk).
This also explains some later jumps in performance and a rather linear
than logarithmic performance decrease for some time:
In a perfect implementation of a BTree, (for random inserts) higher
levels of the tree would stay in memory and lower levels and leaves
would be written to disk. As the data base grows while the cache size
stays constant, the proportion of the tree ondisk grows --- and with it
the proportion of slow accesses. The number of access operations grows
logarithmically. But the proportion of slow disk operations versus
fast RAM operation grows almost linearly. Also, as the tree starts to
add another level of nodes on the disk, this proportion jumps.
(Mathematically, this is true only while a substantial proportion of
the higher tree levels is in the cache. After that, the access time
will grow rather logarithmically. But there is a third level...)
* Third level: Disk seek time
There is a third level of drastic performance loss in the disk: As the
data base grows, it occupies more space on the disk. Once the write
head has to move large distances, the maximal disk bandwitdth becomes
irrelevant. Instead, the number of jumps from one track to another and
the distance of these jumps determine the performance of the disk
drive. At this level, the average distance of a jump is proportional
to the number of tracks covered by the data base ---- which is again
proportional to the size of the data base. Hence, performance
decreases again linearly.
*** Bottom line for random writes ***
The above mentioned performance reductions are a result of the bare
hardware performance of disk drives. A perfect data base
implementation will show the bare disk performance.
*** Localized writes ***
The picture is as bad only for random writes. As soon as you focus on
some smaller set of data (say, a few thousand very popular entries
within a huge data base or a few thousand products that fall all into
the same sub-tree), a good data base implementation will keep all
relevant information in memory and give you the performance for this
small data set (ignoring the huge part of the data base that is not
accessed right now).
This is happening with BerkeleyDB: If you fix a prefix of the tree,
you effectively work on a small sub-data base.
*** Separating a large data base into small pieces ***
According to this experiments, should I split a huge data base into
10000 smaller data bases to get the performance of smaller data bases?
The answer is: No. Instead think about dividing your work into 10000
smaller pieces and focus on one piece at the time. Make sure that the
prefix of your BTree keys corresponds to the batch (slowly changing),
while the suffix is doing the work (quickly changing).
As a result, BerkeleyDB will arrange it's work at least as efficiently
as it would with 10000 smaller data bases.
This is true only for BTree storage. Hashes don't show a performance
gain by focussing on small batches.
If you really need random access to the whole huge data base, you are
limited by the disk performance ---- again, BerkeleyDB seems to provide
this available disk performance.
*** OS/FS ***
In our experiments, we saw noticable differences between ReiserFS on
Linux and a FreeBSD server.
*** Transactions ***
If you enable transactions and logging for durability, you need to
switch off the disk cache for the log file partition. My first test
started at about 100 transactions per second, even for tiny data bases,
without noticable differences between the Linux and the FreeBSD server.
Bottom line: Don't assume that the above analysis applies to a
transactional data base with logging. I haven't yet done sufficient
testing for this setting to get a feeling what's important there. |