Secondary database performance - 04-27-2006, 12:37 PM
I'm currently evaluating several DBMSs for a near real-time
application. Berkeley DB is one of the contenders. Originally, I
expected it to outperform all the other competitors due to its comparably
small overhead, but to my surprise it performed really badly in one of
the tests. I'll describe what kind of test I ran, and I hope you may be
able to point out the root cause of such slow execution.
I'm running Fedora Core 5 on a dual Pentium 4 machine at 2.80 GHz. I create the
primary and all the secondary databases on disk. The system is only
lightly loaded; it is my regular development workstation.
The primary table has two "long long int" fields and four "int" fields.
The first "long long" is the key in the primary database, while the
remaining 5 fields are the keys of the secondary databases. The
secondaries have the DB_DUPSORT flag set. I insert 1 million records into the primary
table assigning the value of the current record number to every field.
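In case it helps, here is a minimal sketch of the record layout and of one
secondary key extractor callback (the names are just illustrative; the real
code is in the snippet linked at the end of this post):

#include <db_cxx.h>

// Record layout described above: one 64-bit primary key, then one more
// 64-bit field and four 32-bit fields, each indexed by its own secondary.
struct Record {
    long long pkey;            // key of the primary database
    long long f1;              // secondary key 1
    int       f2, f3, f4, f5;  // secondary keys 2..5
};

// Key extractor for the first secondary index, passed to Db::associate();
// there is one such callback per secondary database.
int extract_f1(Db *, const Dbt *, const Dbt *pdata, Dbt *skey)
{
    Record *rec = static_cast<Record *>(pdata->get_data());
    skey->set_data(&rec->f1);
    skey->set_size(sizeof(rec->f1));
    return 0;
}

Each of the five secondaries has its own extractor and its own
Db::associate() call against the primary.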
In this setup the whole procedure takes about 16.5 minutes! Here is a
rough estimate provided by the "time" command:
As you can see, over 14 minutes of that is spent waiting on something; I
suppose it is disk I/O.
If I turn on the built-in cache or change flags on the secondary
tables, it does not seem to have any dramatic effect on the numbers.
If I turn off the secondary databases, I can have it finished in about 35
seconds.
So, I suspect there is some inefficiency related to how secondary
databases are processed. Or maybe I'm overlooking something crucial?
The competitors beat these numbers badly: about 32 seconds vs. 16.5 MINUTES
for 6 keys, and 16 seconds vs. 36 seconds with the primary key only.
The Berkeley DB version I'm using is 4.4.20. The test code snippet (C++)
can be found here:
MyDB class definition is identical to what is used in
"examples_cxx/getting_started" of the original Berkeley DB source tree.
Re: Secondary database performance - 04-27-2006, 07:56 PM
The problem with your test is that you're inserting keys in essentially
random order, from Berkeley DB's point of view. That's because your
keys are little-endian integers, which don't sort in the obvious way
when viewed as bit strings (which is what Berkeley DB does by default):
for example, the little-endian bytes of 256 (00 01 00 00) compare less
than those of 1 (01 00 00 00), so consecutive integers are scattered
across the tree instead of being appended at the end.
Please see question #5 on this page for more information:
By adding btree and duplicate comparison callbacks to your code that
compare the integers correctly (code is given on the "Btree comparison"
page linked from the above FAQ), I see about a 10x performance
improvement in your test (to over 30,000 inserts / second, once you
take into account that you are updating 6 databases for each insert).
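To give a concrete idea, the callbacks are along these lines (just a
sketch against the three-argument callback signature of the 4.4-era C++
API; the canonical version is on the FAQ page):

#include <db_cxx.h>
#include <cstring>

// Compare keys as native integers rather than as byte strings.
// memcpy is used because the Dbt data is not guaranteed to be aligned.
int compare_int(Db *, const Dbt *a, const Dbt *b)
{
    int ai, bi;
    std::memcpy(&ai, a->get_data(), sizeof(ai));
    std::memcpy(&bi, b->get_data(), sizeof(bi));
    return (ai < bi) ? -1 : (ai > bi) ? 1 : 0;
}

int compare_llong(Db *, const Dbt *a, const Dbt *b)
{
    long long ai, bi;
    std::memcpy(&ai, a->get_data(), sizeof(ai));
    std::memcpy(&bi, b->get_data(), sizeof(bi));
    return (ai < bi) ? -1 : (ai > bi) ? 1 : 0;
}

They are registered with Db::set_bt_compare() before Db::open(); since the
secondaries use DB_DUPSORT, they also get Db::set_dup_compare() so that the
duplicate data items (the long long primary keys) are ordered the same way.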
Re: Secondary database performance - 05-06-2006, 02:06 PM
I saw your message on the Berkeley DB newsgroup.
I am curious, did the recommendations from Michael help you improve performance?
If so, how does BDB compare now against the other DBs?
Re: Secondary database performance - 05-07-2006, 03:14 AM
Yes, Michael's recommendations did help. My test was able to finish in
about 27 seconds, versus the roughly 1000 seconds it took before I
provided my own comparison functions.
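For anyone reading this later, the change boils down to registering the
callbacks on every database handle before Db::open(); roughly (the handle
names are just placeholders for what my test uses):

// Primary key is a long long.
primaryDb.set_bt_compare(compare_llong);

// Secondaries: compare the secondary key by its type and, since they are
// opened with DB_DUPSORT, order the duplicate data items (the long long
// primary keys) the same way.
secLLongDb.set_flags(DB_DUPSORT);
secLLongDb.set_bt_compare(compare_llong);
secLLongDb.set_dup_compare(compare_llong);

secIntDb.set_flags(DB_DUPSORT);
secIntDb.set_bt_compare(compare_int);
secIntDb.set_dup_compare(compare_llong);
// ...and likewise for the remaining three int-keyed secondaries.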
I have to admit that this potential performance problem is described in
the BDB manual; however, it is only a very short subsection at the end
of the document, and it does not emphasize the importance of the issue
with even rough, illustrative numbers.