![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
Hi, I use Berkeley-DB to manage more than 250 dbs and every db has the same structure (16 bytes key and 8 bytes data) and almost the same acount of records (about 500,000). These dbs are opened in the same environment with transaction enabled. When initialization, I have 4 threads to retrieve data from them through cursor simultaneously but no two thread will operate on the same db (thread 1 reads db 0, 4, ...; thread 2 reads db 1, 5, ...; etc.). After a while, the cursor read process becomes very slow, it need more than 20 seconds to finish the read from a db. My cache size is 4G and I use the default page size which is 4K. Following is part of my DB_CONFIG: set_cachesize 4 0 4 mutex_set_max 1572864 set_lk_max_lockers 4096 set_lk_max_objects 2048 set_lk_max_locks 8388608 I also used "valgrind --tool=callgrind" to profile my program. I got the following result (I'm not familiar with valgrind so I don't know how to can get a better profiling result): 1,741,812,564 ???:__memp_get_bucket [libdb-4.6.so] 1,727,316,262 ???:__db_tas_mutex_lock [libdb-4.6.so] 959,780,404 ???:__memp_fget [libdb-4.6.so] 888,693,549 ???:__env_alloc [libdb-4.6.so] 882,790,067 ???:__dbc_get [libdb-4.6.so] 798,835,644 ???:__db_tas_mutex_unlock [libdb-4.6.so] 722,079,676 ???:__os_id [libdb-4.6.so] 685,849,568 ???:__ham_func4 [libdb-4.6.so] 681,554,232 ???:__db_retcopy [libdb-4.6.so] 659,823,410 ???:__db_cursor_int [libdb-4.6.so] 526,604,190 ???:__memp_fput [libdb-4.6.so] 505,002,137 // my function 492,475,941 ???:__lock_get_internal [libdb-4.6.so] 455,045,660 ???:__db_ret [libdb-4.6.so] 451,725,119 ???:__db_lget [libdb-4.6.so] 449,867,405 ???:__bamc_next [libdb-4.6.so] 448,777,043 ???:__lock_getobj [libdb-4.6.so] 400,442,895 ???:__dbc_close [libdb-4.6.so] 359,487,207 ???:__dbc_get_arg [libdb-4.6.so] 348,992,625 // ........... 332,184,228 ???:__dbc_idup [libdb-4.6.so] 327,633,228 ???:__bamc_get [libdb-4.6.so] 323,084,790 ???:__lock_put [libdb-4.6.so] 313,985,666 ???:memcpy [/lib/libc-2.3.6.so] 295,780,327 ???:__dbc_get_pp [libdb-4.6.so] 286,679,279 ???:__dbc_cleanup [libdb-4.6.so] 285,409,596 ???:__lock_open [libdb-4.6.so] 284,264,251 ???:__logc_get_int [libdb-4.6.so] 277,579,162 ???:__bamc_close [libdb-4.6.so] 266,029,764 ??? thread_self [/lib/libpthread-2.3.6.so]245,727,648 ???:__lock_get [libdb-4.6.so] 213,873,198 ???:__bamc_refresh [libdb-4.6.so] 180,765,978 ???:__lock_put_internal [libdb-4.6.so] 178,388,632 // ........... 133,158,636 // ........... 128,302,983 // ........... 118,312,090 ???:__dbt_userfree [libdb-4.6.so] 109,718,273 ???:0x000000000006C270 [/lib/libc-2.3.6.so] 107,838,186 ???:__db_check_chksum [libdb-4.6.so] 101,720,408 ???:__db_pthread_mutex_init [libdb-4.6.so] Someone can give me some advice? Thanks a lot! |
#3
| ||||||||
| ||||||||
|
|
Hello, I don't know valgrind either, but I know quite a bit about BDB. You might want to ask your question on the Oracle forums for BDB, in the meantime, here are some thoughts. You say 'After a while, the cursor read process becomes very slow'. I guess that means it is running fast until then? It sounds like you are doing all read operations during this time? |
|
If not, there may be some write operations holding locks - possible contention. If you are doing writing, are you using transactions? |
|
Has your cache been prewarmed - that is, is there reason to believe that the operations before the slowdown are running fast because the cache is already filled with pages from previous accesses? A quick calculation shows that raw data size alone: (16+8) * 500,000 * 250 => 3G. With typical overhead and page fill, the database sizes may be more like 5 or 6 G (BDB generally sacrifices space for speed) so if your testing is 'random' and you are doing complete database scans, it probably won't fit in 4G cache. |
|
If it isn't a cache issue, then maybe there's a resource that is leaking. Are all cursors closed when finished, and temporary memory freed? |
|
You'd probably see process size starting to balloon if that is what is going on. Perhaps there's some other resource contention between the threads. What happens if you run with fewer (or even one) thread? |
|
Answers to any of these could be helpful clues. Another thing to look at is DB statistics. The first one to look at is the output from db_stat -m, which would tell you about the cache, |
|
and db_stat -Co, which can give some clues about contention. Run these after the program slows down. The output can be big, so capture it in a file. |
|
- Don Anderson |
#4
| |||
| |||
|
#5
| |||
| |||
|
|
I noticed a phenomena that, if I stop the program during the initializaiton and start it again, it will be very fast before it reaches where it's stopped last time. After that, the db read becomes slow again. If I clean the evironment (just delete those __db* files), the db read becomes slow very quickly. |
#6
| |||
| |||
|
#7
| ||||||
| ||||||
|
|
Hi Don, Thanks for your help! It's during the initialization period of my program so there is no writing operations. I put detailed answers below. On 8 31 , 2 58 , Don Anderson <don.ander... (AT) gmail (DOT) com> wrote: Hello, I don't know valgrind either, but I know quite a bit about BDB. You might want to ask your question on the Oracle forums for BDB, in the meantime, here are some thoughts. You say 'After a while, the cursor read process becomes very slow'. I guess that means it is running fast until then? It sounds like you are doing all read operations during this time? It becomes slow very quickly, after reading couple of dbs. Yes, all read operations. If not, there may be some write operations holding locks - possible contention. If you are doing writing, are you using transactions? No writing at this time. Has your cache been prewarmed - that is, is there reason to believe that the operations before the slowdown are running fast because the cache is already filled with pages from previous accesses? A quick calculation shows that raw data size alone: (16+8) * 500,000 * 250 => 3G. With typical overhead and page fill, the database sizes may be more like 5 or 6 G (BDB generally sacrifices space for speed) so if your testing is 'random' and you are doing complete database scans, it probably won't fit in 4G cache. I think the cache is not prewarmed (how to prewarm it?). |
|
I noticed a phenomena that, if I stop the program during the initializaiton and start it again, it will be very fast before it reaches where it's stopped last time. After that, the db read becomes slow again. If I clean the evironment (just delete those __db* files), the db read becomes slow very quickly. |
|
The total physical files' size of these dbs is about 7.6G. Yes I'm doing a complete database scan but I don't know whether it is random. I just put a DB_NEXT flag (sequence scan?) to the cursor and keep reading until finished. In my thought, the cache won't fit all the data, but is enough for one db. |
| If it isn't a cache issue, then maybe there's a resource that is leaking. Are all cursors closed when finished, and temporary memory freed? Yes You'd probably see process size starting to balloon if that is what is going on. Perhaps there's some other resource contention between the threads. What happens if you run with fewer (or even one) thread? It's also very slow even I run with only one thread. If I clean the environment, it's very fast when read the first 6 or 7 dbs, but becomes slow after that. |
|
Answers to any of these could be helpful clues. Another thing to look at is DB statistics. The first one to look at is the output from db_stat -m, which would tell you about the cache, The head part of db_stat -m: 4GB Total cache size 4 Number of caches 4 Maximum number of caches 1GB Pool individual cache size 0 Maximum memory-mapped file size 0 Maximum open file descriptors 0 Maximum sequential buffer writes 0 Sleep after writing maximum sequential buffers 0 Requested pages mapped into the process' address space 24M Requested pages found in the cache (98%) 360522 Requested pages not found in the cache 0 Pages created in the cache 360521 Pages read into the cache 0 Pages written from the cache to the backing file 0 Clean pages forced from the cache 0 Dirty pages forced from the cache 0 Dirty pages written by trickle-sync thread 360510 Current total page count 360510 Current clean page count 0 Current dirty page count 524284 Number of hash buckets used for page location 24M Total number of times hash chains searched for a page (24509250) 11 The longest hash chain searched for a page 28M Total number of hash chain entries checked for page (28559440) 0 The number of hash bucket locks that required waiting (0%) 0 The maximum number of times any hash bucket lock was waited for (0%) 0 The number of region locks that required waiting (0%) 0 The number of buffers frozen 0 The number of buffers thawed 0 The number of frozen buffers freed 361545 The number of page allocations 0 The number of hash buckets examined during allocations 0 The maximum number of hash buckets examined for an allocation 0 The number of pages examined during allocations 0 The max number of pages examined for an allocation 0 Threads waited on page I/O |
|
and db_stat -Co, which can give some clues about contention. Run these after the program slows down. The output can be big, so capture it in a file. The output of db_stat -Co: =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Lock REGINFO information: Lock Region type 8 Region ID data/__db.008 Region name 0x2affad7b6000 Original region address 0x2affad7b6000 Region address 0x2affad7b6068 Region primary address 0 Region maximum allocation 0 Region allocated Region allocations: 8394757 allocations, 0 failures, 0 frees, 1 longest REGION_JOIN_OK Region flags =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Locks grouped by object: Locker Mode Count Status ----------------- Object --------------- 17d READ 1 HELD #040.db handle 0 10b READ 1 HELD #002.db handle 0 114 READ 1 HELD #005.db handle 0 19c READ 1 HELD #050.db page 3987 126 READ 1 HELD #011.db handle 0 117 READ 1 HELD #006.db handle 0 165 READ 1 HELD #032.db handle 0 19c READ 1 HELD #050.db page 2584 10e READ 1 HELD #003.db handle 0 13e READ 1 HELD #019.db handle 0 171 READ 1 HELD #036.db handle 0 12f READ 1 HELD #014.db handle 0 15f READ 1 HELD #030.db handle 0 144 READ 1 HELD #021.db handle 0 186 READ 1 HELD #043.db handle 0 162 READ 1 HELD #031.db handle 0 12c READ 1 HELD #013.db handle 0 177 READ 1 HELD #038.db handle 0 174 READ 1 HELD #037.db handle 0 108 READ 1 HELD #001.db handle 0 105 READ 1 HELD #000.db handle 0 17a READ 1 HELD #039.db handle 0 13b READ 1 HELD #018.db handle 0 189 READ 1 HELD #044.db handle 0 102 READ 1 HELD sys.db handle 0 180 READ 1 HELD #041.db handle 0 156 READ 1 HELD #027.db handle 0 120 READ 1 HELD #009.db handle 0 159 READ 1 HELD #028.db handle 0 132 READ 1 HELD #015.db handle 0 138 READ 1 HELD #017.db handle 0 18f READ 1 HELD #046.db handle 0 123 READ 1 HELD #010.db handle 0 14d READ 1 HELD #024.db handle 0 19b READ 1 HELD #050.db handle 0 11d READ 1 HELD #008.db handle 0 168 READ 1 HELD #033.db handle 0 129 READ 1 HELD #012.db handle 0 198 READ 1 HELD #049.db handle 0 141 READ 1 HELD #020.db handle 0 135 READ 1 HELD #016.db handle 0 153 READ 1 HELD #026.db handle 0 147 READ 1 HELD #022.db handle 0 111 READ 1 HELD #004.db handle 0 150 READ 1 HELD #025.db handle 0 195 READ 1 HELD #048.db handle 0 11a READ 1 HELD #007.db handle 0 15c READ 1 HELD #029.db handle 0 192 READ 1 HELD #047.db handle 0 18c READ 1 HELD #045.db handle 0 16e READ 1 HELD #035.db handle 0 16b READ 1 HELD #034.db handle 0 14a READ 1 HELD #023.db handle 0 183 READ 1 HELD #042.db handle 0 |
#8
| |||||
| |||||
|
|
Hello Rabbit, Some comments inline. On Aug 30, 10:12 pm, Rabbit <McE... (AT) gmail (DOT) com> wrote: DB cache is a simple cache of database pages, when the cache fills up the least recently used page is kicked out. Prewarming just means that you've accessed the database in advance so the pages you need are already there. This all indicates that the performance differences are explained by things being in cache or out of cache. Memory speeds are a few orders of magnitude faster than disk speeds. The BDB cache operates differently than you probably think. It is not: here the guy is walking through a cursor, so I can predict what the next access should be and prefetch the next page. It is more like - the cursor visits all the data that is stored on a page and when it is done, it must read the next page in the DB file - that's a disk op that must wait until compilation. One thing I have to ask - is this the typical operation for your application, reading sequentially through an entire database? Because we are worried about optimizing this, let's be sure this is the important use case. What is the point for the scan? If you are searching for something, you might want to set up a secondary table instead to find your your data in a fraction of the time. Or if you need some tabulation (like a total of a column), you can maintain that as you are building the data. These things have costs of course, you just need to choose the use cases that matter to you. |
|
For optimizing this particular thing, you might try: using a BTREE database if you aren't already. That alone probably won't fix anything, but you could then reload the database: shutdown the database environment and application db_dump foo.db | db_load foo2.db mv foo2.db foo.db start the application That gets the order of the pages in the DB file to closely match the logical order of the keys, and a sequential scan would probably work best with most OS's which will do prefetching, even if BDB isn't. |
| That is the data that doesn't mesh with a typical cache problem - but if the OS is also caching than the OS would 'remember' a previous run. Another thing to try: Use the DB_ENV->set_mp_mmapsize() API with a small number, maybe 1 megabyte. This only makes a difference if you are opening your databases readonly. |
|
Also -- how big is the memory on your system? If it is not much more than 4 GB, then you have made your BDB cache too large, you are thrashing. Regardless, you might want to experiment with a *smaller* BDB cache to make sure. |
| No big surprises. Your cache hit rate is high - 98%. But because of the high volume, you still have 360522 misses, these are all waiting on disk. To verify this is the issue, perhaps do db_stat -m once every 10 seconds while you are running your test, and correlate it to your response time. The numbers in db_stat -m are cumulative, so you should be able to see lots of cache misses exactly during the times of the slowdown. With a little arithmetic, you can see what your disk access time is. Perhaps there are good reasons it is slow - is the disk on a network drive? |
![]() |
| Thread Tools | |
| Display Modes | |
| |