dbTalk Databases Forums  

real data vs. db file

Discuss real data vs. db file in the comp.databases.berkeley-db forum.



  #11  
Philip Guenther
 

Re: real data vs. db file - 06-02-2006, 02:22 AM






likun.navipal (AT) gmail (DOT) com wrote:
Quote:
Please take a look at my source code; 1024*128 items have been
inserted. If there is only one item in the db, why is the db file so large?

Oops, my mistake: you're using subdatabases. So the single 'item'
shown by db_stat -d is the subdatabase in which all your data is
actually held. Unless you have a particular reason for using
subdatabases (bundling of multiple indexes, etc) I would recommend
passing NULL as the 'database' argument to DB->open(). Subdatabases
add a couple pages of overhead of their own, plus the extra complexity.

For now, you can get the stats for the _real_ B-tree using both the -d
option and the -s option. The latter should be supplied the name of the
subdatabase, as stored in the 'table_name' variable in your code.
Since you're using a real environment you'll also want to specify the
path to the environment home using the -h option, like so:

db_stat -h db_home -d table_file_name -s table_name


Btw, I hope your code was intended as a sample only. It's obviously
incomplete---several variable declarations are missing---but if you
really did leave out the DB->close() and DBENV->close() calls then
you're going to have all sorts of problems. The Concurrent Data Store
mode enabled by the DB_INIT_CDB flag does _not_ guarantee
recoverability after an unclean close.

Anyway, when I take your code, add the missing bits, run it with the
default data size of 1024, and then do the db_stat above on the result,
I get:

Fri Jun 2 00:09:45 2006 Local time
53162 Btree magic number
9 Btree version number
Little-endian Byte order
multiple-databases Flags
2 Minimum keys per-page
4096 Underlying database page size
3 Number of levels in the tree
131072 Number of unique keys in the tree
131072 Number of data items in the tree
30 Number of tree internal pages
56764 Number of bytes free in tree internal pages (53% ff)
2157 Number of tree leaf pages
4064398 Number of bytes free in tree leaf pages (53% ff)
0 Number of tree duplicate pages
0 Number of bytes free in tree duplicate pages (0% ff)
131072 Number of tree overflow pages
399M Number of bytes free in tree overflow pages (25% ff)
0 Number of empty pages
0 Number of pages on the free list


So, a 53% fill-factor for the tree itself and 25% for the overflow
pages. That makes sense: the low fill-factor for the tree is caused by
the out-of-order insertion, while the low fill-factor for the overflow
pages is a direct result of the 4kB page size with 1kB items. Each item
sits alone on its own overflow page (131072 items, 131072 overflow
pages), so roughly three quarters of every such page is empty. Indeed,
for 4kB pages, data items of 1kB are pretty much pessimal: if they were
a bit smaller they wouldn't be put on overflow pages at all, and as it
is they could grow to almost 4kB in size without using any additional
file space.

We can confirm the "out-of-order causes 53% ff" deduction easily enough
by inserting the entries in order by changing the sprintf() format to
"aaaa_key_%06d". The average key length will actually increase with
that, but when we run it again and check db_stat, the relevant lines
now show:

10 Number of tree internal pages
5368 Number of bytes free in tree internal pages (86% ff)
1171 Number of tree leaf pages
47378 Number of bytes free in tree leaf pages (99% ff)

Yep, 1/3 the internal pages and a bit over 1/2 the leaf pages compared
to the out-of-order insertion, despite the larger average key size.


Now, what to do about the overflow pages? Well, increasing the
page size used would permit all the values to go on the primary pages
where they would be packed together. So, let's increase the page size
from 4kB to 16kB using DB->set_pagesize() and see what happens:

Fri Jun 2 01:15:59 2006 Local time
53162 Btree magic number
9 Btree version number
Little-endian Byte order
multiple-databases Flags
2 Minimum keys per-page
16384 Underlying database page size
3 Number of levels in the tree
131072 Number of unique keys in the tree
131072 Number of data items in the tree
19 Number of tree internal pages
29476 Number of bytes free in tree internal pages (90% ff)
9363 Number of tree leaf pages
15M Number of bytes free in tree leaf pages (90% ff)
0 Number of tree duplicate pages
0 Number of bytes free in tree duplicate pages (0% ff)
0 Number of tree overflow pages
0 Number of bytes free in tree overflow pages (0% ff)
0 Number of empty pages
0 Number of pages on the free list


Poof! No more overflow pages with their low fill-factor and the file
was only 146.64MB compared to the 516.625MB file of the original 4kB
page, out-of-order insertion version.


Philip Guenther



  #12  
AT
 

Re: real data vs. db file - 06-02-2006, 09:15 AM






I use subdatabases for the following reasons:

The program will save the output data of many devices. At first, I
wanted to create one subdatabase for each device, but the number of
devices can go up to 50,000! So I decided to create one subdatabase for
several devices. If I put all the devices' data into one subdatabase, I
think it would not be efficient.
Maybe this decision is wrong.

The source code I posted above is just a sample.

In my program, a device's output data is just 4 or 8 bytes long, but
there are caches. Each cache is divided into several blocks, and each
block is 1024 bytes. Most devices' data will be compressed, which means
a block may be compressed down to tens of bytes before it is put into
Berkeley DB. So, what I am concerned about is: the data size is 1024
bytes at most, and usually tens of bytes. Because each subdatabase holds
several devices' data, the data put into Berkeley DB is not ordered.

I used db_stat -d -s to get the statistics for a subdatabase, and I
found that, for unordered data, the fill-factor for data sizes from 16
and 32 up to 512 bytes is 45% to 60%. I tried page sizes of 4K and 8K;
the fill-factor only changes when the data size is 1024.
It is a little sad to see that the fill-factors are not very high.

Thanks for your reply.
