dbTalk Databases Forums  

Star Schema and scalability

comp.databases.berkeley-db comp.databases.berkeley-db


Discuss Star Schema and scalability in the comp.databases.berkeley-db forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Mitch
 
Posts: n/a

Default Star Schema and scalability - 11-09-2005 , 04:44 PM






Hey,

I am considering BerkeleyDB as an alternative to a traditional
relational data warehouse solution (i.e. star schema).

Am I being foolhearty?

Our fact table often receives 10 million rows per day, with storage
capacity being the only real upper limit.

My current thinking is to implement the fact table as multiple
databases (ala partitioned tables in Oracle), with each fact database
covering a distinct time sequence (i.e. month/week). Each attribute
table will be a seperate database. My db environment will contain all
attribute databases, and some (MRU) subset of fact databases.

Each fact record currently consumes ~104 bytes, resulting in > 1Gig per
day.

I'm looking for opinions, is BerkeleyDB up to such a challenge? How
would this design perform relative to Oracle/MSSQL?

Any feedback would be greatly appreciated.

Thanks.
-Mitch


Reply With Quote
  #2  
Old   
david.boreham@gmail.com
 
Posts: n/a

Default Re: Star Schema and scalability - 11-10-2005 , 08:36 AM







Mitch wrote:

Quote:
I am considering BerkeleyDB as an alternative to a traditional
relational data warehouse solution (i.e. star schema).

Am I being foolhearty?
Short answer I think is 'no'.

Quote:
Our fact table often receives 10 million rows per day, with storage
capacity being the only real upper limit.

My current thinking is to implement the fact table as multiple
databases (ala partitioned tables in Oracle), with each fact database
covering a distinct time sequence (i.e. month/week). Each attribute
table will be a seperate database. My db environment will contain all
attribute databases, and some (MRU) subset of fact databases.

Each fact record currently consumes ~104 bytes, resulting in > 1Gig per
day.
I've seen single BDB files that were 10's of Gigs. The theoretical
maximum database size is very large indeed. I don't think you will
see any problems just from the volume of data you're storing.
The need for partitioning I'm not sure about : there are a few
different
reasons to partition. You'd need to say more about your goals
with partitioning I think (safety, performance, operations issues...).

Quote:
I'm looking for opinions, is BerkeleyDB up to such a challenge? How
would this design perform relative to Oracle/MSSQL?
I think performance would depend on the query execution code you build.
Insert performance and the bare-metal b-tree lookups should perform
at roughly the same level as other database products (they're all using
similar techniques). I'm not sure what concurrency control issues you
may have : if those are challenging then something with MVCC might
perform better (BDB uses page-level locking for concurrency control).

If it were me, the next thing I'd do is build a simple prototype
application
that can generate the type of data and access pattern that you expect
to have
against BDB. Run that thing and see how it performs.



Reply With Quote
  #3  
Old   
Mitch
 
Posts: n/a

Default Re: Star Schema and scalability - 11-10-2005 , 10:42 AM



WRT partitioning, the motivating factors are performance and data
lifecycle management. The idea is that as data ages, it is accessed
less frequently. For example, if we keep a full years worth of fact
data (using 52 weekly partitioned fact databases), it is unlikely that
users will access data more than a few weeks old, hence the MRU
approach. With this model we would keep all the attribute data online,
and only hold the most recently used fact databases open.

Data lifecycle management becomes easier with partitioning. With each
new week (partition), a new fact database is created, and the oldest is
archived and deleted.

Partitioning the fact database does complicate searching. We query
almost exclusively using time-based range queries. This means
analyzing the time range criteria of each query to determine which
database partitions to search (i.e. pruning).

WRT concurrency, as a data warehouse we are primarily an insert and
query model. We almost exclusively perform append inserts based on
timestamp. There are very few updates to our data. If updates occur at
all, it is very close to the 'top' of the fact database.

Building a prototype is exactly my goal. However, I need management
buy-in to allocate the time so I'll need to clear the standard set of
questions/objections. Hence this posting. Of course one of the big
questions on everyone's mind will be "What! no SQL?..." My counter
argument will be to reference the amount of time I spend generating
execution plans and tuning SQL statements that were written "in a
fraction of the time required to write queries in Java/C/etc."

Your responses are exactly the kind of information I am looking for,
and are much appreciated.

Thank you.
-Mitch


Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.