How to optimize processing time for a cube with too many dimension -
10-20-2004
, 01:13 PM
Situation:
Started as a typical sales data cube with your ususal suspect dimensions -
customer, product, store, geography, demographics, etc. With time, number of
dimensions started to grow. For e.g. demographics-related dimensions grew to
include age, gender, income level, education level, etc. etc.
purchase-related dimensions grew to include repeat-vs-new purchase,
promotion, channel, etc. etc. -- so, now we're looking at a cube with about
50+ dimensions. As you may have guessed, this is a bear to process, even when
we have sales_fact and customer rows in 100,000's.
Solution approaches:
You could say that we should redesign the cube and separate it out into
several different cubes, each containing a subset of these 50+ dimensions.
However, there is a real business need to have all these dimensions because
the analyst should have the capability of choosing any one of the
purchase-related dimension and "cross examine" it against any of the other
ones (i.e. distance-to-store vs channel, age vs product category, etc.) So,
just partitioning the dimensions solves the technical challenge at the cost
of not meeting all business needs.
Looking at the available techniques, it seems like cube partitioning helps
to some extent by partitioning fact table rows into separate manageable
chunks. But the problem of having too many dimesions to create aggregate
still hurts performance. However, what I'd really like to do is to have the
cube partitioned by dimensions - i.e. partition 1 has all the demographic
dimensions, partition 2 has all the geography dimensions, partition 3 has all
the product related dimensions, etc. I thought virtual cubes should allow
for this, but we havent' quite gotten it to work the way we want.
Question:
I'd greatly appreciate any pointers in terms of your own experiences in
dealing with similar situation, or references to best practices in dealing
with large cubes with a lot of dimensions. TIA |