Processing efficiency - updated fact table -
12-10-2004
, 12:48 PM
Here is a summary of my problem...
Activity fact table (in MS-SQL) with following schema:
ID bigint
Location varchar(20)
Time DateTime
Time1 DateTime
Time2 DateTime
There are other columns from which dimensions will be built, but I
believe these are enough to illustrate the problem.
"Time" is when the Activity was "opened". Time1 and Time2 are for
followup events related to the Activity. They will be NULL when
INSERTed, but may be UPDATEd later. Only the Activity with the most
recent Time value may be updated for a given Location.
There could be over 10M Activities, but typically only about 10K
distinct Locations. So at any given time only about 10K Activities are
"open" for UPDATE.
There will be dimensions for Location and Time. I need a strategy to
efficiently process a cube(s) with this fact table.
Most obvious approach I see is to have two relational tables:
1. One for new Activities (tblNew) which has the most recent Activity
for each distinct Location.
2. Another for old Activities (tblOld) which has all the rest.
Then I have a cube for each. The cube for tblNew must be processed
fully each time, given that its fact table's entries may been updated.
The cube for tblOld could be processed incrementally. Then have a
virtual cube to "merge" the two to a single cube.
Then the problem reduces to moving Activities from tblNew to tblOld
when a new Activity is created for the same Location.
I have thought if I could have a separate "OldID" column that got
auto-set/incremented only when a new Activity is created for the same
location, then I could use just one fact table. The two cubes would
filter based on the this OldID being NULL. And for incremental cube
processing of old Activities, the OldID would be used to filter for
only the new entries.
Any suggestions on how to approach this problem?
Thanks,
LMcPhee |