Flexible AND scalable data design -
03-07-2009
, 07:33 AM
I am working on the data design of a new version of a very popular
web-facing application. The application deals with ~3 million new customer
records per day. Each of these customers has a wide range of potential
information associated with them. The set of associated customer information
should be "flexible" i.e. the business wants to be able to define new pieces
of information easily depending on the type of customer.
Ignoring performance and scalability, an 'associative' data model design is
ideal. With smaller numbers, I think that design would be suitable. The
problem with the pure associative approach is that we will capture around
100 rows with each new customer record. So 3 million new customers produces
300 million rows of associative data.
I presume, however, with millions of new customers per day, the performance
hit of an associative data model would be unacceptable (compared to a
one-column per piece-of-information approach).
At a high-level, I have considered the following approaches:
Approach 1:
Use an associative model, with clustered indexes and table partitioning. Is
this feasible? What will be the performance hit? Any other tuning that I
could do to make this approach work?
Pros: Highly flexible design.
Cons: Slower, lots of rows
Approach 2:
2) Use a hybrid approach. Most of the data will come from a table with a
fixed number of variable-length character columns (say around 100 columns).
A separate table will define meta-data that determines the purpose of each
of the generic columns for the type of customer. If a customer type is
modified so that it requires more than 100 pieces of information an
associative spill-over table could be used. However, in most cases customer
data will be retrieved by selecting a single row from the data table.
Pros: The common case should be much faster
Cons: More complex design, lots of null values when row is not 100% filled
with data (i.e. in most cases). If the spill-over table is used, speed will
be even slower as two queries will be required. Finding the optimal number
of columns will be difficult.
What is the "cost" of a null value in a column in DB2?
Any feedback, opinions or advice would be appreciated. |