The primary assumption with partitioning and queries is that you can
subdivide a problem into small pieces and then only have to do a subset of
the pieces. For example, if you partition by month and have 60 months (and
thus 60 partitions), AND if you queries are restricted by month, such as
just the current month, then we only need to scan 1/60th of the amount of
data if you have an aggregation miss.
HOWEVER, if your query pattern does not match the partitioning scheme, then
partitioning can add overhead. Suppose you partitioned by month, and then
queried for a given product line across all months, what happens? The system
cannot determine which partition has what product line, so it has to scan
them all. It does this in parallel (one thread per partition) and then
merges the results. Suppose you were querying on a product line which was
brand new -- i.e. it only had records in the current month. The system
doesn't know that so it has to scan 60 partitions in parallel, of which 59
are empty and 1 has data in it.
Notice that I said an "aggregation miss" in the first paragraph. For the
most part, if you have a good aggregation design and/or a nice data cache in
memory, then hitting an aggregation with partitioning is just about as good
as if it was 1 partition or 60 partitions. As a worst case, the system has
to add 60 things together. The REAL difference is if you have an aggregation
miss and the system has to scan the fact table. In this case, there is a
huge difference if it has to scan all of the data (even if done in parallel)
and only having to scan 1/60th of the data.
This is why setting the data slice is so important (with SQL2K). It is kind
of like an optimizer hint. If the data slice is set, then the system knows
which partition subset to scan. Without it, it must scan all of them. Thus
without it you defeat the whole purpose of partitioning (as a way of
improving query time -- there are other good reasons to partition such as
being able to do a massive delete of data -- or only have to process a small
subset at a time -- but the heart of any partitioning scheme from a querying
point of view is:
1 - the partitioning schema must be useful for the query pattern -- if the
two don't align then you will have more overhead and doubtful if queries
will run faster.
2 - setting the data slice so the system knows how to subdivide the problem
so it can do a smaller piece then it otherwise would have done
Note that with SQL2K5 there are some major improvements -- particularly with
#2 above. In SQL2K5, with MOLAP structures, we now build a historgam-like
structure which tells us which members in the various dimensions, that a
partition uses. You don't have to set the data slice, we figure it out
already for you. You can just process the partitions and off you go. You
still have to set the data slice for a partition in ROLAP storage mode or if
you are using proactive caching with a ROLAP phase. This is because a basic
tenet with ROLAP is that we never move the core fact table. Since this
automatic detection of members is done when the MOLAP structures are being
created from the fact data -- and the data isn't copied with ROLAP -- that
means that the histogram-like structures are never built.
This automatic building of the data slice is important because it allows a
new type of partitioning technique which wasn't possible in SQL2K. Suppose
you have the case where one member represents 50% of the data and you would
like to partition based on that member. You would like 1 partition with that
member as the data slice; and a second partition which contains everything
else. With SQL2K, this would be very difficult to setup. This is because
there is no nature of "everything else" as a data slice. You would have to
actually enumerate each of the "everything else" members on the data slice.
Since SQL2K5 automatically detects what members are in a partition, you just
process the partitions. One partition would have a WHERE clause in the query
binding including "AND fact.dim_fk = member_key" and the other partition
would have "AND fact.dim_fk <> member_key". And bingo -- it just works.
Hope this helps.
--
Dave Wickert [MSFT]
dwickert (AT) online (DOT) microsoft.com
Program Manager
BI SystemsTeam
SQL BI Product Unit (Analysis Services)
--
This posting is provided "AS IS" with no warranties, and confers no rights.
"Rea" <Rea (AT) discussions (DOT) microsoft.com> wrote
Quote:
Hey eb
In my cube time domension's leaf level is months and a new partition
is dedicated to each month.
Calculated measures then use data from a number of months(partitions)
,for example to sum up data from begining of year, or to calculate
avarage values per month etc.
In this case can partitioning data to months partitions,harm performance
instead of improoving??
Has any one had a case where partitions impared perforemance?
TIA |