dbTalk Databases Forums  

Better approach for common business problem ? (long post)

microsoft.public.sqlserver.olap microsoft.public.sqlserver.olap


Discuss Better approach for common business problem ? (long post) in the microsoft.public.sqlserver.olap forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Tom VdP
 
Posts: n/a

Default Better approach for common business problem ? (long post) - 10-04-2004 , 02:03 PM







[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
... x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom



Reply With Quote
  #2  
Old   
Jamie
 
Posts: n/a

Default RE: Better approach for common business problem ? (long post) - 10-04-2004 , 02:53 PM






Tom,
I am wrestling with a similar problem myself at the moment and from what
I've just read I have more to learn from you than you have from me.

My best possible answers to your questions:
1) I have tried to fathom a different approach but every path led me back to
one of your proposed solutions.
2) Don't really know. All I can say is that forget any preconceived notion
you may have that the MDX WHERE clause is like the SQL WHERE clause - it
isn't.
Think about what this is doing. If you add an extra slicer then each tuple
that you are evaluating contains more non-All member co-ordinates (e.g.
(Dim.member, SlicerDim.member) rather than the simpler (Dim.member))
therefore it is less likely that the member has been pre-aggregated in the
cube.
Adding a slicer does NOT limit the amount of data being pulled back nor the
number of dimensions that define a cell. Each cell is dimensioned by every
dimension - by specifying a slicer all you are doing is telling AS not to use
the default member for that dimension.
This is my best guess and would welcome some other opinions on the subject.
3) Use MOLAP. Use the maximum number of aggregations possible. [I'm sure
you've done these things already.] Make sure the estimated row counts for
your fact tables are accurate therefore allowing AS to calculate better
aggregations.
4) No idea. Sorry.

Good luck in unearthing answers to these questions. I'm eager to see what
the answers are. Hopefully someone from the AS team in Microsoft will be
checking this out.

Regards
Jamie Thomson


"Tom VdP" wrote:

Quote:
[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
.. x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom



Reply With Quote
  #3  
Old   
Chris Webb
 
Posts: n/a

Default RE: Better approach for common business problem ? (long post) - 10-05-2004 , 04:27 AM



Hi Tom,

First of all, thanks for the detailed explanation. It makes it much easier
to suggest solutions...

On the modelling side, you might be interested to read the following posting
on how I generally approach this kind of problem
http://groups.google.co.uk/groups?hl...40TK2MSFTNGP12
With a little bit of work you would be able to adapt this to your own
scenario, which might allow you to run these kind of queries from any
front-end; was this requirement the reason why you built your own front-end?

Also, does your customer dimension have any levels or is it just a flat
list? If the latter, I would strongly recommend putting in two or three
levels to group the customers even if you don't have a business need for them
and end up making them invisible. It should also help.

Regarding MDX, the most efficient queries are always going to be the ones
using NONEMPTYCROSSJOIN rather than FILTER. Looking at your second query, you
might find that performance improves if you get rid of the unnecessary
EXTRACTs and put the contents of your WHERE clause into the NECJ function
too. Here's what it might look like (nb might be syntax errors!):

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize
NonEmptyCrossjoin([DimClient].Members,{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1))'

set [x-axis-products]
as 'Hierarchize(
Generate(
NonEmptyCrossjoin([DimProduct].[SubProduct].members,
{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1)
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent})'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


However, I'm not sure that any of the above will make a real difference to
performance, which is your biggest issue. If you've read the AS Performance
Guide you've probably already got a good partitioning strategy (by month I
assume) and run the Usage-Based Optimisation wizard. However, I think the
main reason you've got poor performance is that your queries run at the
Client level, which has 12000 members and is unlikely to have any
aggregations associated with it because of the 1/3 rule. In this case it
might be worth building your own aggregations at the Client level using the
Partition Manager tool available in the SQL 2000 Resource Kit, and seeing if
that improves matters (although it would increase the size of the cube and
processing times). For hints on how to know which aggregations to build when
you're doing it manually, have a look at this thread
http://groups.google.co.uk/groups?hl...com%26rnum%3D1

....and the following, which contains a better explanation of how to
understand the query log:
http://www.mosha.com/msolap/articles/querylog.htm

Because you have control over your front-end, it might also be worth playing
around with some connection string properties. The infamous Execution
Location/Default Isolation Mode pairing is always worth a try, as is trying
different values for Large Level Threshold. Also, in your case, try values of
0 to 7 for the Cache Policy property, and try values of 0.01 for CacheRatio
and CacheRatio2.

Finally, my last observation is that any report where you expect to get up
to 12000 rows and 120 columns is always going to perform slowly, and is more
data than anyone can take in. It needs to be broken down into more digestible
chunks --> smaller queries!

HTH,

Chris



"Tom VdP" wrote:

Quote:
[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
.. x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom



Reply With Quote
  #4  
Old   
Chris Webb
 
Posts: n/a

Default RE: Better approach for common business problem ? (long post) - 10-05-2004 , 08:05 AM



Oh, and one other important point - make sure you have SP3 PTS installed on
your client, it can sometimes make a big difference with performance.

"Chris Webb" wrote:

Quote:
Hi Tom,

First of all, thanks for the detailed explanation. It makes it much easier
to suggest solutions...

On the modelling side, you might be interested to read the following posting
on how I generally approach this kind of problem:
http://groups.google.co.uk/groups?hl...40TK2MSFTNGP12
With a little bit of work you would be able to adapt this to your own
scenario, which might allow you to run these kind of queries from any
front-end; was this requirement the reason why you built your own front-end?

Also, does your customer dimension have any levels or is it just a flat
list? If the latter, I would strongly recommend putting in two or three
levels to group the customers even if you don't have a business need for them
and end up making them invisible. It should also help.

Regarding MDX, the most efficient queries are always going to be the ones
using NONEMPTYCROSSJOIN rather than FILTER. Looking at your second query, you
might find that performance improves if you get rid of the unnecessary
EXTRACTs and put the contents of your WHERE clause into the NECJ function
too. Here's what it might look like (nb might be syntax errors!):

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
NonEmptyCrossjoin([DimClient].Members,{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1))'

set [x-axis-products]
as 'Hierarchize(
Generate(
NonEmptyCrossjoin([DimProduct].[SubProduct].members,
{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1)
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent})'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


However, I'm not sure that any of the above will make a real difference to
performance, which is your biggest issue. If you've read the AS Performance
Guide you've probably already got a good partitioning strategy (by month I
assume) and run the Usage-Based Optimisation wizard. However, I think the
main reason you've got poor performance is that your queries run at the
Client level, which has 12000 members and is unlikely to have any
aggregations associated with it because of the 1/3 rule. In this case it
might be worth building your own aggregations at the Client level using the
Partition Manager tool available in the SQL 2000 Resource Kit, and seeing if
that improves matters (although it would increase the size of the cube and
processing times). For hints on how to know which aggregations to build when
you're doing it manually, have a look at this thread:
http://groups.google.co.uk/groups?hl...com%26rnum%3D1

...and the following, which contains a better explanation of how to
understand the query log:
http://www.mosha.com/msolap/articles/querylog.htm

Because you have control over your front-end, it might also be worth playing
around with some connection string properties. The infamous Execution
Location/Default Isolation Mode pairing is always worth a try, as is trying
different values for Large Level Threshold. Also, in your case, try values of
0 to 7 for the Cache Policy property, and try values of 0.01 for CacheRatio
and CacheRatio2.

Finally, my last observation is that any report where you expect to get up
to 12000 rows and 120 columns is always going to perform slowly, and is more
data than anyone can take in. It needs to be broken down into more digestible
chunks --> smaller queries!

HTH,

Chris



"Tom VdP" wrote:


[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
.. x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom



Reply With Quote
  #5  
Old   
Tom VdP
 
Posts: n/a

Default RE: Better approach for common business problem ? (long post) - 10-09-2004 , 11:13 AM




Hi Chris, Jamie, rest of the world,

Sorry for not responding earlier. Here is a roundup of the situation.

First of all, a bit of info about the front-end. The front-end is written
in Excel VBA because the customer wants quite a bit of specific behaviour and
layout. Rows and columns can be collapsed/expanded, context menus appear on
right-click that use the specific cell-location info (e.g. filtering, linking
to another report, sort options), popup windows (mainly for entering report
parameters), ... The front-end creates the necessary MDX dynamically.
This development costed about 2 manmonths of work.

Cubes:
There are 5 cubes, 2 virtual cubes, largest cube has 7 dimensions, 5
measures, 2 calculated measures. Dimension security in about 30 groups. The
amount of data is very little: 2 million fact rows. The cubes fit in 10Mb
files.

Necessity for optimisation:
Some reports return a lot of data: 12.000 rows x 120 columns. Loading this
in Excel is slow, but on top of that getting the data out of the cube also
takes a quite some time: up to 2 minutes. The returned cells are sums over
variable periods, so these must be calculated ad hoc. Loading the data and
formatting takes another minute or so.


About the optimisation proces:
First of all, I did read and tried to apply the Performance Guide. So the
tweaking with connection string options and so was done. Also the usage
based optimisation and other settings server side were applied. Latest
service packs applied everywhere.

Extra things I did because of the hints Chris gave:

1) MDX: use the NECJ version instead of the Filter() version
2) MDX : work away the Extract()
Indeed, I think I got confused by the examples in BOL and the fact that NECJ
must not necessarily return a typical crossjoined set. It can be a set
consisting of members of 1 dimension only.

--> result : no measurable differences


3) Performance monitor: measure DSN Used & DSN Requested.
VERY interesting! I noticed quite some room for uptimisation here!
So I added aggregations manually via the Partition Manager tool (for those
looking for it: it is available in the SQLServer2000 Resource Kit).
All queries that were logged had the same level DSN Used & DSN Requested.

--> result: disappointing! again no measurable difference.

(Sidenote question: how do you log ALL DSN Used & DSN Requested ? The
perfmon only polls once/second. And the querylog.mdb does not contain DSN
Used. What a mistake!)

4) Partition the cubes
Here I must admit that I still do not grasp the finesses of correct
partitioning... I partioned over months (-> 284 partitions!?)

--> result: much worse performance.


What I did not yet investigate is Cache Policy & Cache Ratio.


So all in all a disappointing result. Yet I must say that I am very glad I
did all this, because it makes me feel more confident when facing the
customer.


Still pending: why is it so much slower in a client/server environment than
in a single-pc setup ? Could it be because of the dimension security ?


Many thanks (especially to Chris!)
Tom


"Chris Webb" wrote:

Quote:
Oh, and one other important point - make sure you have SP3 PTS installed on
your client, it can sometimes make a big difference with performance.

"Chris Webb" wrote:

Hi Tom,

First of all, thanks for the detailed explanation. It makes it much easier
to suggest solutions...

On the modelling side, you might be interested to read the following posting
on how I generally approach this kind of problem:
http://groups.google.co.uk/groups?hl...40TK2MSFTNGP12
With a little bit of work you would be able to adapt this to your own
scenario, which might allow you to run these kind of queries from any
front-end; was this requirement the reason why you built your own front-end?

Also, does your customer dimension have any levels or is it just a flat
list? If the latter, I would strongly recommend putting in two or three
levels to group the customers even if you don't have a business need for them
and end up making them invisible. It should also help.

Regarding MDX, the most efficient queries are always going to be the ones
using NONEMPTYCROSSJOIN rather than FILTER. Looking at your second query, you
might find that performance improves if you get rid of the unnecessary
EXTRACTs and put the contents of your WHERE clause into the NECJ function
too. Here's what it might look like (nb might be syntax errors!):

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
NonEmptyCrossjoin([DimClient].Members,{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1))'

set [x-axis-products]
as 'Hierarchize(
Generate(
NonEmptyCrossjoin([DimProduct].[SubProduct].members,
{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1)
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent})'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


However, I'm not sure that any of the above will make a real difference to
performance, which is your biggest issue. If you've read the AS Performance
Guide you've probably already got a good partitioning strategy (by month I
assume) and run the Usage-Based Optimisation wizard. However, I think the
main reason you've got poor performance is that your queries run at the
Client level, which has 12000 members and is unlikely to have any
aggregations associated with it because of the 1/3 rule. In this case it
might be worth building your own aggregations at the Client level using the
Partition Manager tool available in the SQL 2000 Resource Kit, and seeing if
that improves matters (although it would increase the size of the cube and
processing times). For hints on how to know which aggregations to build when
you're doing it manually, have a look at this thread:
http://groups.google.co.uk/groups?hl...com%26rnum%3D1

...and the following, which contains a better explanation of how to
understand the query log:
http://www.mosha.com/msolap/articles/querylog.htm

Because you have control over your front-end, it might also be worth playing
around with some connection string properties. The infamous Execution
Location/Default Isolation Mode pairing is always worth a try, as is trying
different values for Large Level Threshold. Also, in your case, try values of
0 to 7 for the Cache Policy property, and try values of 0.01 for CacheRatio
and CacheRatio2.

Finally, my last observation is that any report where you expect to get up
to 12000 rows and 120 columns is always going to perform slowly, and is more
data than anyone can take in. It needs to be broken down into more digestible
chunks --> smaller queries!

HTH,

Chris



"Tom VdP" wrote:


[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
.. x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom



Reply With Quote
  #6  
Old   
Chris Webb
 
Posts: n/a

Default RE: Better approach for common business problem ? (long post) - 10-12-2004 , 05:11 AM



Hi Tom,

First, I'll respond to your numbered points:
1,2) No surprise here that it didn't make much difference, unfortunately. It
was worth a try though.
3) If DSN Requested is always the same as DSN Used, then you're always
hitting an aggregation and there's no point building any more. However, can
you tell me how many subqueries each query results in? You can find this out
by looking at the Query Number counter. Lots of subqueries can sometimes mean
a certain network overhead, and the Cache Ratio and Cache Policy properties
can improve this. There is, as you've noticed, no satisfactory way of looking
at the DSN Used counter; I wrote a simple .NET app once which did this fairly
well, but not 100% accurately. Also, if lots of subqueries are generated then
you will find that turning off query logging on the server may have a
noticeable effect on query performance.
4) The fact that you got worse performance after partitioning is very
strange. Did you set the slice property on each partition properly? You can
check this by running the Edit wizard on the partition in Analysis Manager -
you set it on the second step. Partitioning for this kind of app should
improve performance, and I can only think that if you did set the slice then
the effect of reprocessing and therefore dropping the cache led to the
worsened performance.

The fact you have dimension security on your cube makes me think you might
not have as much memory available on the server for caching as you could. Can
you give us some more details on which dimensions you have it set up on? Is
it your client dimension?

Dimension security won't explain the difference between a client-server and
single PC setup - the issue there will be how much work is done on the
client, and how much data is moved over the network. What do you see in Task
Manager and the client and the server when you run queries in the
client-server setup? I suspect that the real cause of the performance problem
is that you have to sum up large amounts of data across your variable time
ranges, something which is always going to take a lot of time; if this is
taking place on the client rather than the server, then it will take even
longer.

Leading on from this topic, since you have control over the MDX used you
might be able to cut the amount of time needed for this summing by writing
clever MDX and building aggregations. The key question is: how large are the
time ranges that your users typically select? If they are larger than two or
three months then you could do the following:
1) Build some extra levels into your time dimension, maybe by grouping two
months together into a level of six members, and then grouping these members
together into twos making a level of three members. You should hide these
levels from the user.
2) In your MDX, whenever you have a time range that crosses a set of months
that all belong to one common parent, then instead of summing up the
individual months then include the common parent then substitute the common
parent instead. So, for example, imagine you were querying a range of months
1-9; instead of summing up nine months, if you had the dimension structure
proposed in (1) you would end up summing three members: month 9 and the
higher level members which are the common parents of months 1-4 and months
5-8. I wouldn't recommend trying to do this substitution in MDX - it will be
much easier to do in code in your front-end!
3) Build aggregations at these new higher levels (and the year level) so
that when your query hits them, it doesn't just go ahead and sum the data
from the month level anyway. The result of this will be that the majority of
the summing that you're doing at query time will no longer be necessary,
because most of it will already have been done when you're building the
aggregations. Using the example above, a sum of 9 members will be turned into
a sum of three members, hopefully meaning that the query only takes 1/3 of
the time.

Incidentally, are you connecting to the server 'natively' or using http? The
latter is less efficient than the former, and you should only use it if you
need to; there's also a Windows 2003 bug relating to it than can impact
performance too.

However, my overall feeling is that I'm not surprised your queries take 2
minutes if they return large amounts of data and need to do summing up as
well. Any report that returns 120*12000 cells isn't what I would define as an
OLAP-style report, and might well perform better in a relational database!
Are you sure you can't redesign the app to get it to return data in smaller
queries?

Regards,

Chris





"Tom VdP" wrote:

Quote:
Hi Chris, Jamie, rest of the world,

Sorry for not responding earlier. Here is a roundup of the situation.

First of all, a bit of info about the front-end. The front-end is written
in Excel VBA because the customer wants quite a bit of specific behaviour and
layout. Rows and columns can be collapsed/expanded, context menus appear on
right-click that use the specific cell-location info (e.g. filtering, linking
to another report, sort options), popup windows (mainly for entering report
parameters), ... The front-end creates the necessary MDX dynamically.
This development costed about 2 manmonths of work.

Cubes:
There are 5 cubes, 2 virtual cubes, largest cube has 7 dimensions, 5
measures, 2 calculated measures. Dimension security in about 30 groups. The
amount of data is very little: 2 million fact rows. The cubes fit in 10Mb
files.

Necessity for optimisation:
Some reports return a lot of data: 12.000 rows x 120 columns. Loading this
in Excel is slow, but on top of that getting the data out of the cube also
takes a quite some time: up to 2 minutes. The returned cells are sums over
variable periods, so these must be calculated ad hoc. Loading the data and
formatting takes another minute or so.


About the optimisation proces:
First of all, I did read and tried to apply the Performance Guide. So the
tweaking with connection string options and so was done. Also the usage
based optimisation and other settings server side were applied. Latest
service packs applied everywhere.

Extra things I did because of the hints Chris gave:

1) MDX: use the NECJ version instead of the Filter() version
2) MDX : work away the Extract()
Indeed, I think I got confused by the examples in BOL and the fact that NECJ
must not necessarily return a typical crossjoined set. It can be a set
consisting of members of 1 dimension only.

--> result : no measurable differences


3) Performance monitor: measure DSN Used & DSN Requested.
VERY interesting! I noticed quite some room for uptimisation here!
So I added aggregations manually via the Partition Manager tool (for those
looking for it: it is available in the SQLServer2000 Resource Kit).
All queries that were logged had the same level DSN Used & DSN Requested.

--> result: disappointing! again no measurable difference.

(Sidenote question: how do you log ALL DSN Used & DSN Requested ? The
perfmon only polls once/second. And the querylog.mdb does not contain DSN
Used. What a mistake!)

4) Partition the cubes
Here I must admit that I still do not grasp the finesses of correct
partitioning... I partioned over months (-> 284 partitions!?)

--> result: much worse performance.


What I did not yet investigate is Cache Policy & Cache Ratio.


So all in all a disappointing result. Yet I must say that I am very glad I
did all this, because it makes me feel more confident when facing the
customer.


Still pending: why is it so much slower in a client/server environment than
in a single-pc setup ? Could it be because of the dimension security ?


Many thanks (especially to Chris!)
Tom


"Chris Webb" wrote:

Oh, and one other important point - make sure you have SP3 PTS installed on
your client, it can sometimes make a big difference with performance.

"Chris Webb" wrote:

Hi Tom,

First of all, thanks for the detailed explanation. It makes it much easier
to suggest solutions...

On the modelling side, you might be interested to read the following posting
on how I generally approach this kind of problem:
http://groups.google.co.uk/groups?hl...40TK2MSFTNGP12
With a little bit of work you would be able to adapt this to your own
scenario, which might allow you to run these kind of queries from any
front-end; was this requirement the reason why you built your own front-end?

Also, does your customer dimension have any levels or is it just a flat
list? If the latter, I would strongly recommend putting in two or three
levels to group the customers even if you don't have a business need for them
and end up making them invisible. It should also help.

Regarding MDX, the most efficient queries are always going to be the ones
using NONEMPTYCROSSJOIN rather than FILTER. Looking at your second query, you
might find that performance improves if you get rid of the unnecessary
EXTRACTs and put the contents of your WHERE clause into the NECJ function
too. Here's what it might look like (nb might be syntax errors!):

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
NonEmptyCrossjoin([DimClient].Members,{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1))'

set [x-axis-products]
as 'Hierarchize(
Generate(
NonEmptyCrossjoin([DimProduct].[SubProduct].members,
{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1)
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent})'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


However, I'm not sure that any of the above will make a real difference to
performance, which is your biggest issue. If you've read the AS Performance
Guide you've probably already got a good partitioning strategy (by month I
assume) and run the Usage-Based Optimisation wizard. However, I think the
main reason you've got poor performance is that your queries run at the
Client level, which has 12000 members and is unlikely to have any
aggregations associated with it because of the 1/3 rule. In this case it
might be worth building your own aggregations at the Client level using the
Partition Manager tool available in the SQL 2000 Resource Kit, and seeing if
that improves matters (although it would increase the size of the cube and
processing times). For hints on how to know which aggregations to build when
you're doing it manually, have a look at this thread:
http://groups.google.co.uk/groups?hl...com%26rnum%3D1

...and the following, which contains a better explanation of how to
understand the query log:
http://www.mosha.com/msolap/articles/querylog.htm

Because you have control over your front-end, it might also be worth playing
around with some connection string properties. The infamous Execution
Location/Default Isolation Mode pairing is always worth a try, as is trying
different values for Large Level Threshold. Also, in your case, try values of
0 to 7 for the Cache Policy property, and try values of 0.01 for CacheRatio
and CacheRatio2.

Finally, my last observation is that any report where you expect to get up
to 12000 rows and 120 columns is always going to perform slowly, and is more
data than anyone can take in. It needs to be broken down into more digestible
chunks --> smaller queries!

HTH,

Chris



"Tom VdP" wrote:


[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
.. x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom



Reply With Quote
  #7  
Old   
Chris Webb
 
Posts: n/a

Default RE: Better approach for common business problem ? (long post) - 10-15-2004 , 04:42 AM



If you're interested, here's the MDX to do the member substitution I talked
about in my last post, using the Foodmart Sales cube. It's a worked example
showing how an input set of {[Time].[1997].[Q1].[1]:[Time].[1998].[Q2].[5]}
can be transformed into a set containing {[Time].[1997], [Time].[1998].[Q1],
[Time].[1998].[Q2].[4], [Time].[1998].[Q2].[5]} - which, as I said, would be
much quicker to SUM if you had built the right aggregations on your cube.

WITH
/*The initial time range expressed in months*/
SET MYRANGE AS '{[Time].[1997].[Q1].[1]:[Time].[1998].[Q2].[5]}'
/*Find out which whole years are present*/
SET YEARSPRESENT AS '
/*Loop over all Years from the ancestor of the first month in the range to
the ancestor of the last month*/
GENERATE(
{ANCESTOR(MYRANGE.ITEM(0), [Time].[Year]) : ANCESTOR(TAIL(MYRANGE,
1).ITEM(0), [Time].[Year])}
/*Need to use nasty string functions to get IIF to return members or sets.
Yukon does not have this restriction!*/
, STRTOSET(
/*If all of the descendants at month level of the current year appear in the
range, we can replace them with the year member*/
IIF(COUNT(INTERSECT(MYRANGE, DESCENDANTS(TIME.CURRENTMEMBER,
[Time].[Month]))) = COUNT(DESCENDANTS(TIME.CURRENTMEMBER, [Time].[Month]))
, "{" + MEMBERTOSTR(TIME.CURRENTMEMBER) + "}"
, "{}")
)
)'
/*As an intermediate step, remove all the months from the initial range
which will be replaced by whole years*/
SET MONTHSLEFT AS 'EXCEPT(MYRANGE, DESCENDANTS(YEARSPRESENT,
[Time].[Month]))'
/*Now go through all the remaining months and see which ones have whole
quarters present, using the same logic as above*/
SET QUARTERSPRESENT AS 'GENERATE(
{ANCESTOR(MONTHSLEFT.ITEM(0), [Time].[Quarter]) : ANCESTOR(TAIL(MONTHSLEFT,
1).ITEM(0), [Time].[Quarter])}
, STRTOSET(
IIF(COUNT(INTERSECT(MONTHSLEFT, DESCENDANTS(TIME.CURRENTMEMBER,
[Time].[Month]))) = COUNT(DESCENDANTS(TIME.CURRENTMEMBER, [Time].[Month]))
, "{" + MEMBERTOSTR(TIME.CURRENTMEMBER) + "}"
, "{}")
)
)'
/*Finally return the new range, with years and quarters substituted for
months where possible*/
SET MYNEWRANGE AS '{YEARSPRESENT, QUARTERSPRESENT, EXCEPT(MYRANGE,
DESCENDANTS({YEARSPRESENT, QUARTERSPRESENT}, [Time].[Month]))}'
/*Helper calculated members to display the steps*/
MEMBER MEASURES.STEP1 AS 'SETTOSTR(MYRANGE)'
MEMBER MEASURES.STEP2 AS 'SETTOSTR(YEARSPRESENT)'
MEMBER MEASURES.STEP3 AS 'SETTOSTR(MONTHSLEFT)'
MEMBER MEASURES.STEP4 AS 'SETTOSTR(QUARTERSPRESENT)'
MEMBER MEASURES.STEP5 AS 'SETTOSTR(MYNEWRANGE)'
/*Show working*/
SELECT {MEASURES.STEP1, MEASURES.STEP2, MEASURES.STEP3, MEASURES.STEP4,
MEASURES.STEP5} ON 0
FROM SALES


Let me know if you have any questions...

"Chris Webb" wrote:

Quote:
Hi Tom,

First, I'll respond to your numbered points:
1,2) No surprise here that it didn't make much difference, unfortunately. It
was worth a try though.
3) If DSN Requested is always the same as DSN Used, then you're always
hitting an aggregation and there's no point building any more. However, can
you tell me how many subqueries each query results in? You can find this out
by looking at the Query Number counter. Lots of subqueries can sometimes mean
a certain network overhead, and the Cache Ratio and Cache Policy properties
can improve this. There is, as you've noticed, no satisfactory way of looking
at the DSN Used counter; I wrote a simple .NET app once which did this fairly
well, but not 100% accurately. Also, if lots of subqueries are generated then
you will find that turning off query logging on the server may have a
noticeable effect on query performance.
4) The fact that you got worse performance after partitioning is very
strange. Did you set the slice property on each partition properly? You can
check this by running the Edit wizard on the partition in Analysis Manager -
you set it on the second step. Partitioning for this kind of app should
improve performance, and I can only think that if you did set the slice then
the effect of reprocessing and therefore dropping the cache led to the
worsened performance.

The fact you have dimension security on your cube makes me think you might
not have as much memory available on the server for caching as you could. Can
you give us some more details on which dimensions you have it set up on? Is
it your client dimension?

Dimension security won't explain the difference between a client-server and
single PC setup - the issue there will be how much work is done on the
client, and how much data is moved over the network. What do you see in Task
Manager and the client and the server when you run queries in the
client-server setup? I suspect that the real cause of the performance problem
is that you have to sum up large amounts of data across your variable time
ranges, something which is always going to take a lot of time; if this is
taking place on the client rather than the server, then it will take even
longer.

Leading on from this topic, since you have control over the MDX used you
might be able to cut the amount of time needed for this summing by writing
clever MDX and building aggregations. The key question is: how large are the
time ranges that your users typically select? If they are larger than two or
three months then you could do the following:
1) Build some extra levels into your time dimension, maybe by grouping two
months together into a level of six members, and then grouping these members
together into twos making a level of three members. You should hide these
levels from the user.
2) In your MDX, whenever you have a time range that crosses a set of months
that all belong to one common parent, then instead of summing up the
individual months then include the common parent then substitute the common
parent instead. So, for example, imagine you were querying a range of months
1-9; instead of summing up nine months, if you had the dimension structure
proposed in (1) you would end up summing three members: month 9 and the
higher level members which are the common parents of months 1-4 and months
5-8. I wouldn't recommend trying to do this substitution in MDX - it will be
much easier to do in code in your front-end!
3) Build aggregations at these new higher levels (and the year level) so
that when your query hits them, it doesn't just go ahead and sum the data
from the month level anyway. The result of this will be that the majority of
the summing that you're doing at query time will no longer be necessary,
because most of it will already have been done when you're building the
aggregations. Using the example above, a sum of 9 members will be turned into
a sum of three members, hopefully meaning that the query only takes 1/3 of
the time.

Incidentally, are you connecting to the server 'natively' or using http? The
latter is less efficient than the former, and you should only use it if you
need to; there's also a Windows 2003 bug relating to it than can impact
performance too.

However, my overall feeling is that I'm not surprised your queries take 2
minutes if they return large amounts of data and need to do summing up as
well. Any report that returns 120*12000 cells isn't what I would define as an
OLAP-style report, and might well perform better in a relational database!
Are you sure you can't redesign the app to get it to return data in smaller
queries?

Regards,

Chris





"Tom VdP" wrote:


Hi Chris, Jamie, rest of the world,

Sorry for not responding earlier. Here is a roundup of the situation.

First of all, a bit of info about the front-end. The front-end is written
in Excel VBA because the customer wants quite a bit of specific behaviour and
layout. Rows and columns can be collapsed/expanded, context menus appear on
right-click that use the specific cell-location info (e.g. filtering, linking
to another report, sort options), popup windows (mainly for entering report
parameters), ... The front-end creates the necessary MDX dynamically.
This development costed about 2 manmonths of work.

Cubes:
There are 5 cubes, 2 virtual cubes, largest cube has 7 dimensions, 5
measures, 2 calculated measures. Dimension security in about 30 groups. The
amount of data is very little: 2 million fact rows. The cubes fit in 10Mb
files.

Necessity for optimisation:
Some reports return a lot of data: 12.000 rows x 120 columns. Loading this
in Excel is slow, but on top of that getting the data out of the cube also
takes a quite some time: up to 2 minutes. The returned cells are sums over
variable periods, so these must be calculated ad hoc. Loading the data and
formatting takes another minute or so.


About the optimisation proces:
First of all, I did read and tried to apply the Performance Guide. So the
tweaking with connection string options and so was done. Also the usage
based optimisation and other settings server side were applied. Latest
service packs applied everywhere.

Extra things I did because of the hints Chris gave:

1) MDX: use the NECJ version instead of the Filter() version
2) MDX : work away the Extract()
Indeed, I think I got confused by the examples in BOL and the fact that NECJ
must not necessarily return a typical crossjoined set. It can be a set
consisting of members of 1 dimension only.

--> result : no measurable differences


3) Performance monitor: measure DSN Used & DSN Requested.
VERY interesting! I noticed quite some room for uptimisation here!
So I added aggregations manually via the Partition Manager tool (for those
looking for it: it is available in the SQLServer2000 Resource Kit).
All queries that were logged had the same level DSN Used & DSN Requested.

--> result: disappointing! again no measurable difference.

(Sidenote question: how do you log ALL DSN Used & DSN Requested ? The
perfmon only polls once/second. And the querylog.mdb does not contain DSN
Used. What a mistake!)

4) Partition the cubes
Here I must admit that I still do not grasp the finesses of correct
partitioning... I partioned over months (-> 284 partitions!?)

--> result: much worse performance.


What I did not yet investigate is Cache Policy & Cache Ratio.


So all in all a disappointing result. Yet I must say that I am very glad I
did all this, because it makes me feel more confident when facing the
customer.


Still pending: why is it so much slower in a client/server environment than
in a single-pc setup ? Could it be because of the dimension security ?


Many thanks (especially to Chris!)
Tom


"Chris Webb" wrote:

Oh, and one other important point - make sure you have SP3 PTS installed on
your client, it can sometimes make a big difference with performance.

"Chris Webb" wrote:

Hi Tom,

First of all, thanks for the detailed explanation. It makes it much easier
to suggest solutions...

On the modelling side, you might be interested to read the following posting
on how I generally approach this kind of problem:
http://groups.google.co.uk/groups?hl...40TK2MSFTNGP12
With a little bit of work you would be able to adapt this to your own
scenario, which might allow you to run these kind of queries from any
front-end; was this requirement the reason why you built your own front-end?

Also, does your customer dimension have any levels or is it just a flat
list? If the latter, I would strongly recommend putting in two or three
levels to group the customers even if you don't have a business need for them
and end up making them invisible. It should also help.

Regarding MDX, the most efficient queries are always going to be the ones
using NONEMPTYCROSSJOIN rather than FILTER. Looking at your second query, you
might find that performance improves if you get rid of the unnecessary
EXTRACTs and put the contents of your WHERE clause into the NECJ function
too. Here's what it might look like (nb might be syntax errors!):

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
NonEmptyCrossjoin([DimClient].Members,{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1))'

set [x-axis-products]
as 'Hierarchize(
Generate(
NonEmptyCrossjoin([DimProduct].[SubProduct].members,
{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1)
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent})'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


However, I'm not sure that any of the above will make a real difference to
performance, which is your biggest issue. If you've read the AS Performance
Guide you've probably already got a good partitioning strategy (by month I
assume) and run the Usage-Based Optimisation wizard. However, I think the
main reason you've got poor performance is that your queries run at the
Client level, which has 12000 members and is unlikely to have any
aggregations associated with it because of the 1/3 rule. In this case it
might be worth building your own aggregations at the Client level using the
Partition Manager tool available in the SQL 2000 Resource Kit, and seeing if
that improves matters (although it would increase the size of the cube and
processing times). For hints on how to know which aggregations to build when
you're doing it manually, have a look at this thread:
http://groups.google.co.uk/groups?hl...com%26rnum%3D1

...and the following, which contains a better explanation of how to
understand the query log:
http://www.mosha.com/msolap/articles/querylog.htm

Because you have control over your front-end, it might also be worth playing
around with some connection string properties. The infamous Execution
Location/Default Isolation Mode pairing is always worth a try, as is trying
different values for Large Level Threshold. Also, in your case, try values of
0 to 7 for the Cache Policy property, and try values of 0.01 for CacheRatio
and CacheRatio2.

Finally, my last observation is that any report where you expect to get up
to 12000 rows and 120 columns is always going to perform slowly, and is more
data than anyone can take in. It needs to be broken down into more digestible
chunks --> smaller queries!

HTH,

Chris



"Tom VdP" wrote:


[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
.. x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom



Reply With Quote
  #8  
Old   
Tom VdP
 
Posts: n/a

Default Re: Better approach for common business problem ? (long post) - 10-17-2004 , 12:32 PM




Hi Chris!

Many thanks for all the hints.

Another Sunday afternoon of testing and implementing your suggested
changes has passed... This time with noticeable performance
increases!

Let me continue to use the numbering from the previous posts:
1)2)3) The DSN Requested wasn't optimized at all. It was after adding
the extra aggregates using the partition manager that DSN Requested
and DSN Used were (almost) equal. Yet, to my surprise, this did not
result in a significant performance increase.

The "trick" you propose to pre-aggregate sums using an extra level in
the time dimension looks very promising (I cannot test on a similar
environment as at the client's premises: tests are run at home). A
rather small query for months 1 to 8 that returns "only" 2530 lines x
16 columns results in 3445 queries being logged at the server. The
same query but now using quarters Q1 to Q2 results in only 1483
queries, almost halve as much! For some queries (other slicer
selection) the impact on performance is huge (22 mins down to 2 mins).
For most queries it is more like a factor between 20% to 50% though.
And then this was tested for months that fit nicely on quarter
boundaries (jan to aug = q1 to q2), so in real life it will be a bit
less.

The member substitution in MDX you wrote here is quite impressive :-)
I think we can can probably think of a way to do it in the front end
though.

To answer your questions and remarks:
- yes, we use a "native" connection;
- indeed, from a BI-perspective having a report with 12000 lines is
completely absurd. Yet I cannot convince the customer...
- partitioning: I confirm it has a negative impact in this case.
Isn't this normal ? What is the use of partitioning over months if
every result cell is a sum over months ? Then for each result cell a
"partition boundary" must be crossed... Ok, that is my intuitive
feeling :-) Hm, come to think of it: I should probably have tested
partitioning over quarters with the new setup.


Summary (for those following this thread) of the optimisations:
- extra aggregates: used DSN Requested & DSN Used performance counters
to monitor aggregates usage, used partition manager to add the missing
aggregates;
- extra aggregates in the time dimension: added an extra level
"quarter" on top of the month level, used partition manager to add the
aggregates on that level;
- MDX optimisation: use NonEmptyCrossjoin instead of filtering (minor
impact).


Regards,
Tom



On Fri, 15 Oct 2004 02:42:28 -0700, "Chris Webb"
<OnlyForPostingToNewsgroups (AT) hotmail (DOT) com> wrote:

Quote:
If you're interested, here's the MDX to do the member substitution I talked
about in my last post, using the Foodmart Sales cube. It's a worked example
showing how an input set of {[Time].[1997].[Q1].[1]:[Time].[1998].[Q2].[5]}
can be transformed into a set containing {[Time].[1997], [Time].[1998].[Q1],
[Time].[1998].[Q2].[4], [Time].[1998].[Q2].[5]} - which, as I said, would be
much quicker to SUM if you had built the right aggregations on your cube.

WITH
/*The initial time range expressed in months*/
SET MYRANGE AS '{[Time].[1997].[Q1].[1]:[Time].[1998].[Q2].[5]}'
/*Find out which whole years are present*/
SET YEARSPRESENT AS '
/*Loop over all Years from the ancestor of the first month in the range to
the ancestor of the last month*/
GENERATE(
{ANCESTOR(MYRANGE.ITEM(0), [Time].[Year]) : ANCESTOR(TAIL(MYRANGE,
1).ITEM(0), [Time].[Year])}
/*Need to use nasty string functions to get IIF to return members or sets.
Yukon does not have this restriction!*/
, STRTOSET(
/*If all of the descendants at month level of the current year appear in the
range, we can replace them with the year member*/
IIF(COUNT(INTERSECT(MYRANGE, DESCENDANTS(TIME.CURRENTMEMBER,
[Time].[Month]))) = COUNT(DESCENDANTS(TIME.CURRENTMEMBER, [Time].[Month]))
, "{" + MEMBERTOSTR(TIME.CURRENTMEMBER) + "}"
, "{}")
)
)'
/*As an intermediate step, remove all the months from the initial range
which will be replaced by whole years*/
SET MONTHSLEFT AS 'EXCEPT(MYRANGE, DESCENDANTS(YEARSPRESENT,
[Time].[Month]))'
/*Now go through all the remaining months and see which ones have whole
quarters present, using the same logic as above*/
SET QUARTERSPRESENT AS 'GENERATE(
{ANCESTOR(MONTHSLEFT.ITEM(0), [Time].[Quarter]) : ANCESTOR(TAIL(MONTHSLEFT,
1).ITEM(0), [Time].[Quarter])}
, STRTOSET(
IIF(COUNT(INTERSECT(MONTHSLEFT, DESCENDANTS(TIME.CURRENTMEMBER,
[Time].[Month]))) = COUNT(DESCENDANTS(TIME.CURRENTMEMBER, [Time].[Month]))
, "{" + MEMBERTOSTR(TIME.CURRENTMEMBER) + "}"
, "{}")
)
)'
/*Finally return the new range, with years and quarters substituted for
months where possible*/
SET MYNEWRANGE AS '{YEARSPRESENT, QUARTERSPRESENT, EXCEPT(MYRANGE,
DESCENDANTS({YEARSPRESENT, QUARTERSPRESENT}, [Time].[Month]))}'
/*Helper calculated members to display the steps*/
MEMBER MEASURES.STEP1 AS 'SETTOSTR(MYRANGE)'
MEMBER MEASURES.STEP2 AS 'SETTOSTR(YEARSPRESENT)'
MEMBER MEASURES.STEP3 AS 'SETTOSTR(MONTHSLEFT)'
MEMBER MEASURES.STEP4 AS 'SETTOSTR(QUARTERSPRESENT)'
MEMBER MEASURES.STEP5 AS 'SETTOSTR(MYNEWRANGE)'
/*Show working*/
SELECT {MEASURES.STEP1, MEASURES.STEP2, MEASURES.STEP3, MEASURES.STEP4,
MEASURES.STEP5} ON 0
FROM SALES


Let me know if you have any questions...

"Chris Webb" wrote:

Hi Tom,

First, I'll respond to your numbered points:
1,2) No surprise here that it didn't make much difference, unfortunately. It
was worth a try though.
3) If DSN Requested is always the same as DSN Used, then you're always
hitting an aggregation and there's no point building any more. However, can
you tell me how many subqueries each query results in? You can find this out
by looking at the Query Number counter. Lots of subqueries can sometimes mean
a certain network overhead, and the Cache Ratio and Cache Policy properties
can improve this. There is, as you've noticed, no satisfactory way of looking
at the DSN Used counter; I wrote a simple .NET app once which did this fairly
well, but not 100% accurately. Also, if lots of subqueries are generated then
you will find that turning off query logging on the server may have a
noticeable effect on query performance.
4) The fact that you got worse performance after partitioning is very
strange. Did you set the slice property on each partition properly? You can
check this by running the Edit wizard on the partition in Analysis Manager -
you set it on the second step. Partitioning for this kind of app should
improve performance, and I can only think that if you did set the slice then
the effect of reprocessing and therefore dropping the cache led to the
worsened performance.

The fact you have dimension security on your cube makes me think you might
not have as much memory available on the server for caching as you could. Can
you give us some more details on which dimensions you have it set up on? Is
it your client dimension?

Dimension security won't explain the difference between a client-server and
single PC setup - the issue there will be how much work is done on the
client, and how much data is moved over the network. What do you see in Task
Manager and the client and the server when you run queries in the
client-server setup? I suspect that the real cause of the performance problem
is that you have to sum up large amounts of data across your variable time
ranges, something which is always going to take a lot of time; if this is
taking place on the client rather than the server, then it will take even
longer.

Leading on from this topic, since you have control over the MDX used you
might be able to cut the amount of time needed for this summing by writing
clever MDX and building aggregations. The key question is: how large are the
time ranges that your users typically select? If they are larger than two or
three months then you could do the following:
1) Build some extra levels into your time dimension, maybe by grouping two
months together into a level of six members, and then grouping these members
together into twos making a level of three members. You should hide these
levels from the user.
2) In your MDX, whenever you have a time range that crosses a set of months
that all belong to one common parent, then instead of summing up the
individual months then include the common parent then substitute the common
parent instead. So, for example, imagine you were querying a range of months
1-9; instead of summing up nine months, if you had the dimension structure
proposed in (1) you would end up summing three members: month 9 and the
higher level members which are the common parents of months 1-4 and months
5-8. I wouldn't recommend trying to do this substitution in MDX - it will be
much easier to do in code in your front-end!
3) Build aggregations at these new higher levels (and the year level) so
that when your query hits them, it doesn't just go ahead and sum the data
from the month level anyway. The result of this will be that the majority of
the summing that you're doing at query time will no longer be necessary,
because most of it will already have been done when you're building the
aggregations. Using the example above, a sum of 9 members will be turned into
a sum of three members, hopefully meaning that the query only takes 1/3 of
the time.

Incidentally, are you connecting to the server 'natively' or using http? The
latter is less efficient than the former, and you should only use it if you
need to; there's also a Windows 2003 bug relating to it than can impact
performance too.

However, my overall feeling is that I'm not surprised your queries take 2
minutes if they return large amounts of data and need to do summing up as
well. Any report that returns 120*12000 cells isn't what I would define as an
OLAP-style report, and might well perform better in a relational database!
Are you sure you can't redesign the app to get it to return data in smaller
queries?

Regards,

Chris





"Tom VdP" wrote:


Hi Chris, Jamie, rest of the world,

Sorry for not responding earlier. Here is a roundup of the situation.

First of all, a bit of info about the front-end. The front-end is written
in Excel VBA because the customer wants quite a bit of specific behaviour and
layout. Rows and columns can be collapsed/expanded, context menus appear on
right-click that use the specific cell-location info (e.g. filtering, linking
to another report, sort options), popup windows (mainly for entering report
parameters), ... The front-end creates the necessary MDX dynamically.
This development costed about 2 manmonths of work.

Cubes:
There are 5 cubes, 2 virtual cubes, largest cube has 7 dimensions, 5
measures, 2 calculated measures. Dimension security in about 30 groups. The
amount of data is very little: 2 million fact rows. The cubes fit in 10Mb
files.

Necessity for optimisation:
Some reports return a lot of data: 12.000 rows x 120 columns. Loading this
in Excel is slow, but on top of that getting the data out of the cube also
takes a quite some time: up to 2 minutes. The returned cells are sums over
variable periods, so these must be calculated ad hoc. Loading the data and
formatting takes another minute or so.


About the optimisation proces:
First of all, I did read and tried to apply the Performance Guide. So the
tweaking with connection string options and so was done. Also the usage
based optimisation and other settings server side were applied. Latest
service packs applied everywhere.

Extra things I did because of the hints Chris gave:

1) MDX: use the NECJ version instead of the Filter() version
2) MDX : work away the Extract()
Indeed, I think I got confused by the examples in BOL and the fact that NECJ
must not necessarily return a typical crossjoined set. It can be a set
consisting of members of 1 dimension only.

--> result : no measurable differences


3) Performance monitor: measure DSN Used & DSN Requested.
VERY interesting! I noticed quite some room for uptimisation here!
So I added aggregations manually via the Partition Manager tool (for those
looking for it: it is available in the SQLServer2000 Resource Kit).
All queries that were logged had the same level DSN Used & DSN Requested.

--> result: disappointing! again no measurable difference.

(Sidenote question: how do you log ALL DSN Used & DSN Requested ? The
perfmon only polls once/second. And the querylog.mdb does not contain DSN
Used. What a mistake!)

4) Partition the cubes
Here I must admit that I still do not grasp the finesses of correct
partitioning... I partioned over months (-> 284 partitions!?)

--> result: much worse performance.


What I did not yet investigate is Cache Policy & Cache Ratio.


So all in all a disappointing result. Yet I must say that I am very glad I
did all this, because it makes me feel more confident when facing the
customer.


Still pending: why is it so much slower in a client/server environment than
in a single-pc setup ? Could it be because of the dimension security ?


Many thanks (especially to Chris!)
Tom


"Chris Webb" wrote:

Oh, and one other important point - make sure you have SP3 PTS installed on
your client, it can sometimes make a big difference with performance.

"Chris Webb" wrote:

Hi Tom,

First of all, thanks for the detailed explanation. It makes it much easier
to suggest solutions...

On the modelling side, you might be interested to read the following posting
on how I generally approach this kind of problem:
http://groups.google.co.uk/groups?hl...40TK2MSFTNGP12
With a little bit of work you would be able to adapt this to your own
scenario, which might allow you to run these kind of queries from any
front-end; was this requirement the reason why you built your own front-end?

Also, does your customer dimension have any levels or is it just a flat
list? If the latter, I would strongly recommend putting in two or three
levels to group the customers even if you don't have a business need for them
and end up making them invisible. It should also help.

Regarding MDX, the most efficient queries are always going to be the ones
using NONEMPTYCROSSJOIN rather than FILTER. Looking at your second query, you
might find that performance improves if you get rid of the unnecessary
EXTRACTs and put the contents of your WHERE clause into the NECJ function
too. Here's what it might look like (nb might be syntax errors!):

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
NonEmptyCrossjoin([DimClient].Members,{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1))'

set [x-axis-products]
as 'Hierarchize(
Generate(
NonEmptyCrossjoin([DimProduct].[SubProduct].members,
{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1)
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent})'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


However, I'm not sure that any of the above will make a real difference to
performance, which is your biggest issue. If you've read the AS Performance
Guide you've probably already got a good partitioning strategy (by month I
assume) and run the Usage-Based Optimisation wizard. However, I think the
main reason you've got poor performance is that your queries run at the
Client level, which has 12000 members and is unlikely to have any
aggregations associated with it because of the 1/3 rule. In this case it
might be worth building your own aggregations at the Client level using the
Partition Manager tool available in the SQL 2000 Resource Kit, and seeing if
that improves matters (although it would increase the size of the cube and
processing times). For hints on how to know which aggregations to build when
you're doing it manually, have a look at this thread:
http://groups.google.co.uk/groups?hl...com%26rnum%3D1

...and the following, which contains a better explanation of how to
understand the query log:
http://www.mosha.com/msolap/articles/querylog.htm

Because you have control over your front-end, it might also be worth playing
around with some connection string properties. The infamous Execution
Location/Default Isolation Mode pairing is always worth a try, as is trying
different values for Large Level Threshold. Also, in your case, try values of
0 to 7 for the Cache Policy property, and try values of 0.01 for CacheRatio
and CacheRatio2.

Finally, my last observation is that any report where you expect to get up
to 12000 rows and 120 columns is always going to perform slowly, and is more
data than anyone can take in. It needs to be broken down into more digestible
chunks --> smaller queries!

HTH,

Chris



"Tom VdP" wrote:


[ This article is also available in a better readable format here:
http://users.pandora.be/see/mdx/news.htm ]


Given a cube with the following properties:
- a measure Amount
- a time dimension DimTime with a granularity of months (only Years and
Months, no lower levels)
- a client dimension DimClient
- a product dimension DimProduct (consisting of a level Product and a level
SubProduct)
- a few other dimensions

Consider the following requirement:
A report must be made that shows the total Amount in a given period compared
to the total Amount in the same period one year ago for each

client for each product.

The report has the following layout:
(hopefully you read this in a non-proportional font, if not consider the
link mentioned above)


Product A SubProduct A Product B
..
SumPeriod SumPeriod Y-1 SumPeriod SumPeriod Y-1 SumPeriod
SumPeriod Y-1 .. ..
All DimClient x y x y x
y x y
Client 1 x y x y x
y x y
Client 2 x y x y x
y x y
.. x y x y x
y x y


In pseudo-MDX: each x above = Sum(DimTime.startmonthimTime.endmonth,
Measures.[Amount]).
The start and end dates of the period are parameters: they are entered by
the user and injected into the MDX. This makes it impossible to

aggregate the sums beforehand.

The report should only include those clients that have at least one
(sub)product. Also only those (sub)products that are associated with

at least one client should be shown.

Because of the nature of this report a lot of data will be presented to the
user. The dimension DimClient contains 12000 elements,

DimProducts contains 60 elements. Depending on the chosen period and
possibly other slicers (among other dimensions) the report may

return up to 12000 rows and 120 columns.

This looks like a pretty standard report to me, yet I could not find an
MDX-pattern for this type of problem on the net. Neither did I

find a hint in Spofford's book.

Here are the 2 approaches I devised myself.


MDX 1 - Selection based on Filter()
-----------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

member [Measures].[AllClientPeriodSums] as '([DimClient].[All
DimClient],[Measures].[SumPeriod])+([DimClient].[All

DimClient],[Measures].[SumPeriod Y-1])'

select
{
Crossjoin(
Hierarchize(
Filter(
[DimProduct].[SubProduct].members + [DimProduct].[Product].members
,not(isempty([Measures].[AllClientPeriodSums]))))
,{[Measures].[SumPeriod],[Measures].[SumPeriod Y-1]})
} on columns,
{
Filter([DimClient].members,([Measures].[SumPeriod]<>0 or
[Measures].[SumPeriod Y-1]<>0))
} on rows
from TheCube
where ([DimTime].[2004],[DimSomeOtherDimension].[SomeMember])


The above in words:
- Filter all Products and SubProducts for which the "All" level in DimClient
has a non-empty value for the sums over the sets

SelectedMonths or PrevSelectedMonths. If that value is non-empty, at least
1 client has a value for that (sub)product, so that

(sub)product must be retained.
- Filter all Clients based on the test if the sums are <> 0. (this should
be: not(isempty(SumPeriod) or isempty(SumPeriod Y-1)), but due

to extra requirements of our users it is implemented like this).



MDX 2 - Selection based on NonEmptyCrossjoin()
----------------------------------------------

with
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimClient].Members,{[Measures].[Amount]},2)
,[DimClient]))'

set [x-axis-products]
as 'Hierarchize(
Generate(
Extract(

NonEmptyCrossjoin([SelectedMonths]+[PrevSelectedMonths],[DimProduct].[SubProduct].members,{[Measures].[Amount]},2)
,[DimProduct])
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent}))'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


The above in words:
- Find non-empty crossings between the period sets and clients based on the
Amount measure. From this set, extract the Clients and

hierarchize. These are the clients that go on the Y-axis of the report.
- Find non-empty crossings between the period sets and subproducts based on
the Amount measure. Extract the Product members from that

set. Add the parents, i.e. the Products and hierarchize. These are the
products that will go on the X-axis of the report.


Questions
---------

1) Both solutions are rather slow. Is there an MDX 3 I should implement
instead ? I there a better approach ? I am rather new to MDX

and I find it dazzling at times...

2) For some slicer elements (the [DimSomeOtherDimension].[SomeMember] above)
the report is up to 60% slower than without any slicer

element at all! Without the where-clause in the MDX the report executes
faster than for MDX with a where-clause that limits the data to

some 70%. What is happening ??

3) Is there a way to incorporate via cube-design a better/faster solution ?
Keep in mind that both start and end months are variables.

4) [May be unrelated to this particular report] I noticed that even on a
medium sized desktop pc that runs Analysis Services locally the

performance is way better than in a client/server setup with a high-end
desktop and a high-end server... This has me baffled. The

network usage is low and never a bottleneck. The server is dedicated and
there are no other users executing reports. We use Excel VBA

with ADO MD to get to our data. I did follow all steps in the Microsoft
whitepaper on Analysis Services performance.

Many, many thanks for reading up to this point!
Many more for all helpful information and remarks.
Tom




Reply With Quote
  #9  
Old   
Chris Webb
 
Posts: n/a

Default Re: Better approach for common business problem ? (long post) - 10-18-2004 , 04:29 AM



Hi Tom,

I'm glad to hear we're making progress...!

First of all, regarding the aggregates, the fact that performance didn't
increase very much when you built the new ones only strengthens my opinion
that your main problem is the amount of summing up you need to do. The fact
that you are generating thousands of subqueries is also significant - if you
haven't tried these properties already in your connection string, try using
the following together or separately:
Cache Ratio=0.01
Cache Ratio2=0.01
Cache Policy = 5 or 6 or 7 (in theory 7 should be the one that makes the
difference)

I've found that various combinations of these reduce the number of
subqueries sent to the server, and therefore speed up overall query
performance. And also, as I think I said in an earlier mail, make sure you
have turned off query logging on the server - with this many subqueries,
logging every single one can introduce a fairly big overhead.

Regarding the extra levels, the reason I proposed two hidden levels rather
than a single quarter level between month and year is that it should increase
the opportunities you have for optimisation - you have two sets of
aggregations you could hit instead of one. But this is something you'll need
to test yourself to see if the impact is worth the extra work.

Last of all, partitioning should always have positive impact on this type of
query, especially if you are partitioning by month. By setting the data slice
property on a partition in your scenario, you're saying to the AS engine that
instead of having to search through the whole cube for data for the month
January 1999 (for example), it can go directly to the partition you have
specified. Therefore the amount of time actually taken to read the data off
disk is greatly reduced. There are two possible reasons why you're seeing a
performance decrease, in my opinion:
1) You haven't set the data slice property properly (although I'm sure you
have!)
2) When you reprocessed the cube to include the new partitions, and reran
your test queries, the cache on the server was empty (as a result of the
reprocess) and you were comparing this performance with queries run on your
old cube when the cache was pre-populated with useful data. What I generally
do is stop and start the AS service and close and reopen the client
connection before each query run when I'm doing performance testing to try to
reduce the impact of caching as much as possible.
However, as we already established when looking at your aggregations, it's
the summing up of data and not the reading from disk that is your problem, so
I would imagine that any benefit partitioning will bring you will be
relatively minor. I would definitely keep it in your cube design though.

Regards,

Chris





Reply With Quote
  #10  
Old   
Chris Webb
 
Posts: n/a

Default Re: Better approach for common business problem ? (long post) - 10-19-2004 , 03:55 AM



At the risk of filling up your weekends until Xmas, I've thought of another
thing you might want to try - this query might be one of the few cases where
a WITH CACHE clause makes a difference. This populates the cache on the
client with data before the query is run. Try adding something like the
following to the beginning of your query (using the query I gave you a few
posts ago as a template):

with
//STILL NEEDS TO BE OPTIMISED
set [SelectedMonths] as '[DimTime].&[2004_01]:[DimTime].&[2004_08]'
set [PrevSelectedMonths] as

'ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_01]):ParallelPeriod([DimTime].[Year],1,[DimTime].&[2004_08])'

member [Measures].[SumPeriod]
as 'sum([SelectedMonths], [Measures].[Amount])'

member [Measures].[SumPeriod Y-1]
as 'sum([PrevSelectedMonths], [Measures].[Amount])'

set [y-axis]
as 'Hierarchize
NonEmptyCrossjoin([DimClient].Members,{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1))'

set [x-axis-products]
as 'Hierarchize(
Generate(
NonEmptyCrossjoin([DimProduct].[SubProduct].members,
{[SelectedMonths]+[PrevSelectedMonths]},{[Measures].[Amount]},{[DimSomeOtherDimension].[SomeMember]},1)
,{[DimProduct].CurrentMember,[DimProduct].CurrentMember.Parent})'

//NEW ADDITION TO QUERY
CACHE AS '({SELECTEDMONTHS, PREVSELECTEDMONTHS}, [x-axis-products],
[y-axis], {[DimSomeOtherDimension].[SomeMember]})'

select
{
Crossjoin([x-axis-products], {[Measures].[SumPeriod],
[Measures].[SumPeriod Y-1]})
} on columns,
{
[y-axis]
} on rows
from TheCube
where ([DimSomeOtherDimension].[SomeMember])


I would try this both with and without the connection string properties I
mentioned in the last post.

HTH,

Chris

"Chris Webb" wrote:

Quote:
Hi Tom,

I'm glad to hear we're making progress...!

First of all, regarding the aggregates, the fact that performance didn't
increase very much when you built the new ones only strengthens my opinion
that your main problem is the amount of summing up you need to do. The fact
that you are generating thousands of subqueries is also significant - if you
haven't tried these properties already in your connection string, try using
the following together or separately:
Cache Ratio=0.01
Cache Ratio2=0.01
Cache Policy = 5 or 6 or 7 (in theory 7 should be the one that makes the
difference)

I've found that various combinations of these reduce the number of
subqueries sent to the server, and therefore speed up overall query
performance. And also, as I think I said in an earlier mail, make sure you
have turned off query logging on the server - with this many subqueries,
logging every single one can introduce a fairly big overhead.

Regarding the extra levels, the reason I proposed two hidden levels rather
than a single quarter level between month and year is that it should increase
the opportunities you have for optimisation - you have two sets of
aggregations you could hit instead of one. But this is something you'll need
to test yourself to see if the impact is worth the extra work.

Last of all, partitioning should always have positive impact on this type of
query, especially if you are partitioning by month. By setting the data slice
property on a partition in your scenario, you're saying to the AS engine that
instead of having to search through the whole cube for data for the month
January 1999 (for example), it can go directly to the partition you have
specified. Therefore the amount of time actually taken to read the data off
disk is greatly reduced. There are two possible reasons why you're seeing a
performance decrease, in my opinion:
1) You haven't set the data slice property properly (although I'm sure you
have!)
2) When you reprocessed the cube to include the new partitions, and reran
your test queries, the cache on the server was empty (as a result of the
reprocess) and you were comparing this performance with queries run on your
old cube when the cache was pre-populated with useful data. What I generally
do is stop and start the AS service and close and reopen the client
connection before each query run when I'm doing performance testing to try to
reduce the impact of caching as much as possible.
However, as we already established when looking at your aggregations, it's
the summing up of data and not the reading from disk that is your problem, so
I would imagine that any benefit partitioning will bring you will be
relatively minor. I would definitely keep it in your cube design though.

Regards,

Chris





Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.