dbTalk Databases Forums  

median as a derived measure

microsoft.public.sqlserver.olap microsoft.public.sqlserver.olap


Discuss median as a derived measure in the microsoft.public.sqlserver.olap forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
ObscuroMondo
 
Posts: n/a

Default median as a derived measure - 09-12-2005 , 08:42 AM






Greetings all,

I'm an AS n00b and my ignorance is profound. Hope somebody here can help me
out.

We're trying to speed up generation of sales reports for a run-of-the-mill
POS system, nothing fancy. We're using Reporting Services and are now
calculating median item selling price at report runtime by means of a T-SQL
UDF. This UDF calculates the median price at each level of the org unit
hierarchy: store, municipality, district, region, and enterprise. With
millions of POS transaction records and thousands of inventory items, this
takes a long, long time and users are complaining.

We hoped that by basing the reports on results obtained from Analysis
Services we could move from calculating the median at runtime to getting it
from a derived measure calculated at cube processing time. We have a sort-of
solution for a calculated measure although only the leaf level gets the
correct result:

Median(NonEmptyCrossJoin(Descendants([Sales].CurrentMember,,LEAVES),
Descendants([Stores].CurrentMember,,LEAVES)), [Measures].[Item Price])

Based on my experimentation, only T-SQL functions can appear as part of a
derived measure. SUM, AVG and the like are permitted, but UDFs cannot be
used in the definition of a derived member, correct?

We want to use AS to precalculate median and other statistical measures so
that we're only doing lookups at report generation time. Can this be done?
If so, how?

Many thanks for your time and help
OM


Reply With Quote
  #2  
Old   
SQL McOLAP
 
Posts: n/a

Default RE: median as a derived measure - 09-12-2005 , 11:29 AM






OM -

Median is tough no matter what the technology is because of what it has to
do. All detail records have to be sorted first, and then it has to find the
middle record's value (avg two records if two are in the middle).

Unlike Oracle, there's no native MEDIAN function in SQL Server, but I've
seen MEDIAN be pretty slow in large Oracle tables as well. I'm guessing your
UDF is doint all the MEDIAN work.

My questions to start are this:

Do you want the median of the item's price at the lowest level (store), no
matter what? And is time part of this equation as well? Does it have to be
by day, too? I'm guessing you would because if you took it at a higher level
and it's a sum aggregate, those prices would be summed up the hierarchy, and
I'm guessing that would be incorrect. The problem you'll have with this, is
that the higher up in the hierarchies you go, the more records will have to
be dynamically sorted for the MEDIAN function to process, and this could
really chug.

Also, what is your "Sales" dimension here? My hope is that you don't have an
invoice or account type dimension, because they really aren't beneficial for
OLAP analysis. Is it the product dimension you mentioned?

There are a couple of things I'd suggest, some on the SQL side, some on the
OLAP side, and a weird combination if my hunch follows what you're doing.

Let me know the answers to the above questions, and I'll let you know what I
was thinking.

- Phil


"ObscuroMondo" wrote:

Quote:
Greetings all,

I'm an AS n00b and my ignorance is profound. Hope somebody here can help me
out.

We're trying to speed up generation of sales reports for a run-of-the-mill
POS system, nothing fancy. We're using Reporting Services and are now
calculating median item selling price at report runtime by means of a T-SQL
UDF. This UDF calculates the median price at each level of the org unit
hierarchy: store, municipality, district, region, and enterprise. With
millions of POS transaction records and thousands of inventory items, this
takes a long, long time and users are complaining.

We hoped that by basing the reports on results obtained from Analysis
Services we could move from calculating the median at runtime to getting it
from a derived measure calculated at cube processing time. We have a sort-of
solution for a calculated measure although only the leaf level gets the
correct result:

Median(NonEmptyCrossJoin(Descendants([Sales].CurrentMember,,LEAVES),
Descendants([Stores].CurrentMember,,LEAVES)), [Measures].[Item Price])

Based on my experimentation, only T-SQL functions can appear as part of a
derived measure. SUM, AVG and the like are permitted, but UDFs cannot be
used in the definition of a derived member, correct?

We want to use AS to precalculate median and other statistical measures so
that we're only doing lookups at report generation time. Can this be done?
If so, how?

Many thanks for your time and help
OM


Reply With Quote
  #3  
Old   
ObscuroMondo
 
Posts: n/a

Default RE: median as a derived measure - 09-12-2005 , 01:25 PM



"SQL McOLAP" wrote:

/* outstanding moniker */

Quote:
OM -

Median is tough no matter what the technology is because of what it has to
do. All detail records have to be sorted first, and then it has to find the
middle record's value (avg two records if two are in the middle).
Many thanks for the rapid reply, Phil. I've watched our UDF run on a dev
PC, over a measly 100,000 transaction records and 2,000 SKUs, for 22 minutes.
It's a lot of work.

Quote:
Unlike Oracle, there's no native MEDIAN function in SQL Server, but I've
seen MEDIAN be pretty slow in large Oracle tables as well. I'm guessing your
UDF is doint all the MEDIAN work.
You are correct, sir. Me and the rest of the world wish there was a native
median in T-SQL.

Quote:
My questions to start are this:

Do you want the median of the item's price at the lowest level (store), no
matter what?
Yes.

Quote:
And is time part of this equation as well? Does it have to be
by day, too? I'm guessing you would because if you took it at a higher level
and it's a sum aggregate, those prices would be summed up the hierarchy, and
I'm guessing that would be incorrect.
Yes, and yes. Prices can vary arbitrarily at the store level even within
the same day. You are also right about the incorrect aggregation, which is
why I wanted to calculate the median before the aggregations are calculated.

Quote:
The problem you'll have with this, is
that the higher up in the hierarchies you go, the more records will have to
be dynamically sorted for the MEDIAN function to process, and this could
really chug.
That is true, and our current org-level-aware UDF is really really slow as
noted above. You watch that little green Reporting Services pinwheel going
around for quite a while, even for small datasets. For real datasets on a
dev box you could go to the Bahamas and back by the time it's done.

Quote:
Also, what is your "Sales" dimension here? My hope is that you don't have an
invoice or account type dimension, because they really aren't beneficial for
OLAP analysis. Is it the product dimension you mentioned?
Each record in the Sales dimension represents a single line item from a
point-of-sale transaction. So for example if you sell 6 1/4-20 2" bolts, you
get a record with the price of a single bolt and the quantity sold, 6 in this
case. This business model is a little maddening because there is no control
(none!) over prices at any but the Store (leaf) level of the org unit
hierarchy, and managers can change prices at any time. Just because you sold
a 1/4-20 bolt for $0.25 10 minutes ago doesn't mean that's what it costs
right now. So you have to get your pricing information from the actual
individual sales transactions. Lotta work, even if the finest Time
granularity is Day, which it is.

I have in essence made the fact table into a dimension, so I can navigate
the sales data. Seemed like a good idea at the time because of the wildly
varying per-item pricing. Will happily admit to bad thinking and ignorance,
just want to get the thing running at acceptable speed, or be convinced that
It Can't Be Done.

Quote:
There are a couple of things I'd suggest, some on the SQL side, some on the
OLAP side, and a weird combination if my hunch follows what you're doing.

Let me know the answers to the above questions, and I'll let you know what I
was thinking.

- Phil

Many thanks once again
OM


Reply With Quote
  #4  
Old   
SQL McOLAP
 
Posts: n/a

Default RE: median as a derived measure - 09-13-2005 , 08:40 AM



OM -

A question to ask the business. Is MEDIAN necessary? Is AVG completely out
of the question? Not to dodge the requirement, but I've consulted at places
where the business swore that MEDIAN was necessary, but after further
analysis found that the difference between AVG and MEDIAN was immaterial,
especially with larger datasets and "normal" data (where the STDEV wasn't
huge). If the business was willing to do some analysis on this data, they
may find the same thing. If so, a simple "count" column with just a "1" in
it could easily be put in a SQL view for your fact table, and then you could
create an easy calc member that divided the price by that count. This will
always be very fast in the cube, and you won't have to worry about different
levels or granularity.

Having said that, the following is an attempt to help you with MEDIAN.

Unfortunately, you're in a real "pay me now, or pay me later" scenario. If
you want MEDIAN to be even remotely performant in reporting over large
amounts of fact data, you'll probably have to dedicate some "up front"
processing to get that MEDIAN, and then store it. And this processing will
take time, as you're seeing.

Find out which levels must absolutely have this median calculation and
create summary SQL tables for that level. For example, if they want
day/store/product, create a SQL summary table for that. If they want
month/store/product, create one for that too. If day/division/product is
desired, etc... This is daunting, don't get me wrong, but due to what MEDIAN
has to do, calculating and storing this value first will make querying much
faster later on. Another problem would be getting the business to agree on
these finite levels beforehand. If they ask for "all of them" which
businesses often do not realizing the process window repurcussions, just do a
few of them and explain how long it took. They'll probably understand.

The above was a SQL solution. On the OLAP side, with your current SQL and
OLAP structure (without the summary tables described above), try finding a
"sweetspot." Allow the median to be done only at a certain level, if
possible. In other words, in an IIF statement in your calc member, only do
the MEDIAN if you're at day/store/product, or month/store/product. You may
find at lower levels the performance isn't "that bad," but test it out.
Maybe month/store/product isn't "that bad" either. I have a feeling, though,
if the data set is large, you might be stuck at anything higher than that
lowest level. Another obvious problem, is that the business may want to see
the median above that level as well. Let me know if you want some sample MDX
for this.

A "combined" solution would be to create the SQL tables in the first
suggestion, and then build cubes off each of them, keeping all other dims the
same, but only enabling the levels germain to that cube for the 3 mentioned
dims. You could then join these cubes in a virtual cube. The problem is
that measure names have to be unique, but you could dance around this by
giving them unique names in their respective cubes (like product day store
median price, etc) and then have a calculated member in the virtual cube that
selects the appropriate one depending on which level the user is on so it's
seamless to the end user. All this would only be necessary if OLAP reporting
has to be done. It sounds like you're already doing SQL reporting for this,
and if that's OK, reporting services reports against the summary tables might
suffice.

I apologize for any disappointment caused by these suggestions, but I just
don't know of any tricks for MEDIAN, being the processing intensive stat that
it is.

Somethings to look at in your UDF, be sure it's using an index to do it's
task. If not put an index on those columns. If one exists, but it's not
being used, kick it in the pants with a hint. If you're cursoring to get
your MEDIAN in the UDF, this should make that returned recordset come back
faster. Not knowing what else goes on in your UDF, I'm not sure what else to
suggest.

Ways to optimize your cube would be to partition it by something queried
often and load balanced. Time's usually a good candidate if the data isn't
too seasonal. Partitions are great for processing (multiple partitions can
be processed in parallel) and great for querying since smaller pieces have to
be evaluated. Be sure to set your slice on each partition, or performance
could actually be worse. In a partitioned cube, the MEDIAN could have many
threads going after smaller amounts of data so that could speed up this
dynamic work as well. Just something else to consider.

Let me know if you have any questions on all this babbling. You have an
interesting challenge.

Good luck.

- Phil

"ObscuroMondo" wrote:

Quote:
"SQL McOLAP" wrote:

/* outstanding moniker */

OM -

Median is tough no matter what the technology is because of what it has to
do. All detail records have to be sorted first, and then it has to find the
middle record's value (avg two records if two are in the middle).

Many thanks for the rapid reply, Phil. I've watched our UDF run on a dev
PC, over a measly 100,000 transaction records and 2,000 SKUs, for 22 minutes.
It's a lot of work.


Unlike Oracle, there's no native MEDIAN function in SQL Server, but I've
seen MEDIAN be pretty slow in large Oracle tables as well. I'm guessing your
UDF is doint all the MEDIAN work.

You are correct, sir. Me and the rest of the world wish there was a native
median in T-SQL.

My questions to start are this:

Do you want the median of the item's price at the lowest level (store), no
matter what?

Yes.

And is time part of this equation as well? Does it have to be
by day, too? I'm guessing you would because if you took it at a higher level
and it's a sum aggregate, those prices would be summed up the hierarchy, and
I'm guessing that would be incorrect.

Yes, and yes. Prices can vary arbitrarily at the store level even within
the same day. You are also right about the incorrect aggregation, which is
why I wanted to calculate the median before the aggregations are calculated.

The problem you'll have with this, is
that the higher up in the hierarchies you go, the more records will have to
be dynamically sorted for the MEDIAN function to process, and this could
really chug.

That is true, and our current org-level-aware UDF is really really slow as
noted above. You watch that little green Reporting Services pinwheel going
around for quite a while, even for small datasets. For real datasets on a
dev box you could go to the Bahamas and back by the time it's done.


Also, what is your "Sales" dimension here? My hope is that you don't have an
invoice or account type dimension, because they really aren't beneficial for
OLAP analysis. Is it the product dimension you mentioned?

Each record in the Sales dimension represents a single line item from a
point-of-sale transaction. So for example if you sell 6 1/4-20 2" bolts, you
get a record with the price of a single bolt and the quantity sold, 6 in this
case. This business model is a little maddening because there is no control
(none!) over prices at any but the Store (leaf) level of the org unit
hierarchy, and managers can change prices at any time. Just because you sold
a 1/4-20 bolt for $0.25 10 minutes ago doesn't mean that's what it costs
right now. So you have to get your pricing information from the actual
individual sales transactions. Lotta work, even if the finest Time
granularity is Day, which it is.

I have in essence made the fact table into a dimension, so I can navigate
the sales data. Seemed like a good idea at the time because of the wildly
varying per-item pricing. Will happily admit to bad thinking and ignorance,
just want to get the thing running at acceptable speed, or be convinced that
It Can't Be Done.

There are a couple of things I'd suggest, some on the SQL side, some on the
OLAP side, and a weird combination if my hunch follows what you're doing.

Let me know the answers to the above questions, and I'll let you know what I
was thinking.

- Phil


Many thanks once again
OM

Reply With Quote
  #5  
Old   
ObscuroMondo
 
Posts: n/a

Default RE: median as a derived measure - 09-13-2005 , 04:47 PM



Many thanks for your most excellent reply. "Babbling," ha! Reply inline.

"SQL McOLAP" wrote:

Quote:
OM -

A question to ask the business. Is MEDIAN necessary? Is AVG completely out
of the question? Not to dodge the requirement, but I've consulted at places
where the business swore that MEDIAN was necessary, but after further
analysis found that the difference between AVG and MEDIAN was immaterial,
especially with larger datasets and "normal" data (where the STDEV wasn't
huge). If the business was willing to do some analysis on this data, they
may find the same thing. If so, a simple "count" column with just a "1" in
it could easily be put in a SQL view for your fact table, and then you could
create an easy calc member that divided the price by that count. This will
always be very fast in the cube, and you won't have to worry about different
levels or granularity.
They want both, but median has been the hard one as you point out. I am
going to do as you suggest and ask them "what benefit do you get from median
that you don't get from average?" You could give them every conceivable
statistical measure and I'm not sure it's going to help them much, given the
nature of the reports they supplied in prototype as Excel spreadsheets.
Their prototype reports are strictly static snapshots of measures calculated
over user-selectable time periods. To draw useful inferences from their
reports you have to generate a collection of instances of a given report, run
over various time periods, and then traverse the collection "by hand" looking
for interesting things.

They would be much better served by running a 3rd-party front end against
well-designed and implemented AS cubes, and I'm pushing that idea, but in the
meantime we need to generate their reports as quickly as possible.

Quote:
Having said that, the following is an attempt to help you with MEDIAN.

Unfortunately, you're in a real "pay me now, or pay me later" scenario. If
you want MEDIAN to be even remotely performant in reporting over large
amounts of fact data, you'll probably have to dedicate some "up front"
processing to get that MEDIAN, and then store it. And this processing will
take time, as you're seeing.
<some SQL McOLAP content elided for brevity>

That's the general approach I've taken. We're trying to pipeline the
statistical calculations, even if it means giving up some numerical
precision. So we've been looking for ways for municipality, district, and
region servers to generate daily stats and include them in the data warehouse
dump each day. While taking the median of a price across Districts to
represent the Region median price is obviously not the same as really
calculating the Region median from the entire sample, is it possibly "close
enough for rock'n'roll?" Of course the business must decide, not us.

Quote:
The above was a SQL solution. On the OLAP side, with your current SQL and
OLAP structure (without the summary tables described above), try finding a
"sweetspot." Allow the median to be done only at a certain level, if
possible. In other words, in an IIF statement in your calc member, only do
the MEDIAN if you're at day/store/product, or month/store/product. You may
find at lower levels the performance isn't "that bad," but test it out.
Maybe month/store/product isn't "that bad" either. I have a feeling, though,
if the data set is large, you might be stuck at anything higher than that
lowest level. Another obvious problem, is that the business may want to see
the median above that level as well. Let me know if you want some sample MDX
for this.
Yes! Yes! Please. That's one of our requirements.

Quote:
A "combined" solution would be to create the SQL tables in the first
suggestion, and then build cubes off each of them, keeping all other dims the
same, but only enabling the levels germain to that cube for the 3 mentioned
dims. You could then join these cubes in a virtual cube.
<some SQL McOLAP content elided for brevity>

Quote:
All this would only be necessary if OLAP reporting
has to be done. It sounds like you're already doing SQL reporting for this,
and if that's OK, reporting services reports against the summary tables might
suffice.

That's what I've been thinking too: why use Analysis Services if a) all we
want are these static reports and b) we can get satisfactory performance from
precalculated values? If we want to really USE the data then use AS in
conjunction with a front end. If we're going to burn cycles to precalculate
stuff in SQL, why burn more to process a cube for minimal performance gain if
we're not really using the cube?

Quote:
I apologize for any disappointment caused by these suggestions, but I just
don't know of any tricks for MEDIAN, being the processing intensive stat that
it is.
No disappointment at all, believe me. Instead, a tremendous help.

Quote:
Somethings to look at in your UDF, be sure it's using an index to do it's
task. If not put an index on those columns. If one exists, but it's not
being used, kick it in the pants with a hint. If you're cursoring to get
your MEDIAN in the UDF, this should make that returned recordset come back
faster. Not knowing what else goes on in your UDF, I'm not sure what else to
suggest.
Here it is:
----------------------------------------
CREATE FUNCTION dbo.funcReportStatisticalMedianPriceByLevel2
(
@mItemID as int,
@mStartDate as smalldatetime,
@mEndDate as smalldatetime,
@mLevelTypeID as int,
@mLevelID as int
)
RETURNS float
AS
BEGIN

DECLARE @mtblItemPrices Table
(
mPrice float NOT NULL
)
DECLARE @mMedianPrice as float
DECLARE @mRecordCount as int
DECLARE @midx as int

INSERT INTO @mtblItemPrices
SELECT DISTINCT ItemPrice FROM dbo.Sales
WHERE dbo.Sales.ItemsID = @mItemID
AND
dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate
AND (CASE @mLevelTypeID
--the enterprise level, no filter applied
WHEN 1 THEN
1
--the region level, filter by RegionID
WHEN 2 THEN
CASE WHEN dbo.Sales.RegionID = @mLevelID THEN 1 ELSE NULL END
--the district level, filter by DistrictID
WHEN 3 THEN
CASE WHEN dbo.Sales.DistrictID = @mLevelID THEN 1 ELSE NULL END
--the municipality level, filter by MunicipalityID
WHEN 4 THEN
CASE WHEN dbo.Sales.MunicipalityID= @mLevelID THEN 1 ELSE NULL END
--the site level, filter by StoreID
WHEN 5 THEN
CASE WHEN dbo.Sales.StoreID = @mLevelID THEN 1 ELSE NULL END
END = 1)

ORDER BY ItemPrice ASC
SET @mRecordCount = (SELECT Count(*) FROM @mtblItemPrices)
IF @mRecordCount = 0
BEGIN
SET @mMedianPrice = NULL
END
ELSE
BEGIN

SET @mMedianPrice = (SELECT TOP 1 mPrice FROM
(SELECT TOP 50 PERCENT mPrice FROM

@mtblItemPrices
ORDER BY mPrice ASC) AS tmp
ORDER BY mPrice DESC)
END
RETURN @mMedianPrice
END
---------------------------------------------
Quote:
Ways to optimize your cube would be to partition it by something queried
often and load balanced. Time's usually a good candidate if the data isn't
too seasonal. Partitions are great for processing (multiple partitions can
be processed in parallel) and great for querying since smaller pieces have to
be evaluated. Be sure to set your slice on each partition, or performance
could actually be worse. In a partitioned cube, the MEDIAN could have many
threads going after smaller amounts of data so that could speed up this
dynamic work as well. Just something else to consider.
Another great suggestion. If we go further down the AS road we'll certainly
try it.

Quote:
Let me know if you have any questions on all this babbling. You have an
interesting challenge.

Good luck.

- Phil
Interesting indeed, and I'll take all the luck I can get. Your help here is
a most congenial bit of good luck in itself.

Hoisting imaginary pint on high, I remain, sir, &c
OM





Reply With Quote
  #6  
Old   
SQL McOLAP
 
Posts: n/a

Default RE: median as a derived measure - 09-14-2005 , 09:03 AM



Hi again, OM -

The sample MDX I was alluding to would be something like:

'IIF([Product].CurrentMember.Level.Ordinal >= [Product].[Product
Subcategory].Ordinal,
MEDIAN({[Product].CurrentMember.Children},[Measures].[Store Sales]),
NULL)'

I just used a FoodMart example here. In this calc member, if you have
product subcategories or any level below in the product dimension on your
axis, it will get the median of the current member's children. If not, it
won't even attempt it and give you a null.

Obviously, this is a watered down sample, you'd need to include multiple
dimension level evaluations in your calc, since you'd be looking for
time/product/location, but I think you get the idea from this.

Test the heck out of this. You may find that allowing this only at the
lower levels of your dims, it might return the median in a time that isn't
"that bad." Again, end users may not accept this if they want median at the
higher levels. But as you were alluding to, sometimes MEDIAN at higher
levels may not even make business sense. Something they'll let you know with
further discussion, I'm sure.

As far as your UDF, be sure you have an index on what you're doing your
select on, this part:

SELECT DISTINCT ItemPrice FROM dbo.Sales
WHERE dbo.Sales.ItemsID = @mItemID
AND
dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate

...an index on TransactionDate, ItemsID. Be sure it's being used by checking
the performance plan. If it's not, hint it. I have a feeling you've
probably already done this, but just to be sure.

Something else to get some more performance, do a SELECT INTO instead of an
INSERT INTO, to avoid the overhead of logging. When the UDF is done, you can
delete that table created by the SELECT INTO. If massive amounts of rows get
inserted at this point (which would happen at a higher level) you may want to
investigate array processing in VB or C# as an alternative.

Some food for thought.

Best of luck.

- Phil


"ObscuroMondo" wrote:

Quote:
Many thanks for your most excellent reply. "Babbling," ha! Reply inline.

"SQL McOLAP" wrote:

OM -

A question to ask the business. Is MEDIAN necessary? Is AVG completely out
of the question? Not to dodge the requirement, but I've consulted at places
where the business swore that MEDIAN was necessary, but after further
analysis found that the difference between AVG and MEDIAN was immaterial,
especially with larger datasets and "normal" data (where the STDEV wasn't
huge). If the business was willing to do some analysis on this data, they
may find the same thing. If so, a simple "count" column with just a "1" in
it could easily be put in a SQL view for your fact table, and then you could
create an easy calc member that divided the price by that count. This will
always be very fast in the cube, and you won't have to worry about different
levels or granularity.

They want both, but median has been the hard one as you point out. I am
going to do as you suggest and ask them "what benefit do you get from median
that you don't get from average?" You could give them every conceivable
statistical measure and I'm not sure it's going to help them much, given the
nature of the reports they supplied in prototype as Excel spreadsheets.
Their prototype reports are strictly static snapshots of measures calculated
over user-selectable time periods. To draw useful inferences from their
reports you have to generate a collection of instances of a given report, run
over various time periods, and then traverse the collection "by hand" looking
for interesting things.

They would be much better served by running a 3rd-party front end against
well-designed and implemented AS cubes, and I'm pushing that idea, but in the
meantime we need to generate their reports as quickly as possible.

Having said that, the following is an attempt to help you with MEDIAN.

Unfortunately, you're in a real "pay me now, or pay me later" scenario. If
you want MEDIAN to be even remotely performant in reporting over large
amounts of fact data, you'll probably have to dedicate some "up front"
processing to get that MEDIAN, and then store it. And this processing will
take time, as you're seeing.

some SQL McOLAP content elided for brevity

That's the general approach I've taken. We're trying to pipeline the
statistical calculations, even if it means giving up some numerical
precision. So we've been looking for ways for municipality, district, and
region servers to generate daily stats and include them in the data warehouse
dump each day. While taking the median of a price across Districts to
represent the Region median price is obviously not the same as really
calculating the Region median from the entire sample, is it possibly "close
enough for rock'n'roll?" Of course the business must decide, not us.

The above was a SQL solution. On the OLAP side, with your current SQL and
OLAP structure (without the summary tables described above), try finding a
"sweetspot." Allow the median to be done only at a certain level, if
possible. In other words, in an IIF statement in your calc member, only do
the MEDIAN if you're at day/store/product, or month/store/product. You may
find at lower levels the performance isn't "that bad," but test it out.
Maybe month/store/product isn't "that bad" either. I have a feeling, though,
if the data set is large, you might be stuck at anything higher than that
lowest level. Another obvious problem, is that the business may want to see
the median above that level as well. Let me know if you want some sample MDX
for this.

Yes! Yes! Please. That's one of our requirements.

A "combined" solution would be to create the SQL tables in the first
suggestion, and then build cubes off each of them, keeping all other dims the
same, but only enabling the levels germain to that cube for the 3 mentioned
dims. You could then join these cubes in a virtual cube.

some SQL McOLAP content elided for brevity

All this would only be necessary if OLAP reporting
has to be done. It sounds like you're already doing SQL reporting for this,
and if that's OK, reporting services reports against the summary tables might
suffice.


That's what I've been thinking too: why use Analysis Services if a) all we
want are these static reports and b) we can get satisfactory performance from
precalculated values? If we want to really USE the data then use AS in
conjunction with a front end. If we're going to burn cycles to precalculate
stuff in SQL, why burn more to process a cube for minimal performance gain if
we're not really using the cube?

I apologize for any disappointment caused by these suggestions, but I just
don't know of any tricks for MEDIAN, being the processing intensive stat that
it is.

No disappointment at all, believe me. Instead, a tremendous help.

Somethings to look at in your UDF, be sure it's using an index to do it's
task. If not put an index on those columns. If one exists, but it's not
being used, kick it in the pants with a hint. If you're cursoring to get
your MEDIAN in the UDF, this should make that returned recordset come back
faster. Not knowing what else goes on in your UDF, I'm not sure what else to
suggest.

Here it is:
----------------------------------------
CREATE FUNCTION dbo.funcReportStatisticalMedianPriceByLevel2
(
@mItemID as int,
@mStartDate as smalldatetime,
@mEndDate as smalldatetime,
@mLevelTypeID as int,
@mLevelID as int
)
RETURNS float
AS
BEGIN

DECLARE @mtblItemPrices Table
(
mPrice float NOT NULL
)
DECLARE @mMedianPrice as float
DECLARE @mRecordCount as int
DECLARE @midx as int

INSERT INTO @mtblItemPrices
SELECT DISTINCT ItemPrice FROM dbo.Sales
WHERE dbo.Sales.ItemsID = @mItemID
AND
dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate
AND (CASE @mLevelTypeID
--the enterprise level, no filter applied
WHEN 1 THEN
1
--the region level, filter by RegionID
WHEN 2 THEN
CASE WHEN dbo.Sales.RegionID = @mLevelID THEN 1 ELSE NULL END
--the district level, filter by DistrictID
WHEN 3 THEN
CASE WHEN dbo.Sales.DistrictID = @mLevelID THEN 1 ELSE NULL END
--the municipality level, filter by MunicipalityID
WHEN 4 THEN
CASE WHEN dbo.Sales.MunicipalityID= @mLevelID THEN 1 ELSE NULL END
--the site level, filter by StoreID
WHEN 5 THEN
CASE WHEN dbo.Sales.StoreID = @mLevelID THEN 1 ELSE NULL END
END = 1)

ORDER BY ItemPrice ASC
SET @mRecordCount = (SELECT Count(*) FROM @mtblItemPrices)
IF @mRecordCount = 0
BEGIN
SET @mMedianPrice = NULL
END
ELSE
BEGIN

SET @mMedianPrice = (SELECT TOP 1 mPrice FROM
(SELECT TOP 50 PERCENT mPrice FROM

@mtblItemPrices
ORDER BY mPrice ASC) AS tmp
ORDER BY mPrice DESC)
END
RETURN @mMedianPrice
END
---------------------------------------------

Ways to optimize your cube would be to partition it by something queried
often and load balanced. Time's usually a good candidate if the data isn't
too seasonal. Partitions are great for processing (multiple partitions can
be processed in parallel) and great for querying since smaller pieces have to
be evaluated. Be sure to set your slice on each partition, or performance
could actually be worse. In a partitioned cube, the MEDIAN could have many
threads going after smaller amounts of data so that could speed up this
dynamic work as well. Just something else to consider.

Another great suggestion. If we go further down the AS road we'll certainly
try it.

Let me know if you have any questions on all this babbling. You have an
interesting challenge.

Good luck.

- Phil

Interesting indeed, and I'll take all the luck I can get. Your help here is
a most congenial bit of good luck in itself.

Hoisting imaginary pint on high, I remain, sir, &c
OM




Reply With Quote
  #7  
Old   
SQL McOLAP
 
Posts: n/a

Default RE: median as a derived measure - 09-14-2005 , 09:11 AM



Sorry, here's the MDX I meant to post:

with member [measures].[Median Child Sales] as
'IIF([Product].CurrentMember.Level.Ordinal >= [Product].[Product
Subcategory].Ordinal, MEDIAN({Descendants (
[Product].CurrentMember,[Product].[Product Name])},[Measures].[Store Sales]),
NULL)'

....the previous example didn't take the descendants at the lowest level, as
your requirement is.

My apologies.

- Phil



"SQL McOLAP" wrote:

Quote:
Hi again, OM -

The sample MDX I was alluding to would be something like:

'IIF([Product].CurrentMember.Level.Ordinal >= [Product].[Product
Subcategory].Ordinal,
MEDIAN({[Product].CurrentMember.Children},[Measures].[Store Sales]),
NULL)'

I just used a FoodMart example here. In this calc member, if you have
product subcategories or any level below in the product dimension on your
axis, it will get the median of the current member's children. If not, it
won't even attempt it and give you a null.

Obviously, this is a watered down sample, you'd need to include multiple
dimension level evaluations in your calc, since you'd be looking for
time/product/location, but I think you get the idea from this.

Test the heck out of this. You may find that allowing this only at the
lower levels of your dims, it might return the median in a time that isn't
"that bad." Again, end users may not accept this if they want median at the
higher levels. But as you were alluding to, sometimes MEDIAN at higher
levels may not even make business sense. Something they'll let you know with
further discussion, I'm sure.

As far as your UDF, be sure you have an index on what you're doing your
select on, this part:

SELECT DISTINCT ItemPrice FROM dbo.Sales
WHERE dbo.Sales.ItemsID = @mItemID
AND
dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate

..an index on TransactionDate, ItemsID. Be sure it's being used by checking
the performance plan. If it's not, hint it. I have a feeling you've
probably already done this, but just to be sure.

Something else to get some more performance, do a SELECT INTO instead of an
INSERT INTO, to avoid the overhead of logging. When the UDF is done, you can
delete that table created by the SELECT INTO. If massive amounts of rows get
inserted at this point (which would happen at a higher level) you may want to
investigate array processing in VB or C# as an alternative.

Some food for thought.

Best of luck.

- Phil


"ObscuroMondo" wrote:

Many thanks for your most excellent reply. "Babbling," ha! Reply inline.

"SQL McOLAP" wrote:

OM -

A question to ask the business. Is MEDIAN necessary? Is AVG completely out
of the question? Not to dodge the requirement, but I've consulted at places
where the business swore that MEDIAN was necessary, but after further
analysis found that the difference between AVG and MEDIAN was immaterial,
especially with larger datasets and "normal" data (where the STDEV wasn't
huge). If the business was willing to do some analysis on this data, they
may find the same thing. If so, a simple "count" column with just a "1" in
it could easily be put in a SQL view for your fact table, and then you could
create an easy calc member that divided the price by that count. This will
always be very fast in the cube, and you won't have to worry about different
levels or granularity.

They want both, but median has been the hard one as you point out. I am
going to do as you suggest and ask them "what benefit do you get from median
that you don't get from average?" You could give them every conceivable
statistical measure and I'm not sure it's going to help them much, given the
nature of the reports they supplied in prototype as Excel spreadsheets.
Their prototype reports are strictly static snapshots of measures calculated
over user-selectable time periods. To draw useful inferences from their
reports you have to generate a collection of instances of a given report, run
over various time periods, and then traverse the collection "by hand" looking
for interesting things.

They would be much better served by running a 3rd-party front end against
well-designed and implemented AS cubes, and I'm pushing that idea, but in the
meantime we need to generate their reports as quickly as possible.

Having said that, the following is an attempt to help you with MEDIAN.

Unfortunately, you're in a real "pay me now, or pay me later" scenario. If
you want MEDIAN to be even remotely performant in reporting over large
amounts of fact data, you'll probably have to dedicate some "up front"
processing to get that MEDIAN, and then store it. And this processing will
take time, as you're seeing.

some SQL McOLAP content elided for brevity

That's the general approach I've taken. We're trying to pipeline the
statistical calculations, even if it means giving up some numerical
precision. So we've been looking for ways for municipality, district, and
region servers to generate daily stats and include them in the data warehouse
dump each day. While taking the median of a price across Districts to
represent the Region median price is obviously not the same as really
calculating the Region median from the entire sample, is it possibly "close
enough for rock'n'roll?" Of course the business must decide, not us.

The above was a SQL solution. On the OLAP side, with your current SQL and
OLAP structure (without the summary tables described above), try finding a
"sweetspot." Allow the median to be done only at a certain level, if
possible. In other words, in an IIF statement in your calc member, only do
the MEDIAN if you're at day/store/product, or month/store/product. You may
find at lower levels the performance isn't "that bad," but test it out.
Maybe month/store/product isn't "that bad" either. I have a feeling, though,
if the data set is large, you might be stuck at anything higher than that
lowest level. Another obvious problem, is that the business may want to see
the median above that level as well. Let me know if you want some sample MDX
for this.

Yes! Yes! Please. That's one of our requirements.

A "combined" solution would be to create the SQL tables in the first
suggestion, and then build cubes off each of them, keeping all other dims the
same, but only enabling the levels germain to that cube for the 3 mentioned
dims. You could then join these cubes in a virtual cube.

some SQL McOLAP content elided for brevity

All this would only be necessary if OLAP reporting
has to be done. It sounds like you're already doing SQL reporting for this,
and if that's OK, reporting services reports against the summary tables might
suffice.


That's what I've been thinking too: why use Analysis Services if a) all we
want are these static reports and b) we can get satisfactory performance from
precalculated values? If we want to really USE the data then use AS in
conjunction with a front end. If we're going to burn cycles to precalculate
stuff in SQL, why burn more to process a cube for minimal performance gain if
we're not really using the cube?

I apologize for any disappointment caused by these suggestions, but I just
don't know of any tricks for MEDIAN, being the processing intensive stat that
it is.

No disappointment at all, believe me. Instead, a tremendous help.

Somethings to look at in your UDF, be sure it's using an index to do it's
task. If not put an index on those columns. If one exists, but it's not
being used, kick it in the pants with a hint. If you're cursoring to get
your MEDIAN in the UDF, this should make that returned recordset come back
faster. Not knowing what else goes on in your UDF, I'm not sure what else to
suggest.

Here it is:
----------------------------------------
CREATE FUNCTION dbo.funcReportStatisticalMedianPriceByLevel2
(
@mItemID as int,
@mStartDate as smalldatetime,
@mEndDate as smalldatetime,
@mLevelTypeID as int,
@mLevelID as int
)
RETURNS float
AS
BEGIN

DECLARE @mtblItemPrices Table
(
mPrice float NOT NULL
)
DECLARE @mMedianPrice as float
DECLARE @mRecordCount as int
DECLARE @midx as int

INSERT INTO @mtblItemPrices
SELECT DISTINCT ItemPrice FROM dbo.Sales
WHERE dbo.Sales.ItemsID = @mItemID
AND
dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate
AND (CASE @mLevelTypeID
--the enterprise level, no filter applied
WHEN 1 THEN
1
--the region level, filter by RegionID
WHEN 2 THEN
CASE WHEN dbo.Sales.RegionID = @mLevelID THEN 1 ELSE NULL END
--the district level, filter by DistrictID
WHEN 3 THEN
CASE WHEN dbo.Sales.DistrictID = @mLevelID THEN 1 ELSE NULL END
--the municipality level, filter by MunicipalityID
WHEN 4 THEN
CASE WHEN dbo.Sales.MunicipalityID= @mLevelID THEN 1 ELSE NULL END
--the site level, filter by StoreID
WHEN 5 THEN
CASE WHEN dbo.Sales.StoreID = @mLevelID THEN 1 ELSE NULL END
END = 1)

ORDER BY ItemPrice ASC
SET @mRecordCount = (SELECT Count(*) FROM @mtblItemPrices)
IF @mRecordCount = 0
BEGIN
SET @mMedianPrice = NULL
END
ELSE
BEGIN

SET @mMedianPrice = (SELECT TOP 1 mPrice FROM
(SELECT TOP 50 PERCENT mPrice FROM

@mtblItemPrices
ORDER BY mPrice ASC) AS tmp
ORDER BY mPrice DESC)
END
RETURN @mMedianPrice
END
---------------------------------------------

Ways to optimize your cube would be to partition it by something queried
often and load balanced. Time's usually a good candidate if the data isn't
too seasonal. Partitions are great for processing (multiple partitions can
be processed in parallel) and great for querying since smaller pieces have to
be evaluated. Be sure to set your slice on each partition, or performance
could actually be worse. In a partitioned cube, the MEDIAN could have many
threads going after smaller amounts of data so that could speed up this
dynamic work as well. Just something else to consider.

Another great suggestion. If we go further down the AS road we'll certainly
try it.

Let me know if you have any questions on all this babbling. You have an
interesting challenge.

Good luck.

- Phil

Interesting indeed, and I'll take all the luck I can get. Your help here is
a most congenial bit of good luck in itself.

Hoisting imaginary pint on high, I remain, sir, &c
OM




Reply With Quote
  #8  
Old   
ObscuroMondo
 
Posts: n/a

Default RE: median as a derived measure - 09-14-2005 , 10:01 AM



Mr. McOLAP, I am in your debt. This has been the most productive newsgroup
discussion in many a day. Also, no apologies required, my word. Many
thanks for your trenchant suggestions. Reply inline.

<some SQL McOLAP content elided for brevity>

<appended SQL McOLAP content from followon post>

Quote:
Sorry, here's the MDX I meant to post:

with member [measures].[Median Child Sales] as
'IIF([Product].CurrentMember.Level.Ordinal >= [Product].[Product
Subcategory].Ordinal, MEDIAN({Descendants (
[Product].CurrentMember,[Product].[Product Name])},[Measures].[Store Sales]),
NULL)'

...the previous example didn't take the descendants at the lowest level, as
your requirement is.

My apologies.
Thanks! Now implementing this version and we'll see how it goes.

</appended SQL McOLAP content from followon post>

Quote:
But as you were alluding to, sometimes MEDIAN at higher
levels may not even make business sense. Something they'll let you know with
further discussion, I'm sure.
From what I've seen of their data, average will tell them what they NEED to
know. We may all WANT median but it may be too expensive an adornment at
this point.

Quote:
As far as your UDF, be sure you have an index on what you're doing your
select on, this part:

SELECT DISTINCT ItemPrice FROM dbo.Sales
WHERE dbo.Sales.ItemsID = @mItemID
AND
dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate

..an index on TransactionDate, ItemsID. Be sure it's being used by checking
the performance plan. If it's not, hint it. I have a feeling you've
probably already done this, but just to be sure.
Blush to admit it, but yes and no. We had separate indices on each column
but not a composite, and the query plan was not using them. Created a new
composite index and am now testing to see if we get a performance increase.
Each run takes a while. A great suggestion.

Quote:
Something else to get some more performance, do a SELECT INTO instead of an
INSERT INTO, to avoid the overhead of logging. When the UDF is done, you can
delete that table created by the SELECT INTO. If massive amounts of rows get
inserted at this point (which would happen at a higher level) you may want to
investigate array processing in VB or C# as an alternative.

Some food for thought.
Also now in testing. Results in a bit.

Quote:
Best of luck.

- Phil
And to you also.

With gratitude
OM



Reply With Quote
  #9  
Old   
ObscuroMondo
 
Posts: n/a

Default RE: median as a derived measure - 09-19-2005 , 09:00 AM



One final note: research seems to indicate that scalar UDFs used in SELECT or
WHERE clauses have the effect of forcing serial execution. On a
multiprocessor box, this means that no matter how complex the query, if it
uses UDFs, it will be limited to running on a single processor.

This is not an Analysis Services issue per se, but may be helpful to others
seeking the best performance from their systems.

Google Groups thread:

http://groups.google.com/group/micro...200872c184e263

Thanks again
OM

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.