![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
Greetings all, I'm an AS n00b and my ignorance is profound. Hope somebody here can help me out. We're trying to speed up generation of sales reports for a run-of-the-mill POS system, nothing fancy. We're using Reporting Services and are now calculating median item selling price at report runtime by means of a T-SQL UDF. This UDF calculates the median price at each level of the org unit hierarchy: store, municipality, district, region, and enterprise. With millions of POS transaction records and thousands of inventory items, this takes a long, long time and users are complaining. We hoped that by basing the reports on results obtained from Analysis Services we could move from calculating the median at runtime to getting it from a derived measure calculated at cube processing time. We have a sort-of solution for a calculated measure although only the leaf level gets the correct result: Median(NonEmptyCrossJoin(Descendants([Sales].CurrentMember,,LEAVES), Descendants([Stores].CurrentMember,,LEAVES)), [Measures].[Item Price]) Based on my experimentation, only T-SQL functions can appear as part of a derived measure. SUM, AVG and the like are permitted, but UDFs cannot be used in the definition of a derived member, correct? We want to use AS to precalculate median and other statistical measures so that we're only doing lookups at report generation time. Can this be done? If so, how? Many thanks for your time and help OM |
#3
| |||||||
| |||||||
|
|
OM - Median is tough no matter what the technology is because of what it has to do. All detail records have to be sorted first, and then it has to find the middle record's value (avg two records if two are in the middle). |
|
Unlike Oracle, there's no native MEDIAN function in SQL Server, but I've seen MEDIAN be pretty slow in large Oracle tables as well. I'm guessing your UDF is doint all the MEDIAN work. |
|
My questions to start are this: Do you want the median of the item's price at the lowest level (store), no matter what? |
|
And is time part of this equation as well? Does it have to be by day, too? I'm guessing you would because if you took it at a higher level and it's a sum aggregate, those prices would be summed up the hierarchy, and I'm guessing that would be incorrect. |
|
The problem you'll have with this, is that the higher up in the hierarchies you go, the more records will have to be dynamically sorted for the MEDIAN function to process, and this could really chug. |
|
Also, what is your "Sales" dimension here? My hope is that you don't have an invoice or account type dimension, because they really aren't beneficial for OLAP analysis. Is it the product dimension you mentioned? |
|
There are a couple of things I'd suggest, some on the SQL side, some on the OLAP side, and a weird combination if my hunch follows what you're doing. Let me know the answers to the above questions, and I'll let you know what I was thinking. - Phil |
#4
| |||
| |||
|
|
"SQL McOLAP" wrote: /* outstanding moniker */ OM - Median is tough no matter what the technology is because of what it has to do. All detail records have to be sorted first, and then it has to find the middle record's value (avg two records if two are in the middle). Many thanks for the rapid reply, Phil. I've watched our UDF run on a dev PC, over a measly 100,000 transaction records and 2,000 SKUs, for 22 minutes. It's a lot of work. Unlike Oracle, there's no native MEDIAN function in SQL Server, but I've seen MEDIAN be pretty slow in large Oracle tables as well. I'm guessing your UDF is doint all the MEDIAN work. You are correct, sir. Me and the rest of the world wish there was a native median in T-SQL. My questions to start are this: Do you want the median of the item's price at the lowest level (store), no matter what? Yes. And is time part of this equation as well? Does it have to be by day, too? I'm guessing you would because if you took it at a higher level and it's a sum aggregate, those prices would be summed up the hierarchy, and I'm guessing that would be incorrect. Yes, and yes. Prices can vary arbitrarily at the store level even within the same day. You are also right about the incorrect aggregation, which is why I wanted to calculate the median before the aggregations are calculated. The problem you'll have with this, is that the higher up in the hierarchies you go, the more records will have to be dynamically sorted for the MEDIAN function to process, and this could really chug. That is true, and our current org-level-aware UDF is really really slow as noted above. You watch that little green Reporting Services pinwheel going around for quite a while, even for small datasets. For real datasets on a dev box you could go to the Bahamas and back by the time it's done. Also, what is your "Sales" dimension here? My hope is that you don't have an invoice or account type dimension, because they really aren't beneficial for OLAP analysis. Is it the product dimension you mentioned? Each record in the Sales dimension represents a single line item from a point-of-sale transaction. So for example if you sell 6 1/4-20 2" bolts, you get a record with the price of a single bolt and the quantity sold, 6 in this case. This business model is a little maddening because there is no control (none!) over prices at any but the Store (leaf) level of the org unit hierarchy, and managers can change prices at any time. Just because you sold a 1/4-20 bolt for $0.25 10 minutes ago doesn't mean that's what it costs right now. So you have to get your pricing information from the actual individual sales transactions. Lotta work, even if the finest Time granularity is Day, which it is. I have in essence made the fact table into a dimension, so I can navigate the sales data. Seemed like a good idea at the time because of the wildly varying per-item pricing. Will happily admit to bad thinking and ignorance, just want to get the thing running at acceptable speed, or be convinced that It Can't Be Done. There are a couple of things I'd suggest, some on the SQL side, some on the OLAP side, and a weird combination if my hunch follows what you're doing. Let me know the answers to the above questions, and I'll let you know what I was thinking. - Phil Many thanks once again OM |
#5
| |||||||||
| |||||||||
|
|
OM - A question to ask the business. Is MEDIAN necessary? Is AVG completely out of the question? Not to dodge the requirement, but I've consulted at places where the business swore that MEDIAN was necessary, but after further analysis found that the difference between AVG and MEDIAN was immaterial, especially with larger datasets and "normal" data (where the STDEV wasn't huge). If the business was willing to do some analysis on this data, they may find the same thing. If so, a simple "count" column with just a "1" in it could easily be put in a SQL view for your fact table, and then you could create an easy calc member that divided the price by that count. This will always be very fast in the cube, and you won't have to worry about different levels or granularity. |
|
Having said that, the following is an attempt to help you with MEDIAN. Unfortunately, you're in a real "pay me now, or pay me later" scenario. If you want MEDIAN to be even remotely performant in reporting over large amounts of fact data, you'll probably have to dedicate some "up front" processing to get that MEDIAN, and then store it. And this processing will take time, as you're seeing. |
|
The above was a SQL solution. On the OLAP side, with your current SQL and OLAP structure (without the summary tables described above), try finding a "sweetspot." Allow the median to be done only at a certain level, if possible. In other words, in an IIF statement in your calc member, only do the MEDIAN if you're at day/store/product, or month/store/product. You may find at lower levels the performance isn't "that bad," but test it out. Maybe month/store/product isn't "that bad" either. I have a feeling, though, if the data set is large, you might be stuck at anything higher than that lowest level. Another obvious problem, is that the business may want to see the median above that level as well. Let me know if you want some sample MDX for this. |
|
A "combined" solution would be to create the SQL tables in the first suggestion, and then build cubes off each of them, keeping all other dims the same, but only enabling the levels germain to that cube for the 3 mentioned dims. You could then join these cubes in a virtual cube. |
|
All this would only be necessary if OLAP reporting has to be done. It sounds like you're already doing SQL reporting for this, and if that's OK, reporting services reports against the summary tables might suffice. |
|
I apologize for any disappointment caused by these suggestions, but I just don't know of any tricks for MEDIAN, being the processing intensive stat that it is. |
|
Somethings to look at in your UDF, be sure it's using an index to do it's task. If not put an index on those columns. If one exists, but it's not being used, kick it in the pants with a hint. If you're cursoring to get your MEDIAN in the UDF, this should make that returned recordset come back faster. Not knowing what else goes on in your UDF, I'm not sure what else to suggest. |
|
Ways to optimize your cube would be to partition it by something queried often and load balanced. Time's usually a good candidate if the data isn't too seasonal. Partitions are great for processing (multiple partitions can be processed in parallel) and great for querying since smaller pieces have to be evaluated. Be sure to set your slice on each partition, or performance could actually be worse. In a partitioned cube, the MEDIAN could have many threads going after smaller amounts of data so that could speed up this dynamic work as well. Just something else to consider. |
|
Let me know if you have any questions on all this babbling. You have an interesting challenge. Good luck. - Phil |
#6
| |||
| |||
|
|
Many thanks for your most excellent reply. "Babbling," ha! Reply inline. "SQL McOLAP" wrote: OM - A question to ask the business. Is MEDIAN necessary? Is AVG completely out of the question? Not to dodge the requirement, but I've consulted at places where the business swore that MEDIAN was necessary, but after further analysis found that the difference between AVG and MEDIAN was immaterial, especially with larger datasets and "normal" data (where the STDEV wasn't huge). If the business was willing to do some analysis on this data, they may find the same thing. If so, a simple "count" column with just a "1" in it could easily be put in a SQL view for your fact table, and then you could create an easy calc member that divided the price by that count. This will always be very fast in the cube, and you won't have to worry about different levels or granularity. They want both, but median has been the hard one as you point out. I am going to do as you suggest and ask them "what benefit do you get from median that you don't get from average?" You could give them every conceivable statistical measure and I'm not sure it's going to help them much, given the nature of the reports they supplied in prototype as Excel spreadsheets. Their prototype reports are strictly static snapshots of measures calculated over user-selectable time periods. To draw useful inferences from their reports you have to generate a collection of instances of a given report, run over various time periods, and then traverse the collection "by hand" looking for interesting things. They would be much better served by running a 3rd-party front end against well-designed and implemented AS cubes, and I'm pushing that idea, but in the meantime we need to generate their reports as quickly as possible. Having said that, the following is an attempt to help you with MEDIAN. Unfortunately, you're in a real "pay me now, or pay me later" scenario. If you want MEDIAN to be even remotely performant in reporting over large amounts of fact data, you'll probably have to dedicate some "up front" processing to get that MEDIAN, and then store it. And this processing will take time, as you're seeing. some SQL McOLAP content elided for brevity That's the general approach I've taken. We're trying to pipeline the statistical calculations, even if it means giving up some numerical precision. So we've been looking for ways for municipality, district, and region servers to generate daily stats and include them in the data warehouse dump each day. While taking the median of a price across Districts to represent the Region median price is obviously not the same as really calculating the Region median from the entire sample, is it possibly "close enough for rock'n'roll?" Of course the business must decide, not us. The above was a SQL solution. On the OLAP side, with your current SQL and OLAP structure (without the summary tables described above), try finding a "sweetspot." Allow the median to be done only at a certain level, if possible. In other words, in an IIF statement in your calc member, only do the MEDIAN if you're at day/store/product, or month/store/product. You may find at lower levels the performance isn't "that bad," but test it out. Maybe month/store/product isn't "that bad" either. I have a feeling, though, if the data set is large, you might be stuck at anything higher than that lowest level. Another obvious problem, is that the business may want to see the median above that level as well. Let me know if you want some sample MDX for this. Yes! Yes! Please. That's one of our requirements. A "combined" solution would be to create the SQL tables in the first suggestion, and then build cubes off each of them, keeping all other dims the same, but only enabling the levels germain to that cube for the 3 mentioned dims. You could then join these cubes in a virtual cube. some SQL McOLAP content elided for brevity All this would only be necessary if OLAP reporting has to be done. It sounds like you're already doing SQL reporting for this, and if that's OK, reporting services reports against the summary tables might suffice. That's what I've been thinking too: why use Analysis Services if a) all we want are these static reports and b) we can get satisfactory performance from precalculated values? If we want to really USE the data then use AS in conjunction with a front end. If we're going to burn cycles to precalculate stuff in SQL, why burn more to process a cube for minimal performance gain if we're not really using the cube? I apologize for any disappointment caused by these suggestions, but I just don't know of any tricks for MEDIAN, being the processing intensive stat that it is. No disappointment at all, believe me. Instead, a tremendous help. Somethings to look at in your UDF, be sure it's using an index to do it's task. If not put an index on those columns. If one exists, but it's not being used, kick it in the pants with a hint. If you're cursoring to get your MEDIAN in the UDF, this should make that returned recordset come back faster. Not knowing what else goes on in your UDF, I'm not sure what else to suggest. Here it is: ---------------------------------------- CREATE FUNCTION dbo.funcReportStatisticalMedianPriceByLevel2 ( @mItemID as int, @mStartDate as smalldatetime, @mEndDate as smalldatetime, @mLevelTypeID as int, @mLevelID as int ) RETURNS float AS BEGIN DECLARE @mtblItemPrices Table ( mPrice float NOT NULL ) DECLARE @mMedianPrice as float DECLARE @mRecordCount as int DECLARE @midx as int INSERT INTO @mtblItemPrices SELECT DISTINCT ItemPrice FROM dbo.Sales WHERE dbo.Sales.ItemsID = @mItemID AND dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate AND (CASE @mLevelTypeID --the enterprise level, no filter applied WHEN 1 THEN 1 --the region level, filter by RegionID WHEN 2 THEN CASE WHEN dbo.Sales.RegionID = @mLevelID THEN 1 ELSE NULL END --the district level, filter by DistrictID WHEN 3 THEN CASE WHEN dbo.Sales.DistrictID = @mLevelID THEN 1 ELSE NULL END --the municipality level, filter by MunicipalityID WHEN 4 THEN CASE WHEN dbo.Sales.MunicipalityID= @mLevelID THEN 1 ELSE NULL END --the site level, filter by StoreID WHEN 5 THEN CASE WHEN dbo.Sales.StoreID = @mLevelID THEN 1 ELSE NULL END END = 1) ORDER BY ItemPrice ASC SET @mRecordCount = (SELECT Count(*) FROM @mtblItemPrices) IF @mRecordCount = 0 BEGIN SET @mMedianPrice = NULL END ELSE BEGIN SET @mMedianPrice = (SELECT TOP 1 mPrice FROM (SELECT TOP 50 PERCENT mPrice FROM @mtblItemPrices ORDER BY mPrice ASC) AS tmp ORDER BY mPrice DESC) END RETURN @mMedianPrice END --------------------------------------------- Ways to optimize your cube would be to partition it by something queried often and load balanced. Time's usually a good candidate if the data isn't too seasonal. Partitions are great for processing (multiple partitions can be processed in parallel) and great for querying since smaller pieces have to be evaluated. Be sure to set your slice on each partition, or performance could actually be worse. In a partitioned cube, the MEDIAN could have many threads going after smaller amounts of data so that could speed up this dynamic work as well. Just something else to consider. Another great suggestion. If we go further down the AS road we'll certainly try it. Let me know if you have any questions on all this babbling. You have an interesting challenge. Good luck. - Phil Interesting indeed, and I'll take all the luck I can get. Your help here is a most congenial bit of good luck in itself. Hoisting imaginary pint on high, I remain, sir, &c OM |
#7
| |||
| |||
|
|
Hi again, OM - The sample MDX I was alluding to would be something like: 'IIF([Product].CurrentMember.Level.Ordinal >= [Product].[Product Subcategory].Ordinal, MEDIAN({[Product].CurrentMember.Children},[Measures].[Store Sales]), NULL)' I just used a FoodMart example here. In this calc member, if you have product subcategories or any level below in the product dimension on your axis, it will get the median of the current member's children. If not, it won't even attempt it and give you a null. Obviously, this is a watered down sample, you'd need to include multiple dimension level evaluations in your calc, since you'd be looking for time/product/location, but I think you get the idea from this. Test the heck out of this. You may find that allowing this only at the lower levels of your dims, it might return the median in a time that isn't "that bad." Again, end users may not accept this if they want median at the higher levels. But as you were alluding to, sometimes MEDIAN at higher levels may not even make business sense. Something they'll let you know with further discussion, I'm sure. As far as your UDF, be sure you have an index on what you're doing your select on, this part: SELECT DISTINCT ItemPrice FROM dbo.Sales WHERE dbo.Sales.ItemsID = @mItemID AND dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate ..an index on TransactionDate, ItemsID. Be sure it's being used by checking the performance plan. If it's not, hint it. I have a feeling you've probably already done this, but just to be sure. Something else to get some more performance, do a SELECT INTO instead of an INSERT INTO, to avoid the overhead of logging. When the UDF is done, you can delete that table created by the SELECT INTO. If massive amounts of rows get inserted at this point (which would happen at a higher level) you may want to investigate array processing in VB or C# as an alternative. Some food for thought. Best of luck. - Phil "ObscuroMondo" wrote: Many thanks for your most excellent reply. "Babbling," ha! Reply inline. "SQL McOLAP" wrote: OM - A question to ask the business. Is MEDIAN necessary? Is AVG completely out of the question? Not to dodge the requirement, but I've consulted at places where the business swore that MEDIAN was necessary, but after further analysis found that the difference between AVG and MEDIAN was immaterial, especially with larger datasets and "normal" data (where the STDEV wasn't huge). If the business was willing to do some analysis on this data, they may find the same thing. If so, a simple "count" column with just a "1" in it could easily be put in a SQL view for your fact table, and then you could create an easy calc member that divided the price by that count. This will always be very fast in the cube, and you won't have to worry about different levels or granularity. They want both, but median has been the hard one as you point out. I am going to do as you suggest and ask them "what benefit do you get from median that you don't get from average?" You could give them every conceivable statistical measure and I'm not sure it's going to help them much, given the nature of the reports they supplied in prototype as Excel spreadsheets. Their prototype reports are strictly static snapshots of measures calculated over user-selectable time periods. To draw useful inferences from their reports you have to generate a collection of instances of a given report, run over various time periods, and then traverse the collection "by hand" looking for interesting things. They would be much better served by running a 3rd-party front end against well-designed and implemented AS cubes, and I'm pushing that idea, but in the meantime we need to generate their reports as quickly as possible. Having said that, the following is an attempt to help you with MEDIAN. Unfortunately, you're in a real "pay me now, or pay me later" scenario. If you want MEDIAN to be even remotely performant in reporting over large amounts of fact data, you'll probably have to dedicate some "up front" processing to get that MEDIAN, and then store it. And this processing will take time, as you're seeing. some SQL McOLAP content elided for brevity That's the general approach I've taken. We're trying to pipeline the statistical calculations, even if it means giving up some numerical precision. So we've been looking for ways for municipality, district, and region servers to generate daily stats and include them in the data warehouse dump each day. While taking the median of a price across Districts to represent the Region median price is obviously not the same as really calculating the Region median from the entire sample, is it possibly "close enough for rock'n'roll?" Of course the business must decide, not us. The above was a SQL solution. On the OLAP side, with your current SQL and OLAP structure (without the summary tables described above), try finding a "sweetspot." Allow the median to be done only at a certain level, if possible. In other words, in an IIF statement in your calc member, only do the MEDIAN if you're at day/store/product, or month/store/product. You may find at lower levels the performance isn't "that bad," but test it out. Maybe month/store/product isn't "that bad" either. I have a feeling, though, if the data set is large, you might be stuck at anything higher than that lowest level. Another obvious problem, is that the business may want to see the median above that level as well. Let me know if you want some sample MDX for this. Yes! Yes! Please. That's one of our requirements. A "combined" solution would be to create the SQL tables in the first suggestion, and then build cubes off each of them, keeping all other dims the same, but only enabling the levels germain to that cube for the 3 mentioned dims. You could then join these cubes in a virtual cube. some SQL McOLAP content elided for brevity All this would only be necessary if OLAP reporting has to be done. It sounds like you're already doing SQL reporting for this, and if that's OK, reporting services reports against the summary tables might suffice. That's what I've been thinking too: why use Analysis Services if a) all we want are these static reports and b) we can get satisfactory performance from precalculated values? If we want to really USE the data then use AS in conjunction with a front end. If we're going to burn cycles to precalculate stuff in SQL, why burn more to process a cube for minimal performance gain if we're not really using the cube? I apologize for any disappointment caused by these suggestions, but I just don't know of any tricks for MEDIAN, being the processing intensive stat that it is. No disappointment at all, believe me. Instead, a tremendous help. Somethings to look at in your UDF, be sure it's using an index to do it's task. If not put an index on those columns. If one exists, but it's not being used, kick it in the pants with a hint. If you're cursoring to get your MEDIAN in the UDF, this should make that returned recordset come back faster. Not knowing what else goes on in your UDF, I'm not sure what else to suggest. Here it is: ---------------------------------------- CREATE FUNCTION dbo.funcReportStatisticalMedianPriceByLevel2 ( @mItemID as int, @mStartDate as smalldatetime, @mEndDate as smalldatetime, @mLevelTypeID as int, @mLevelID as int ) RETURNS float AS BEGIN DECLARE @mtblItemPrices Table ( mPrice float NOT NULL ) DECLARE @mMedianPrice as float DECLARE @mRecordCount as int DECLARE @midx as int INSERT INTO @mtblItemPrices SELECT DISTINCT ItemPrice FROM dbo.Sales WHERE dbo.Sales.ItemsID = @mItemID AND dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate AND (CASE @mLevelTypeID --the enterprise level, no filter applied WHEN 1 THEN 1 --the region level, filter by RegionID WHEN 2 THEN CASE WHEN dbo.Sales.RegionID = @mLevelID THEN 1 ELSE NULL END --the district level, filter by DistrictID WHEN 3 THEN CASE WHEN dbo.Sales.DistrictID = @mLevelID THEN 1 ELSE NULL END --the municipality level, filter by MunicipalityID WHEN 4 THEN CASE WHEN dbo.Sales.MunicipalityID= @mLevelID THEN 1 ELSE NULL END --the site level, filter by StoreID WHEN 5 THEN CASE WHEN dbo.Sales.StoreID = @mLevelID THEN 1 ELSE NULL END END = 1) ORDER BY ItemPrice ASC SET @mRecordCount = (SELECT Count(*) FROM @mtblItemPrices) IF @mRecordCount = 0 BEGIN SET @mMedianPrice = NULL END ELSE BEGIN SET @mMedianPrice = (SELECT TOP 1 mPrice FROM (SELECT TOP 50 PERCENT mPrice FROM @mtblItemPrices ORDER BY mPrice ASC) AS tmp ORDER BY mPrice DESC) END RETURN @mMedianPrice END --------------------------------------------- Ways to optimize your cube would be to partition it by something queried often and load balanced. Time's usually a good candidate if the data isn't too seasonal. Partitions are great for processing (multiple partitions can be processed in parallel) and great for querying since smaller pieces have to be evaluated. Be sure to set your slice on each partition, or performance could actually be worse. In a partitioned cube, the MEDIAN could have many threads going after smaller amounts of data so that could speed up this dynamic work as well. Just something else to consider. Another great suggestion. If we go further down the AS road we'll certainly try it. Let me know if you have any questions on all this babbling. You have an interesting challenge. Good luck. - Phil Interesting indeed, and I'll take all the luck I can get. Your help here is a most congenial bit of good luck in itself. Hoisting imaginary pint on high, I remain, sir, &c OM |
#8
| |||||
| |||||
|
|
Sorry, here's the MDX I meant to post: with member [measures].[Median Child Sales] as 'IIF([Product].CurrentMember.Level.Ordinal >= [Product].[Product Subcategory].Ordinal, MEDIAN({Descendants ( [Product].CurrentMember,[Product].[Product Name])},[Measures].[Store Sales]), NULL)' ...the previous example didn't take the descendants at the lowest level, as your requirement is. My apologies. |
|
But as you were alluding to, sometimes MEDIAN at higher levels may not even make business sense. Something they'll let you know with further discussion, I'm sure. |
|
As far as your UDF, be sure you have an index on what you're doing your select on, this part: SELECT DISTINCT ItemPrice FROM dbo.Sales WHERE dbo.Sales.ItemsID = @mItemID AND dbo.tblTransactions.TransactionDate BETWEEN @mStartDate and @mEndDate ..an index on TransactionDate, ItemsID. Be sure it's being used by checking the performance plan. If it's not, hint it. I have a feeling you've probably already done this, but just to be sure. |
|
Something else to get some more performance, do a SELECT INTO instead of an INSERT INTO, to avoid the overhead of logging. When the UDF is done, you can delete that table created by the SELECT INTO. If massive amounts of rows get inserted at this point (which would happen at a higher level) you may want to investigate array processing in VB or C# as an alternative. Some food for thought. |
|
Best of luck. - Phil |
#9
| |||
| |||
|
![]() |
| Thread Tools | |
| Display Modes | |
| |