Wednesday, March 7, 2012

Data Mining "How to"

Hello all,

I was handed a new project and I'm not quite sure how to do it or even what technology to use. Now, I think that SQL Server's Analysis Services could help (or even do the entire thing) but I'm not sure. So, basically, I'm looking for recommendations and ressources on the subject.

Here's my scenario:

We have a Data-Warehouse that stores information about the production of our company. Around this fact table, we have multiple dimensions including one about the origin of the production (who) and a time dimension (when).

We are buiding cubes and linking Excel to them for future analysis. Everything is done. Now, the all-mighty supervisors would like to have some predictions built-in to the system for the rest of the year based on the previous years. Basically, we would be adding production values for every possible time (the step is every hour, so the sum of the production comming from a certain origin is set as the fact and linked to a time dimension row for that hour).

As you can see, it's not very difficult and it seems to me like Analysis Services might already have that functionnality although I couldn't find it (seems like predictions can only be made by looking at relations found by the data mining algorithms and these predictions need to be made by a human because they are not really registered as "new" facts).

So, in short, I'd like to have a tool calculate future productions for the rest of the year and set a new column (or dimension) to a certain value indicating that this fact row has been predicted and is not certain. Of course, the refreshing of data would not create duplicate rows for predictions, newer predictions (probably based on more facts) would replace the last equivalent prediction.

I hope that's clear.

What do you recommend I use? Is it a case where I will need to build my own separate program (that updates the data-warehouse) or is it possible to achieve this using data mining?

Thanks a lot,

Skip.What algorythm are you using for forecasts? You aren't forecasting hourly for the entire year, are you?|||I sure am!!! In fact, I have to if I want to have users sort through the data, both forecasted and real, for any time step available. Plus, forecasting for every hour in a year only represents a maximum of 8760 rows (24 hours * 365 days) for every source. It's not *that* big.

Unless I'm missing something, I would like to forecast for every remaining hours in a year. If you have another suggestion, I'd be glad to look at it.

Thanks,

Skip.|||You can do this by creating a table of sequential numbers populated with the values 0 through 8760. Then you use dateadd(h, sequentialvalue, startdate) to create a dataset of all the values you want to return. You then select from this table left outer joining to your table of actual values, returning the actual values when they exist and applying your forecast algorythm when they do not.|||Thanks blindman,

Thing is, I've already done this. My question was more is there a forecasting algorithm in Analysis Services (or DTS, or SQL Server or anywhere else) or do I need to build my own?

Seems to me like you can only predict accross columns and not add new rows (predict trends and not the "future").

Skip.|||There are lots of different forecasting algorythms. Two popular ones are linear regression models (good for straight-line forecasts), and the Winter's model (good for systems that experience multiple cycles, such as daily, monthly, yearly, etc...).

I don't know of any built-in SQL functions or utilities that will do this. SQL server does not even have a function for linear regression. Simple one-variable linear regression is not too difficult to program through TSQL (I've done it in Access), but multi-variate can get pretty hairy.

Two options you might consider would be searching for third-party tools that do statistical operations on database data (such as SAS), or using Excel as a reporting interface so you can take advantage of it's extensive function library.

No comments:

Post a Comment