Wednesday, March 7, 2012

Data mining cube with 0-value facts

I'm defining a mining structure against an OLAP dimension. The continuous value that I'm using both as input and for forecasting represents the time to complete a certain process.

There's something that strikes me as if it could be a problem, but I'm not sure. Our fact table has multiple columns (with multiple correponding measures in the cube). The "time-to-complete" measure is only populated on some of the fact rows - the rows that represent completion information. Other rows represent other information, and the "time-to-complete" value is set to 0. This works fine for cumulative time-to-complete and average time-to-complete, but it seems like it could mess up data mining. Will those 0-value facts skew the mining results? I'm not seeing a way to filter out those entries and only include the non-zero facts in the mining processing.

Or perhaps I'm totally misunderstanding something, which is quite possible. :)

The zeroes will mess you up in DM and OLAP. An average of a value ignores nulls, but not zeroes.

You can set your measure to retain null values in the fact table - check out the data binding settings.

Hope this helps,

Richard

|||The zeros are properly handled in our average calculated measures in OLAP (we don't use the EE AverageOfChildren aggregation function).

Sounds like it will screw up DM stuff, though. If the values were nulls instead of 0, would DM handle it correctly?|||

Sine you've modeled the variable as continous, a 0 value will be treated as a possible value whereas a null would be treated as missing by DM. They would hence generate different DM model.

If the only zero value in your data correspond to null (missing) value, you can either filter those out from the input data or have another computed measure be null whenever the measure is 0 and use that measure as input to the data mining algorithm.

Hope this helps

|||>you can either filter those out from the input data

How would I do that? There isn't a dimension slice that represents this condition, and that's the only filtering that I've been able to find so far.

A computed measure wouldn't be ideal, but it might work. I'll think about that.
|||Sorry, my earlier statement was not accurate. You cannot filter by measure values in OLAP Mining Model.|||OK, after doing a little more testing, I'm more confused.

As a test, I created a new calculation that takes the original measure and converts 0-valued facts to Nulls. I added that new calculation to the mining structure and models and processed them. Then I ran a singleton prediction query for both the original and new measures. They came up with the exact same results. So, as far as I can tell, either:

a) The 0-value facts don't affect the mining results
b) The null facts are treated the same as the 0 facts (and not ignored)
c) Something about how I'm doing my test is flawed.

Anyone have any suggestions or thoughts here?|||It's possible that the column doesn't have any impact on the result at all? What algorithm are you using and does the column show up as "important" in the viewer?|||Actually, I'm working with a degeneratively simple test case here, where the mining structure only includes one input - a single continuous value, used both as input and for prediction. Perhaps that's too trivial for a reasonable test.

I'm using the Neural Network algorithm, but I can't bring up the viewer - I just get an error (which I reported in another thread in these forums).|||

Aha!

That's the reason for this problem and may be the reason for the viewer problem as well. It seems you found a bug in the system, the bug being that the software didn't return an error when you tried to create this model .

When you mark a column as "input" what you are stating is "use this column as an input to predict other columns." When you mark it as output you are stating "determine values of this column based on all other inputs." We have the concept of "input AND output" because we allow you to predict multiple attributes in a single model.

Therefore, what is happening is that for that column, you have no inputs, and it is an input for no other columns, since there are no more columns for which it could be an input. Simply put, it's an invalid model definition which we should have rejected from the start.

Thanks

-Jamie

|||Doh! Did I mention I was a noob with data mining? :)

I fixed the model, and the results seem to make more sense now. However, it didn't fix the problem with the mining model viewer.|||It's hard to say - there may not be enough differentiation for the viewer to show patterns at that point. I would try things out with a more well known dataset - e.g. the tutorial so you can get a feel for how the tools work in a tested environment and then see if the issue seems to be the software not working correctly or your data not being rich enough to demonstrate descriptive results.

No comments:

Post a Comment