Wednesday, March 7, 2012

Data Mining Clusters

In a response posted Nov 21 (Clustering Dimension), Jamie wrote...

"The only option of using a table-based model as a dimension is to write out the cluster labels and simply make the cluster label as a dimension attribute. You could even append the cluster label to the source data (e.g. the customer table) and not have a seperate dimension, simply a browseable attribute on the dimension of interest"

Jamie, can you provide more information on how to do this? We'd like to have a series of clusters in an existing household dimension. That is, we need multiple occurences of cluster model results over time browsable in the source cube. I've looked at the data source, dimension, and cube created by the data mining model, but I don't see where the case ID (Household Key) and the cluster name could be extracted to update the existing dimension. We're using the cube for the data mining source.

This would also help to fix a recurring problem we have with keeping the linked cube and the source cube metadata in sync. If I make a change to the source cube, say by adding a new measure, the metadata for the linked cube gets out of sync. I've been deleting the data mining dimension, cube, and dsv and them adding them back in using the data mining menu in the model.

Sorry for the late reply - you can use SSIS to take the results of a query e.g. SELECT t.HouseHoldID, Cluster() FROM MyModel PREDICTION JOIN .... , and save to a table. Then add the table to the DSV you use to process the cube. It may be easier to process the cluster model from the source data rather than the cube, although either option is entirely possible, and if you are using aggregated measures found in the cube, the cube method is likely better.

Unfortunately for your latter question, there is no way to keep the cube metadata in synch with linked measure groups. However, if the only difference is the presence of a single DM dimension, it is easy to drop and recreate the linked cube from the user interface. If you have more than one DM dimension, it's more difficult, but it's probably more reliable than trying to make all of your other changes in both places.

|||

Thanks for the repy Jamie. We are using aggregated values from the cube in our model. We'll probably use 4-5 cluster models in our cube, so being able to query them and add them to the DSV will save a lot of time rebuilding the data mining dimensions with the wizard.

Since we're not doing prediction in the query, wouldn't the DMX query be something like:

Select [Household ID], Cluster() From [MyModel]

When I run this I get:

Error (Data mining): Only a predictable column (or a column that is related to a predictable column) can be referenced from the mining model in the context at line 2, column 8.

Thanks for your help

|||

After some more reading, I think I understand the error. I've been thinking of retrieving the case ID and the cluster from the trained model, not issuing a prediction query to get the cluster name for a new case.

In our model, we're using clustering against a household dimension with a nested fact table containing a profitability fact and the fact components that make up profitability (i.e. clustering around the factors that make someone profitable or not.) These are not attributes that can be used to predict profitability, rather they describe it. It sounds like you're saying that we need to make profitability predictable and issue a prediction query on these facts to get the case ID (Household Key) and Cluster(). Is this correct? I've been trying to use the Mining Model Prediction tab to help me build the query, but I don't see how to get the nested case data in the query.

|||

You can enable drillthrough on the model (available in the wizard or the property page for the model) and then you can issue statements like

SELECT * FROM MyModel.CASES

If you want to get the cluster membership for each row you can issue a query like (check syntax) (you won't be able to create with the prediction query builder - you will have to enter by hand)

SELECT t.[Household ID], Cluster() FROM MyModel NATURAL PREDICTION JOIN (SELECT * FROM MyModel.CASES) as t

FYI, you can just add the nested table to a query, e.g.

SELECT t.[Household ID], t.[MyNestedtable], Cluster() ...

or if you want everything

SELECT t.*, Cluster() ...

No comments:

Post a Comment