In SAP Analytics Cloud (SAC) Smart Predict, you can create three types of predictive scenarios:
- Classification
- Regression
- Time Series
A few months ago, I wrote a blog post detailing the steps to create a predictive model from time series data.
In this post, I explain how to create a classification predictive model. Classification ranks a population according to the probability that an event will happen. For example: who among my customers will react positively to my marketing campaign?
All businesses face questions about upcoming events for their customers, their industrial assets, or their marketing campaigns. Business life is made of events: churn/no churn, failure/no failure, sale/no sale, and delay/on time, to list the most common. That’s what classification does: it associates with each entity of a population the probability that an event will happen.
Forecasting values over time is essential, and it provides an estimate of how a business value will evolve. But it brings no information about individual behavior (what is the profile of the churners?). It may also happen that a time series forecasting algorithm cannot produce forecasts. In this case, classification may be a solution to estimate the rate of a future event.
To help you understand how to leverage a classification predictive model in SAP Analytics Cloud Smart Predict, I will first explain the types of problems you can address with a classification predictive model. Then, I will take a use case to explain how a classification predictive model is built in Smart Predict. Finally, I will go through the different tools that help you assess the accuracy of the predictive model and decide whether to use it or not.
Which questions? Which data?
Smart Predict Classification is useful for answering questions of this form:
“Who is likely to <event> in the next <period length>?”
Here are some examples:
- Who is likely to buy this product next week?
- Who has a high probability of committing fraud next year?
- Who will most likely churn during the next month?
To rank a population, the classification predictive model in Smart Predict generates an equation that predicts the probability that an event happens. Today, it can address only binary cases. But before digging into the details of classification, check whether your data can be used to create a reliable predictive model.
As we saw in this blog, data quality is important for building a reliable predictive model. You need to prepare the historical data that best represents your application domain and your predictive objective, because it is used to build the classification predictive model. This means that you must:
- Select a data source.
- Select the variables that best describe your use case. Maybe it is necessary to create new variables with a formula based on existing variables, or to aggregate some variables together. You can do this thanks to your knowledge of the application domain.
Each of these variables must be unique in the dataset.
During dataset preparation, you may need to filter your data to discard records that will not help with your use case. For example, filtering out customers who have already churned is necessary if you want to predict those who might churn! If you keep them, the population will be wrong, and the predictions will be wrong.
Among the variables available in your dataset, there is one which has a specific role: the target variable, which represents the event to be predicted. For example, in the churn detection use case, the target variable is the decision of the customer to churn or stay.
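To make the preparation step more concrete, here is a minimal Python sketch of the two ideas above: filtering out records that are out of scope, and deriving a binary target. The records and the column names (`status`, `left_within_3_months`) are made up for illustration; in Smart Predict itself, you prepare the dataset in the product rather than in code.

```python
# Hypothetical customer records; the column names are illustrative only.
customers = [
    {"id": 1, "status": "active", "left_within_3_months": True},
    {"id": 2, "status": "churned", "left_within_3_months": None},
    {"id": 3, "status": "active", "left_within_3_months": False},
]


def prepare(rows):
    """Keep only customers still at risk and add a binary target column."""
    prepared = []
    for row in rows:
        if row["status"] == "churned":
            # Already churned: not part of the population we want to score.
            continue
        target = 1 if row["left_within_3_months"] else 0
        prepared.append({"id": row["id"], "target": target})
    return prepared


training_rows = prepare(customers)
```

The point of the sketch is only that the population is fixed first and the target is then a single 0/1 column on that population.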
Use case: who could churn?
Let’s take a classic human resources use case to illustrate the explanations in the next sections: a Human Resources manager would like to improve the HR policies in his/her company. The goals are:
- Improve employee satisfaction to reduce resignations,
- Reduce cost of training and ramp-up of new employees, and
- Hire better talent.
The objectives that Smart Predict can help with are:
- Understand reasons for disaffection,
- Act on these reasons proactively to avoid losing employees,
- Identify which employees might potentially leave the organization.
Classification modeling process
Theoretical explanations about Smart Predict classification
To understand how Smart Predict generates a classification predictive model and what information the debrief provides, some theoretical background is necessary. The purpose is not to give you a formal course, but to present the main principles.
The principle of SRM (Structural Risk Minimization) is to find the function that best reproduces the target behavior.
Finding such a function (or predictive model) is one thing. The next question is: what is a good predictive model?
If the curve “f” goes through all the points, the predictive model overfits and cannot correctly predict new cases. The predictive model is not robust.
Conversely, if the predictive model is too simple, it underfits. It is highly robust, but the precision of its predictions is low.
A good predictive model is a compromise between the two. The complexity of the class of functions is called the VC dimension. It is used to determine the best predictive model.
In short, this is based on 4 principles:
- Consistency: If the number of cases is large enough, then the error on new cases is identical to the learning error. The predictive model is qualified as robust. It can be trusted when applied to new cases.
- Convergence speed: The ability to generalize increases when the number of cases increases. Smart Predict is sensitive to the size of the training dataset. Predictive models are robust if the number of cases is high enough.
- Generalization Capacity Control: If the ratio of the number of cases to the VC dimension is large, the predictive model generalizes well. Both the empirical risk of error (the accuracy) and the generalization error (the robustness) of the predictive model are minimized. But if the ratio is small, the predictive model overfits. There is an optimal balance between training and generalization errors, as shown in figure 4.
- Good Algorithm Strategy: Figure 4 shows that the best compromise between accuracy (Remp small) and robustness (Rgen small) is obtained with the complexity h where the generalization error Rgen is at its minimum. SRM theory recommends progressively increasing h and building a succession of predictive models until the minimum of Rgen is found.
Based on these 4 principles, the training dataset is split in two parts:
- A training subset used to build a model
- A validation subset used to measure accuracy and robustness of the model
The process of building a classification predictive model iterates model construction with a greater VC dimension at each step, until the error on the validation subset increases. The final predictive model is the best model obtained at the previous step, as illustrated in figure 5.
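The iteration described above can be sketched in a few lines of Python. This is only an illustration of the stopping rule, not Smart Predict's internal code: `validation_error` is a hypothetical callback that would train a model of complexity h and return its error on the validation subset, and the error values below are invented.

```python
def select_model(validation_error, max_complexity):
    """Increase the complexity h step by step and stop as soon as the
    validation error goes up. Return the best h and its error."""
    best_h = 1
    best_err = validation_error(1)
    for h in range(2, max_complexity + 1):
        err = validation_error(h)
        if err > best_err:
            # The validation error started to increase:
            # keep the model obtained at the previous step.
            break
        best_h, best_err = h, err
    return best_h, best_err


# Hypothetical validation errors for h = 1..6 (made-up numbers).
errors = {1: 0.30, 2: 0.22, 3: 0.18, 4: 0.17, 5: 0.19, 6: 0.25}
best_h, best_err = select_model(lambda h: errors[h], max_complexity=6)
```

With these invented errors, the loop stops at h = 5 and keeps the model built at h = 4, which is the behavior the paragraph describes.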
Data encoding is the component which quickly and automatically transforms raw data into a “mineable” source of information. It automates data preparation and pre-processing:
- Encode variables for modeling algorithms,
- Transform continuous variables to capture non-linear relationships in the data,
- Compress variables by grouping categories,
- Automatically handle missing values and outliers. Missing values are replaced by a specific constant. Thus, they are treated by the predictive model as any other category.
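Here is a small Python sketch of what this kind of encoding can look like. It is an assumption-laden illustration, not the actual Smart Predict encoder: the constant `MISSING`, the `OTHER` group for rare categories, and the age band edges are all chosen for the example.

```python
from bisect import bisect_left

MISSING = "MISSING"  # hypothetical constant, not Smart Predict's actual value


def encode_category(value, frequent_categories):
    """Map a missing value to a dedicated category and group rare ones."""
    if value is None or value == "":
        return MISSING
    return value if value in frequent_categories else "OTHER"


def bin_continuous(value, edges):
    """Return the index of the band ]lower, upper] that contains the value.

    `edges` must be sorted. Band 0 is everything up to edges[0] and the
    last band is everything above edges[-1].
    """
    if value is None:
        return MISSING
    return bisect_left(edges, value)


age_edges = [21, 48, 53]  # illustrative band edges
encoded_age = bin_continuous(48, age_edges)
encoded_job = encode_category(None, {"Sales", "Engineering"})
```

Treating a missing value as a category of its own means the model can learn from the fact that the value is absent, instead of dropping the row.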
To produce a predictive model, Smart Predict proceeds like this:
- It does a pre-variable exclusion to eliminate upfront low explanatory or unstable variables.
- It does a post-variable selection where the predictive model complexity reduction is controlled by the scoring performance.
- Finally, for the classification, the score is converted into a probability.
Understand the outputs of a Smart Predict classification
Build a classification predictive model
In this section, I go back to the use case mentioned above. First, I create a Classification Predictive Scenario named “HR Flight Risk”. In the settings used to build the predictive model, notice that the training data source is the training dataset. I don’t need to specify the split between the training and validation subsets. Smart Predict does the job for me with the default ratio (75% of the rows picked at random for training and the remaining 25% for validation).
The target is the variable “Flight_Risk”, where value 1 means “will churn” and value 0 means “will stay”.
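For readers who want to picture the default split, here is a minimal Python sketch of a 75/25 random split. The function name and the fixed seed are my own choices for the example; Smart Predict performs this step for you.

```python
import random


def split_rows(rows, train_ratio=0.75, seed=42):
    """Shuffle the rows and cut them into training and validation subsets."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, valid = split_rows(range(100))
```

Every row ends up in exactly one of the two subsets, so the validation subset is data the model has never seen during training.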
Note that some variables are excluded from the analysis:
- “Employment Status” because it is highly correlated to the target on the training dataset. This type of variable is called a leaker variable.
- “Successor_Readiness” because employees who are listed as inactive are out of the scope of the flight risk study!
Now, I save and train the predictive model. After a few seconds, Smart Predict has completed the training step. The status is displayed as in figure 7.
To assess whether a predictive model can be used, we need to know its quality and its robustness. This means that the predictive model must detect the target consistently (robustness) and with enough value (quality). The robustness is measured by the Prediction Confidence, while the quality is measured by the Predictive Power.
These two indicators are computed from the graph shown in figure 8, which represents how well the “positive” cases are detected.
The X axis represents the overall population and the Y axis represents the positive cases.
The red line corresponds to random picking, which means that there is no predictive model. It’s like playing heads or tails. In contrast, the green curve represents a perfect predictive model.
The orange and blue curves correspond to the percentage of positive cases when ordering by decreasing score / probability.
Predictive Power represents how close the predictive model is to the perfect model. It is the area between the Validation and Random curves divided by the area between the Perfect and Random curves.
Predictive Power = C/(A+B+C)
The role of the Predictive Power is to give an idea of the quality of the predictive model. It is based on the predictions made on the validation subset: it compares the detection curve obtained on that subset with the random and perfect curves. Its range is between 0 and 1:
- When it takes the value 0, the predictive model does no better than random picking on the validation subset. The quality of the predictive model is very bad. You need to rework your data preparation and add variables to your dataset.
- When it takes the value 1, the predictive model is as good as the perfect model (the “wizard”). It looks perfect, but experience shows that this is too good to be true. You need to look at your data again and check whether there is a dependency between some variables and the target. An acceptable Predictive Power should be between 0.75 and 0.96.
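To illustrate the area ratio behind the Predictive Power, here is a small Python sketch. It assumes that the detection curve is built by sorting the cases by decreasing score and accumulating the share of positive cases found, and that areas are computed with the trapezoidal rule; Smart Predict's own computation may differ in its details. All function names are mine.

```python
def gain_curve(scores, labels):
    """Share of positives detected while going through the cases by
    decreasing score. The curve starts at 0.0 and ends at 1.0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    curve, found = [0.0], 0
    for i in order:
        found += labels[i]
        curve.append(found / total_positives)
    return curve


def area(curve):
    """Trapezoidal area under a curve sampled at evenly spaced x in [0, 1]."""
    steps = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(steps)) / steps


def predictive_power(scores, labels):
    validation_area = area(gain_curve(scores, labels))
    # Perfect model: ranking by the true labels puts all positives first.
    perfect_area = area(gain_curve(labels, labels))
    random_area = 0.5  # area under the diagonal
    return (validation_area - random_area) / (perfect_area - random_area)


# Tiny made-up validation subset: 4 cases, 2 positives.
power = predictive_power([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

On this toy subset, one positive case is ranked below a negative one, so the model recovers only half of the gap between random picking and the perfect ranking.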
Prediction Confidence expresses the ability to reproduce the same detection with a new dataset. A “validation sample” is necessary to estimate it: it represents another view of the same population. Prediction Confidence is equal to one minus the area between the Validation and Training curves divided by the area between the Perfect and Random curves.
Prediction Confidence = 1 - B/(A+B+C)
The role of the Prediction Confidence is to measure whether the predictive model can make predictions with the same reliability when new cases arrive. If these new cases look like the cases of the training dataset, then the Prediction Confidence will be good. An acceptable Prediction Confidence should be greater than 0.95. Below this value, you should consider increasing the number of cases in the training dataset to cover more situations.
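The formula can be turned into a short Python sketch. It assumes that the areas under the training, validation and perfect curves are already known, and it takes the absolute difference between the training and validation areas as the area between the two curves, which is a simplification on my part. The area values are invented for the example.

```python
def prediction_confidence(training_area, validation_area, perfect_area,
                          random_area=0.5):
    """One minus the training/validation gap, relative to the gap between
    the perfect model and random picking."""
    gap = abs(training_area - validation_area)  # simplifying assumption
    return 1 - gap / (perfect_area - random_area)


# Hypothetical areas under the curves (made-up numbers).
confidence = prediction_confidence(
    training_area=0.80, validation_area=0.79, perfect_area=0.94
)
is_acceptable = confidence > 0.95
```

The closer the training and validation curves are to each other, the smaller the gap and the closer the indicator is to 1.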
Another part of the debrief of a classification is the contributions of variables. They are displayed sorted by decreasing importance. The most contributive ones are the ones that best explain the target. The sum of the contributions equals 100%.
For our use case, the organization unit is the most contributive variable. But let’s continue the analysis of the debrief.
If you scroll down in the previous figure, you get a detailed report about the influencer contributions.
The variable “Age” also influences the target. The graph of figure 10 shows that people between 53 and 70 and people between 21 and 48 have a positive influence on the target. This means that if an employee is in one of these age ranges, his/her flight risk will be higher than if he/she is in the range ]19, 21]. These ranges correspond to people who are near retirement (]53, 70]) and to very active and senior people (]21, 48]). But can we act on age to avoid flight risk? Or do we want to act to retain some age categories? Let’s examine the variable “Successor”.
When an employee could be a successor but is not yet ready to succeed another employee (because he/she is still junior in the position, for example), the risk of seeing him/her leave is higher. This can be seen as the impact of poor career management.
Conversely, when an employee has not been identified as a successor, and for the other values of the variable “Successor”, the risk of seeing the employee leave the company is lower.
Finally, look at the job family influencer.
It appears that most job families are more likely to be flight risks than Administrative Support. It seems that holding a Director or Senior Manager position increases the risk of employee churn compared to other jobs.
Smart Predict also offers a tool called Confusion Matrix. It helps you navigate along the curve of the percentage of detected target. You choose a threshold on the percentage of the population, and the classification predictive model tells you, within this threshold, the percentage of the population that answers positively.
For example, if you select the 30% of employees with the highest attrition probability because your budget is limited, the classification model tells you that within these 30%, you capture 65% of the employees who will probably leave. You can focus your actions on reducing attrition among these. But even if you raise the threshold to 45%, you capture only 80% of them.
You can select with the slide bar the 30% of employees with the highest attrition probability and see the percentage of employees you will capture.
Note that you can also do the opposite action and select first the percentage of employees you want to capture and see the percentage of employees to contact with the highest attrition probability.
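Both directions of this lookup can be sketched in Python. The scores and labels below are invented, and the two function names are hypothetical; the sketch only mirrors what the slide bar lets you read in the product.

```python
def detected_share(scores, labels, contacted_fraction):
    """Share of the actual positives found among the top-scored cases."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:round(len(order) * contacted_fraction)]
    return sum(labels[i] for i in top) / sum(labels)


def fraction_to_contact(scores, labels, wanted_share):
    """Smallest fraction of cases to contact to detect the wanted share."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    found = 0
    for rank, i in enumerate(order, start=1):
        found += labels[i]
        if found / total_positives >= wanted_share:
            return rank / len(order)
    return 1.0


# Ten made-up employees: attrition probability and what actually happened.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

share_at_30 = detected_share(scores, labels, 0.30)
needed_for_75 = fraction_to_contact(scores, labels, 0.75)
```

In this toy data, contacting the top 30% finds half of the leavers, and finding three quarters of them requires contacting the top 40%.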
The confusion matrix will show you the performance of the predictive model by comparing the predicted values of the target with its actual values.
Here are a few definitions to help you interpret this confusion matrix. For the full set of definitions, read the corresponding section of the online help.
The main diagonal represents the True Positive and the True Negative. In other words, it is where the classification predictive model makes correct predictions.
- True Positive is when the classification model predicts 1 and it is actually 1.
- True Negative is when the classification model predicts 0 and it is actually 0.
The second diagonal represents False Positive and False Negative. In other words, it is where the classification model makes incorrect predictions.
- False Positive is when the classification model predicts 1 but it is in fact 0.
- False Negative is when the classification model predicts 0 but it is in fact 1.
Here is how to read this confusion matrix. When you contact the 30% of employees (blue rectangle) with the highest attrition probability, the classification rate is 73.86% (green rectangle). It represents the percentage of employees correctly classified by the predictive model. The percentage of true positives is 8.12% (yellow rectangle), whereas the percentage of employees who will leave (the actual positives) is 12.4% (purple rectangle). The sensitivity (red rectangle) measures the percentage of employees who will leave that have been correctly classified. Here it is 65.05%.
Similarly, the specificity (light blue rectangle) measures the percentage of employees who will stay (actual negative) that have been correctly classified (true negative). Here it is 75.04%.
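The relations between these figures can be checked with a few lines of Python. The sketch below recomputes the indicators from the three percentages quoted above (contacted, true positive, actual positive), using the standard definitions of the confusion matrix cells. Because those displayed percentages are rounded, the results land within about half a percentage point of the values shown in the product rather than matching them exactly.

```python
def confusion_metrics(contacted, true_positive, actual_positive):
    """All arguments are percentages of the whole population."""
    false_negative = actual_positive - true_positive
    true_negative = 100 - contacted - false_negative
    return {
        # Correct predictions: true positives plus true negatives.
        "classification_rate": true_positive + true_negative,
        # Share of actual positives that were predicted positive.
        "sensitivity": 100 * true_positive / actual_positive,
        # Share of actual negatives that were predicted negative.
        "specificity": 100 * true_negative / (100 - actual_positive),
    }


metrics = confusion_metrics(contacted=30, true_positive=8.12,
                            actual_positive=12.4)
```

Reading the matrix this way shows that moving the threshold changes all three indicators at once, since the four cells always add up to 100% of the population.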
Taking action to reduce attrition is not free, and you may want to know the impact and cost of your decisions. This is the objective of the Profit Simulation.
A cost can be associated with each situation. In some use cases, when predictions are incorrect, the cost of a False Negative can be higher than the cost of a False Positive because you lose money: failing to keep a valuable employee costs more than motivating an employee who wants to stay.
The difficulty is to estimate a relevant threshold (30% in our use case). Once you have set the cost of good predictions and the cost when you predict that an employee will actually leave (yellow rectangle in the figure below), you can use the profit simulation and move the slide bar to change the percentages of contacted employees and detected target.
The role of the profit simulation is to help you choose a threshold. One way to choose that threshold is to optimize the profit. In our use case, with these costs, the best profit (click the “Maximize Profit” button) is obtained when you contact 19.9% of the employees.
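To give an idea of what maximizing the profit means, here is a deliberately simplified Python sketch. It assumes a fixed cost for each contacted employee and a fixed gain for each contacted employee who would really have left; these two parameters, the data, and the function name are my own assumptions and do not reproduce the exact cost settings of the Profit Simulation.

```python
def best_threshold(scores, labels, gain_per_true_positive, cost_per_contact):
    """Contact employees by decreasing score and return the fraction of the
    population that gives the highest cumulated profit, with that profit."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    profit, best_profit, best_rank = 0, 0, 0
    for rank, i in enumerate(order, start=1):
        profit += labels[i] * gain_per_true_positive - cost_per_contact
        if profit > best_profit:
            best_profit, best_rank = profit, rank
    return best_rank / len(order), best_profit


# Ten made-up employees: attrition probability and what actually happened.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

fraction, profit = best_threshold(scores, labels,
                                  gain_per_true_positive=10,
                                  cost_per_contact=4)
```

With these invented numbers, the profit peaks when the top 40% of employees are contacted; beyond that point, each additional contact costs more than it brings in.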
Using a classification predictive model
Until now, we have seen how a classification model is built, what kind of insights Smart Predict provides, and how to interpret and use this information to better understand the data and take appropriate actions. There is another use of a classification model: applying it to new cases. This means that you use the predictive model on a new dataset, where the target variable is unknown.
In the human resources use case, the classification predictive model gives the probability that an employee will leave the company. Let’s see how this works.
Select a classification model and click on the “Apply” icon. In the dialog you enter:
- The data source. It’s a dataset with the same information as the training dataset, but where the values of the target variable (here Flight_Risk) are unknown, as this is what you want to determine.
- The output is also a dataset. You give it a name and a location.
- The replicated columns are the variables of the input dataset you want to retrieve in the output dataset.
- To this output, you can add other statistical columns. The minimum to be useful is to add the probability of the predicted category.
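Conceptually, the apply step produces one output row per input row. The Python sketch below shows the idea with a stand-in scoring function. The output column names (`Predicted_Category`, `Probability`), the 0.5 decision threshold, and the toy scoring rule are all assumptions for illustration, not what Smart Predict generates.

```python
def apply_model(rows, predict_probability, replicated_columns, threshold=0.5):
    """Build the output dataset: replicated columns, predicted category,
    and the probability of that predicted category."""
    output = []
    for row in rows:
        p_leave = predict_probability(row)
        predicted = 1 if p_leave >= threshold else 0
        out = {column: row[column] for column in replicated_columns}
        out["Predicted_Category"] = predicted
        out["Probability"] = p_leave if predicted == 1 else 1 - p_leave
        output.append(out)
    return output


def toy_model(row):
    """Stand-in for the trained model: a made-up rule, not a real one."""
    return 0.8 if row["Successor"] == "Not ready" else 0.25


new_employees = [
    {"Employee_ID": "E1", "Successor": "Not ready", "Age": 35},
    {"Employee_ID": "E2", "Successor": "None", "Age": 20},
]
predictions = apply_model(new_employees, toy_model, ["Employee_ID"])
```

Replicating an identifier column such as the employee ID is what lets you join the predictions back to the rest of your data later.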
Once this process is completed, a message is displayed in the status pane.
Let’s now examine the predictions generated in the dataset HR_Flight_Risk_Predictions.
In SAP Analytics Cloud, browse to the location where you have stored this output dataset and open it. The last columns of the dataset contain the predictions.
While it is great to see each individual employee, it may be easier for us to consume this information through visualizations! To do this, we need to create a BI model on top of the output dataset to use the data in a BI or planning story, as shown in figure 20.
Congratulations, you have reached the end of this blog. I hope that I have clarified how Smart Predict does classification, improved your understanding of the insights provided in the debrief, and shown how to use the predictions in a story. I also hope these explanations increase your confidence in the product.