Predicting Churn with Machine Learning
Predicting churn with Machine Learning: the essential steps in terms of data collection, machine learning techniques, and evaluation.
Customer Retention and Machine Learning: Opportunities and Challenges
Retaining your customers in a competitive landscape is a challenging problem for most companies. Repeated sales to customers are generally a great sign: beyond the obvious fact that they positively impact your bottom line, they give an indication that your customer is loyal, and perhaps even so loyal that they are generating new leads and customers for you. On the flip side, customers who leave can be a source of bad publicity or a sign that something is amiss in your organization. It’s not surprising then that customer retention is a top-of-mind concern for most executives. Nevertheless, it’s often hard for executives to understand why customers stay (or leave), and to confidently take action to actively decrease customer churn.
Net Promoter Score
The most common practice to understand customer retention and loyalty is the use of the “Net Promoter Score” (NPS): a recurrent survey that asks customers how likely they are to promote the services or products to people they know. There are a number of reasons for the popularity of NPS: firstly, it is one of the few retention metrics that is easy to administer. Satisfaction surveys tend to have very low response rates, and as such, asking just a single question helps to maximize the chance that you will get feedback from customers. Secondly, the NPS has been researched thoroughly. This means that there’s evidence that NPS correlates with revenue growth, giving confidence to executives that it’s a worthwhile metric to attempt to improve.
Net Promoter Score Weaknesses
Nevertheless, the metric has three fundamental weaknesses. Firstly, as with all surveys, it is surprisingly hard to obtain both the quantity of responses and the reach across your customer base needed to make the score meaningful. Many companies’ NPS is based on their online customers (the place where it’s easiest to administer the survey), but this can hide the fact that digitally savvy customers might behave very differently, and experience your company very differently, than non-digital natives. Furthermore, most companies have extremely low response rates, making any conclusion drawn from their NPS shaky at best. Secondly, and more fundamentally, the NPS doesn’t tell you why customers are loyal. Is it the new training you’ve given to your support staff? The expensive marketing action regarding your new product? The quality of your latest product? Without deeper insights into the actions that matter to customers, executives are left in the dark as to which action to take to best move their company forward. Finally, the NPS doesn’t give a reliable indication of how improving it will affect your bottom line. Because the correlations between revenue growth and NPS are weak, an executive cannot just translate an “x%” increase in NPS to a “y%” increase in revenue growth.
The value of Customer Behaviour Data
So given all of these weaknesses of NPS, what’s an executive to do? We argue that rather than looking at what their customers say, most companies would benefit from looking at what their customers do. Customers’ behavior, and to a certain extent customers’ characteristics, are a much richer source of insights into customer loyalty and retention than NPS. For example, a customer contacting your support desk multiple times with the same issue without a satisfactory answer might be a great predictor of customer churn, or a customer increasing their purchases with you over time might be a great indicator of increasing loyalty. Nevertheless, this much more complex, multifactorial analysis of customer loyalty is often hard to come by and hard to make actionable.
Luckily, machine learning algorithms can help with this in two specific ways. Firstly, machine learning algorithms can create a full listing of the reasons why customers churn, with a quantified business impact for each of these reasons. For example, an algorithm might predict, with a certain margin of error, that one of the reasons why customers’ risk of churn increases is that their support ticket is not treated within 2 business days. The algorithm might further predict that reducing the average time it takes to treat a ticket by 1 business day would generate a positive bottom-line impact of 180k EUR per year. This number then allows executives to evaluate whether reducing the treatment time of support tickets is a worthwhile goal. This detailed understanding of the reasons behind churn is an extremely powerful tool that helps executives guide their actions in a quantitative, measurable way. Secondly, machine learning algorithms can make individual predictions of each customer’s loyalty and risk of churn. These loyalty and churn scores can help marketing and sales executives to craft extremely personalised campaigns, and help support professionals to be both more proactive and more tailored in their approaches to customers.
Given the benefits of using machine learning for loyalty and churn analysis, how does one go about actually creating these predictions? There are three intertwined elements to doing so:
- Collecting data,
- Selecting and training one or multiple algorithms on this data, and
- Measuring how well the trained algorithm makes a prediction on data it hasn’t previously seen before.
In the rest of this article, we take you through each of these three elements in the context of churn prediction.
In order to make the data collection process efficient, we recommend organizing the data sources using two axes: data availability, and expected explanatory power. The first axis is self-explanatory: some data about customer behavior and characteristics is easier to collect than other data. Usually, we see that some internal data is better structured and easier to extract from our customers’ systems than other data. Furthermore, we can often use external data, either from public data sources or purchased from private partners, to create a more complete data map. Here too, some data is more easily available than others. The second axis, expected explanatory power, measures the expected “return on data” for each data source. It is evaluated by interviewing executives and experienced team members and by drawing on the experience of machine learning practitioners. Once the data is organized along these two axes, it is easy to pick the most promising sources to prepare and feed to an algorithm for training.
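The two-axis prioritization can be sketched in a few lines of Python. This is only an illustration: the `DataSource` class, the 1 to 5 scales, the product score and the listed sources are all assumptions made for the example, not a description of a specific methodology.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    availability: int       # assumed scale: 1 (hard to extract) .. 5 (readily available)
    explanatory_power: int  # assumed scale: 1 (low expected value) .. 5 (high expected value)


def priority(source):
    """Combine both axes into one score (a simple product, chosen for illustration)."""
    return source.availability * source.explanatory_power


def rank_sources(sources):
    """Return the sources ordered from most to least promising."""
    return sorted(sources, key=priority, reverse=True)


# Hypothetical sources and scores, invented for this example.
sources = [
    DataSource("purchase history", availability=5, explanatory_power=3),
    DataSource("support tickets", availability=4, explanatory_power=5),
    DataSource("purchased third-party demographics", availability=2, explanatory_power=3),
    DataSource("call-centre audio", availability=1, explanatory_power=4),
]

ranked = rank_sources(sources)
for source in ranked:
    print(f"{source.name}: {priority(source)}")
```

Any other monotone combination of the two axes (for example a weighted sum) would serve the same purpose. The point is simply to make the ranking explicit before investing in data preparation.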
Algorithm selection and training- supervised and unsupervised learning
Many algorithms exist to train on data, and selecting the right algorithm for the right task is a non-trivial problem. Within these algorithms, several families exist, of which “supervised learning” is the one that is the most frequently used within the realm of churn and loyalty. This family of algorithms is trained on data that has both “features” and an “objective”. In the case of churn prediction, a feature might be “has this customer contacted the help desk in the last two months” and the objective “has this customer cancelled their subscription with us”. Usually, these datasets contain many, many features (as a consequence of the data collection), and not all features might be as important to predict the objective. Furthermore, some features might interact in subtle ways: for example, customers who contacted the help desk in the last two months (feature 1) but who had a swift response (feature 2) might actually have a lower risk to churn than customers who never contacted the helpdesk at all. It’s the algorithm’s goal to discover all the subtle interactions and paint the correct picture.
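To make the idea of interacting features concrete, here is a minimal sketch that groups a handful of customers by two features and computes the churn rate per group. The records are invented purely for illustration and the helper name `churn_rate_by_group` is hypothetical. A real supervised learning algorithm would discover such patterns itself rather than have them tabulated by hand.

```python
from collections import defaultdict

# Invented toy records, for illustration only.
# Each row: (contacted_helpdesk, swift_response, churned)
# The first two values are features, the last one is the objective.
customers = [
    (True, True, False),
    (True, True, False),
    (True, True, False),
    (True, False, True),
    (True, False, True),
    (True, False, False),
    (False, False, False),
    (False, False, True),
    (False, False, False),
    (False, False, False),
]


def churn_rate_by_group(rows):
    """Churn rate for every observed combination of the two features."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for contacted, swift, did_churn in rows:
        key = (contacted, swift)
        totals[key] += 1
        churned[key] += int(did_churn)
    return {key: churned[key] / totals[key] for key in totals}


rates = churn_rate_by_group(customers)
for (contacted, swift), rate in sorted(rates.items()):
    print(f"contacted={contacted}, swift_response={swift}: churn rate {rate:.2f}")
```

In this toy data, customers who contacted the help desk and got a swift response churn less often than customers who never contacted it, while those who contacted it without a swift response churn the most. Neither feature tells the full story on its own.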
Within this family of supervised learning algorithms, two subsets exist:
- Classification algorithms (whose goal is to predict an outcome that has distinct classes, such as “will churn” or “won’t churn”), and
- Regression algorithms (whose goal is to predict an outcome that is “continuous”, such as “in how many days will this customer churn”).
An important challenge of churn classification, which also exists in problems like fraud detection, is “class imbalance”. More specifically, in most cases, there are fewer customers who churn than customers who do not churn. Machine learning algorithms usually expect a balanced distribution of classes. While the abundance of one class over the other does not hinder the learning task by itself, it does amplify a number of complexities that need to be dealt with, such as noise and class overlap. Fortunately, the machine learning community has developed specific algorithms to deal with class imbalance, a field also called imbalanced learning. At Kantify, we try to make the best out of this research field and existing machine learning algorithms to extract value from your data.
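One common and simple way to counter class imbalance is to give errors on the rare class more weight during training. The sketch below computes inverse-frequency class weights. It is shown as one possible technique among many (resampling is another) and is not necessarily the approach used in any particular project. The function name `balanced_class_weights` and the 90/10 label split are assumptions for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label vector: 0 = remains, 1 = churns.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each churner counts nine times as much as each non-churner, so both classes contribute equally to the total training loss.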
Another family of algorithms is called “unsupervised learning”, and is less frequently used within the churn context, but is worth mentioning in passing. This family of algorithms is trained on data that only has “features”, and aims to group together the data in segments that are as similar as possible.
Measuring the algorithm’s performance
In order to measure how well an algorithm has learned from training data, we measure how well it makes a prediction on data the algorithm has never seen before (called “out of sample” data) for which we know the true value. An extremely simplified example of a question put to an algorithm might be: “given that a customer has contacted the helpdesk and their question was resolved promptly, have they cancelled their subscription?” We then compare the algorithm’s answer with the real result to evaluate the performance of our algorithm.
Supervised learning comes with a long list of metrics dedicated to evaluating the performance of machine learning algorithms. In the case of binary classification, most of them are derived from a confusion matrix. The confusion matrix is central to all classification problems and is important for stakeholders involved in implementing churn prediction AI pipelines. Figure 1 below is a confusion matrix.
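A minimal sketch of out-of-sample evaluation: part of the data is set aside before training, and the trained model is scored only on that held-out part. The helper names, the 80/20 split and the placeholder "model" are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as out-of-sample data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def out_of_sample_score(predict, test_rows):
    """Fraction of held-out rows where the prediction matches the true value."""
    correct = sum(1 for features, actual in test_rows if predict(features) == actual)
    return correct / len(test_rows)


# Hypothetical data: (customer_id, churned), where one customer in ten churns.
rows = [(customer_id, customer_id % 10 == 0) for customer_id in range(100)]
train, test = train_test_split(rows)


# Placeholder "model" that always predicts "remains"; a real model would be
# fitted on `train` only and never see `test` during training.
def predict(features):
    return False


score = out_of_sample_score(predict, test)
print(len(train), len(test), score)
```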
| | Remains - actual | Churns - actual |
|---|---|---|
| Remains - predicted | True Negatives (TN) | False Negatives (FN) |
| Churns - predicted | False Positives (FP) | True Positives (TP) |
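The four cells of the confusion matrix can be counted directly from paired actual and predicted labels. The sketch below treats "churns" as the positive class, in line with the table above. The function name and the six example labels are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN and FN, with True meaning "churns" (the positive class)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1  # predicted churn, customer churned
        elif p and not a:
            fp += 1  # predicted churn, customer remained
        elif not p and a:
            fn += 1  # predicted remain, customer churned
        else:
            tn += 1  # predicted remain, customer remained
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Invented labels, for illustration only.
actual = [True, False, True, False, False, True]
predicted = [True, False, False, True, False, True]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```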
The performance of many machine learning models is measured by “accuracy”, which is computed from the confusion matrix as:
Accuracy = (TP + TN) / (TP + FP + TN + FN)
In other words, accuracy shows the number of instances correctly classified out of all instances. While this single metric suits many situations, it is not the best fit when one faces class imbalance, as in churn prediction. A simple example shows the limitations of accuracy in this case - a dataset made of 100 observations among which only 10 customers churn. A naive algorithm might predict that not a single customer churns. This algorithm would yield the following confusion matrix:
| | Remains - actual | Churns - actual |
|---|---|---|
| Remains - predicted | TN = 90 | FN = 10 |
| Churns - predicted | FP = 0 | TP = 0 |
This bad algorithm results in an accuracy of 90% - clearly not what we’d expect. Fortunately, other metrics exist.
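The example above is easy to reproduce. The sketch below builds the 100 observations (10 churners, 90 non-churners), applies the naive "nobody churns" prediction, and computes the accuracy. The variable names are chosen for the illustration.

```python
# True means "churns". 10 churners and 90 non-churners, as in the example.
actual = [True] * 10 + [False] * 90

# The naive algorithm predicts that not a single customer churns.
predicted = [False] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Number of actual churners the algorithm managed to flag.
churners_caught = sum(a and p for a, p in zip(actual, predicted))

print(f"accuracy = {accuracy:.0%}, churners caught = {churners_caught}")
```

The accuracy is 90% even though not a single churner is identified, which is exactly why accuracy alone is misleading on imbalanced data.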
Sensitivity and Specificity
Sensitivity and specificity are class-specific metrics and they are simply derived as:
Sensitivity = TP / (TP + FN) = 0 / (0 + 10) = 0%
Specificity = TN / (TN + FP) = 90 / (90 + 0) = 100%
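As a small sketch, both metrics can be written as functions of the confusion matrix counts. The guard against an empty class, which returns NaN, is an assumption added for robustness and is not part of the definitions above.

```python
def sensitivity(tp, fn):
    """Share of actual churners that were predicted to churn."""
    total = tp + fn
    return tp / total if total else float("nan")


def specificity(tn, fp):
    """Share of actual non-churners that were predicted to remain."""
    total = tn + fp
    return tn / total if total else float("nan")


# Counts from the naive "nobody churns" example.
sens = sensitivity(tp=0, fn=10)
spec = specificity(tn=90, fp=0)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```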
The algorithm in our example above is actually (naively) very good at one task, namely correctly predicting who is not going to churn (specificity), but terrible at another task, namely who is going to churn (sensitivity). Depending on the actions to take upon predictions, it is important to establish from the start the metric that needs to be optimized. For instance, sending emails to a selected list of potential churners is not expensive and thus the quality of the specificity matters less than the sensitivity: you simply want to catch all churners regardless of the non-churners being contacted. In this specific case, we might not even need an algorithm: sending a mail to everyone might just be a good strategy. On the other hand, setting up deeper retention campaigns such as discounts or pricing updates might require achieving both great sensitivity and great specificity.
A good metric that balances both sensitivity and specificity is called “balanced accuracy”: it simply computes the arithmetic average of the sensitivity and specificity:
Balanced accuracy = ½ * (Sensitivity + Specificity) = ½ * (0 + 100) = 50%
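To see how the two metrics can disagree, the sketch below compares the naive algorithm with a second, hypothetical model whose counts (TP = 8, FP = 18, TN = 72, FN = 2) are invented for this illustration.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all instances that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(tp, fp, tn, fn):
    """Arithmetic mean of sensitivity and specificity."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return 0.5 * (sens + spec)


# The naive "nobody churns" algorithm from the example above.
naive = dict(tp=0, fp=0, tn=90, fn=10)
# Hypothetical alternative model; its counts are invented for illustration.
candidate = dict(tp=8, fp=18, tn=72, fn=2)

for name, counts in [("naive", naive), ("candidate", candidate)]:
    print(
        f"{name}: accuracy = {accuracy(**counts):.0%}, "
        f"balanced accuracy = {balanced_accuracy(**counts):.0%}"
    )
```

Judged on accuracy alone, the naive algorithm (90%) appears to beat the candidate (80%). Balanced accuracy reverses the ranking (50% versus 80%) because it rewards the candidate for catching 8 of the 10 churners.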
Balanced accuracy gives a better indication that our naive algorithm is actually not that good at all, and is much more informative than regular accuracy in the case of churn. Many other metrics exist (F1-measure, AUC, …) and they are worth considering in a churn prediction pipeline that involves expensive retention actions.
Applications
Predicting churn using machine learning has many benefits for executives looking to work on customer retention and churn reduction. Through careful data selection and curation, model training and metric evaluation, it’s possible to create models that allow executives to make the biggest possible impact on their organization. Churn prediction helps you increase customer retention, improve customer experience and increase sales. Read more about the value of keeping the right customers [here/link].
Kantify is a team of specialists in business and machine learning, helping organisations maximise customer engagement using AI powered solutions. Our Headquarters are based in Belgium. We also operate in other countries. If you’re interested in discovering whether predicting churn makes sense in your organization, feel free to contact us.
If you would like to better understand how to define customer churn for your specific case and your data sources, contact us!