Sales Forecasting: Predict Your Sales Cycle Using Machine Learning
Business Context:
There is no shortage of methods to forecast sales. To demonstrate one of them, we look back at the Olist dataset from Kaggle. The methodology we will focus on analyzes the sales cycle to predict how long a sales lead might take to close. So, not only will we be able to predict whether a lead will close, but also how long the deal might take to close.
The benefits of sales forecasting are pretty straightforward:
Improved financial planning
More precise workload balancing at each level of the organization
Better insights into velocity or growth
It needs to be said that this type of sales forecasting might not work for every business model, or every data model for that matter.
In this post, the following will be covered:
Feature Engineering
Data Quality Improvements
Testing various models
Libraries & Reading in the Data:
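The original import and load code is not shown, so here is a minimal sketch. It assumes the standard file names from the Kaggle Olist marketing-funnel dataset; tiny stand-in files are written first so the snippet runs end-to-end, but in practice you would point `read_csv` at the downloaded files.

```python
import pandas as pd

# Stand-in files mimicking the Kaggle Olist marketing-funnel layout
# (replace with the real downloaded CSVs in practice).
pd.DataFrame({
    'mql_id': ['a1', 'a2', 'a3'],
    'first_contact_date': ['2018-01-05', '2018-02-10', '2018-03-20'],
    'origin': ['organic_search', 'paid_search', 'social'],
}).to_csv('olist_marketing_qualified_leads_dataset.csv', index=False)
pd.DataFrame({
    'mql_id': ['a1', 'a3'],
    'won_date': ['2018-03-01', '2018-06-15'],
    'business_segment': ['health_beauty', 'home_decor'],
}).to_csv('olist_closed_deals_dataset.csv', index=False)

# Marketing qualified leads: one row per lead
mql = pd.read_csv('olist_marketing_qualified_leads_dataset.csv')
# Closed deals: one row per lead that actually closed
closed = pd.read_csv('olist_closed_deals_dataset.csv')
```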
Next, we merge the funnel and the qualified leads datasets:
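The merge itself is not shown in this copy; a sketch of the join, using hypothetical minimal frames in place of the two Olist files, might look like the following. A left join on `mql_id` keeps every lead, closed or not:

```python
import pandas as pd

# Hypothetical minimal frames standing in for the two Olist datasets
mql = pd.DataFrame({
    'mql_id': ['a1', 'a2', 'a3'],
    'first_contact_date': ['2018-01-05', '2018-02-10', '2018-03-20'],
    'origin': ['organic_search', 'paid_search', 'social'],
})
closed = pd.DataFrame({
    'mql_id': ['a1', 'a3'],
    'won_date': ['2018-03-01', '2018-06-15'],
    'business_segment': ['health_beauty', 'home_decor'],
})

# Left join keeps every lead; leads that never closed get NaN
# in the deal columns, which we deal with later.
df = mql.merge(closed, on='mql_id', how='left')
```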
Data Cleaning and Feature Engineering
In this section, we focus on some initial cleaning and feature engineering on the dataset. One thing to note about this investigation: not many leads were actually closed, so we built some simulated data to increase the amount of data the model could be trained on.
An unfortunate deficiency of the Olist data is that there is no reliable source of revenue data per lead. While we can still complete the task, the case study would be closer to real-world if there were more samples with a richer context. Next, a copy of the dataframe is created along with some time-based features:
Intuitively, the contact-date information should be a predictor of how long a deal takes to close. These features will be especially important if any seasonality is present. Since we only have a year's worth of data in the set, it would be tough to make a judgment on seasonality. The most important part of the data prep is addressing the NAs that exist:
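The copy-and-feature-engineering step described above can be sketched as follows (column names such as `days_to_close` are assumptions, since the original code is not shown; the target is the gap in days between first contact and the won date):

```python
import pandas as pd

# Hypothetical merged frame with contact and close dates
df = pd.DataFrame({
    'first_contact_date': ['2018-01-05', '2018-02-10'],
    'won_date': ['2018-03-01', '2018-06-15'],
})

data = df.copy()
data['first_contact_date'] = pd.to_datetime(data['first_contact_date'])
data['won_date'] = pd.to_datetime(data['won_date'])

# Target: length of the sales cycle in days
data['days_to_close'] = (data['won_date'] - data['first_contact_date']).dt.days

# Time-based predictors derived from the contact date
data['contact_month'] = data['first_contact_date'].dt.month
data['contact_quarter'] = data['first_contact_date'].dt.quarter
data['contact_dayofweek'] = data['first_contact_date'].dt.dayofweek
```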
The next block of code addresses the NAs in the records for the dimensions above. Quite simply, a list of unique values is pulled from each dimension, and those values are then applied at random wherever the record is NA. This methodology tries to preserve as much of the data's native distributions as possible while also providing more data to train on.
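The random-fill step just described might be sketched like this (the helper name `fill_na_from_observed` is hypothetical; note that drawing uniformly from the unique values preserves the column's support rather than its exact frequencies):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical categorical column with missing values
df = pd.DataFrame({
    'business_segment': ['health_beauty', None, 'home_decor', None, 'health_beauty'],
})

def fill_na_from_observed(series, rng):
    """Replace NAs with values drawn at random from the column's
    observed unique values."""
    observed = series.dropna().unique()
    series = series.copy()
    mask = series.isna()
    series[mask] = rng.choice(observed, size=mask.sum())
    return series

df['business_segment'] = fill_na_from_observed(df['business_segment'], rng)
```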
The next block of code finishes off the feature engineering and data quality improvements:
This code fills our NA fields with data that is representative of some reality, which should be sufficient to demonstrate the use case effectively. A few more lines of code clean up the dataset:
Model Development, Train/Test/Split, & Defining X,y
In the code above, we one-hot encode the categorical variables. This is necessary because the algorithms we are going to use require numeric inputs, so each category becomes a binary indicator column. I have found that it is best to split the data and THEN one-hot encode X.
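A sketch of the split-then-encode approach, on a hypothetical cleaned frame, might look like the following. When encoding after the split, `align` guards against a category that appears in only one of the two sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned frame: predictors plus the target
df = pd.DataFrame({
    'origin': ['organic_search', 'paid_search', 'social', 'paid_search'] * 5,
    'contact_month': [1, 2, 3, 4] * 5,
    'days_to_close': [30, 90, 45, 120] * 5,
})

X = df.drop(columns='days_to_close')
y = df['days_to_close']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# One-hot encode AFTER the split; align the test columns to the
# training columns so both matrices share the same layout.
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
```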
Support Vector Regression Is Up First
First, support vector regression is going to be used to predict the sales cycle time. We had good accuracy with this algorithm in our classification exercise.
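The SVR training code is not shown here, so the following is a minimal sketch on synthetic stand-in data (the real post would use the encoded lead features and `days_to_close` target). SVR is scale-sensitive, so the features are standardized first:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded features and cycle length
X = rng.normal(size=(200, 5))
y = 100 + 20 * X[:, 0] + rng.normal(scale=10, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Standardize, then fit an RBF-kernel support vector regressor
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
svr.fit(X_train, y_train)
pred = svr.predict(X_test)

# RMSE is on the same scale as the target (days)
rmse = mean_squared_error(y_test, pred) ** 0.5
```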
Some interesting things to note here in this code:
The measured error is on the same scale as the target variable. So, in this case the SVR model is off by about 73 days. Not ALL the predictions are off by 73 days, but on average a prediction can be inaccurate by 73 days, and that is not good!
Pay attention to the mean of y_test, though: 112 days. The dataset itself has quite a bit of variation in the time to close. The fact that our average error is quite a few days less than the y_test mean is actually a positive sign.
Given the promise shown by this model, we can try tuning the hyper-parameters to improve accuracy:
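The tuning code itself is not shown; a sketch of the grid search on synthetic stand-in data might look like this. Each added parameter value multiplies the number of fits (here 3 × 2 × 2 settings × 5 CV folds = 60), which is why the real run over 1,500+ features takes hours:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded lead features
X = rng.normal(size=(100, 4))
y = 100 + 15 * X[:, 1] + rng.normal(scale=5, size=100)

# A small, deliberately modest grid; every extra value multiplies runtime
param_grid = {
    'C': [0.1, 1, 10],
    'epsilon': [0.1, 1],
    'kernel': ['rbf', 'linear'],
}

# n_jobs=-1 parallelizes the fits across all available cores
search = GridSearchCV(SVR(), param_grid,
                      scoring='neg_root_mean_squared_error',
                      cv=5, n_jobs=-1)
search.fit(X, y)
best = search.best_params_
```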
Note here:
GridSearchCV is very slow given the number of features in the dataset (there are over 1,500). We should probably apply some sort of dimensionality reduction to cut training time and make the process more efficient. SVR with GridSearchCV takes several hours to run in this investigation, which may not be tenable for some applications
GridSearchCV should be used smartly. The more variables that get added to the parameters the greater the training time
Setting n_jobs to -1 helps training time and optimizes the use of your machine's resources
Large outliers in the data create difficulty in making accurate predictions, hence the terrible mean squared error.
Try a Simple Linear Regression:
After the long training time and less-than-wonderful performance of the Support Vector Regression, maybe a simple linear regression will be more effective:
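The regression code is not shown in this copy; a minimal sketch on synthetic stand-in data, following the same fit/predict/score pattern, would be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
# Synthetic stand-in for the encoded features and days-to-close target
X = rng.normal(size=(200, 5))
y = 100 + 20 * X[:, 0] - 5 * X[:, 2] + rng.normal(scale=2, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Ordinary least squares: fast to fit, no hyper-parameters to tune
lr = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, lr.predict(X_test)) ** 0.5
```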
The Linear Regression model seems to perform the best so far. An RMSE of 1-3 days is pretty accurate and serviceable for sales forecasting. Still, it is worth seeing whether we can fine-tune the results via Ridge Regression.
Ridge Regression Model Training:
The Ridge model is not as performant! Using it as a baseline, we can use RidgeCV to see whether it is possible to improve the results. The RidgeCV model can be set up as follows:
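A sketch of the baseline Ridge fit followed by RidgeCV, again on synthetic stand-in data, might look like this. RidgeCV cross-validates the penalty strength over a grid of alphas rather than trusting a single fixed value:

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 100 + 20 * X[:, 0] + rng.normal(scale=3, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Baseline ridge with a fixed L2 penalty
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# RidgeCV selects alpha by cross-validation over a log-spaced grid
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train, y_train)
rmse_cv = mean_squared_error(y_test, ridge_cv.predict(X_test)) ** 0.5
```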
The improved RidgeCV model performs similarly to the more basic Simple Linear Regression.
Attempt a LassoCV
Next, we try to understand how a LassoCV might do in terms of accuracy:
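The LassoCV code is not shown here; a minimal sketch on synthetic stand-in data would be the following. The L1 penalty drives uninformative coefficients to exactly zero, which can hurt accuracy when most features carry signal:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 100 + 20 * X[:, 0] + rng.normal(scale=3, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# LassoCV picks the L1 penalty strength by cross-validation
lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
rmse = mean_squared_error(y_test, lasso.predict(X_test)) ** 0.5

# Count how many coefficients the L1 penalty zeroed out entirely
n_dropped = int((lasso.coef_ == 0).sum())
```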
Obviously, the model above is untenable and not a worthy candidate for something like a sales forecast—especially compared to the other models we have tried.
Elastic Net Models:
Elastic Net models attempt to combine the benefits of both the Lasso and Ridge models. A few lines of code will tell us what the performance looks like:
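A sketch of the Elastic Net fit on synthetic stand-in data might look like this. The `l1_ratio` parameter sweeps between ridge-style (near 0) and lasso-style (near 1) penalties, and `ElasticNetCV` picks both it and the penalty strength by cross-validation:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 100 + 20 * X[:, 0] + rng.normal(scale=3, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Cross-validate over a few mixes of L1 and L2 penalty
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5,
                    random_state=0).fit(X_train, y_train)
rmse = mean_squared_error(y_test, enet.predict(X_test)) ** 0.5
```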
Conclusions:
ElasticNet is a pretty happy middle ground between Ridge and Lasso, but it still does not perform nearly as well as Linear Regression or RidgeCV
Model training time was vastly quicker on Linear Regression and RidgeCV—this might be an important consideration in a production implementation