Sales Forecasting: Predict Your Sales Cycle Using Machine Learning
Business Context:
There is no shortage of methods to forecast sales. To demonstrate one of those methods, we look back at the Olist data from Kaggle. The methodology we focus on analyzes the sales cycle to predict how long a sales lead might take to close. So, not only will we be able to predict whether a lead will close, but also how long closing the deal might take.
The benefits of sales forecasting are pretty straightforward:
Improved financial planning
More precise workload balancing at each level of the organization
Better insights into velocity or growth
It needs to be said that this type of sales forecasting might not work for every business model, or for every data model, for that matter.
In this post, the following will be covered:
Feature Engineering
Data Quality Improvements
Testing various models
Libraries & Reading in the Data:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import date
from datetime import datetime
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from tensorflow import keras
closed_deals = pd.read_csv('olist_closed_deals_dataset.csv')
olist_leads = pd.read_csv('olist_marketing_qualified_leads_dataset.csv')
Next, we merge the funnel and the qualified leads datasets:
funnel = pd.merge(olist_leads, closed_deals, how='left', on='mql_id')
Data Cleaning and Feature Engineering
In this section, we focus on some initial cleaning and feature engineering on the dataset. One thing to note about this investigation: not many of the leads were actually closed, so we built some simulated data to increase the amount of data the model could be trained on.
funnel['won_date'] = funnel['won_date'].astype('datetime64[ns]')
funnel['first_contact_date'] = funnel['first_contact_date'].astype('datetime64[ns]')
# Drop dimensions that likely will not be reliably collected within the business process
funnel.drop(['declared_monthly_revenue','declared_product_catalog_size','average_stock',
             'has_company','seller_id','mql_id'], axis=1, inplace=True)
An unfortunate deficiency of the Olist data is that there is no reliable source of revenue data per lead. While we can still complete the task, the case study would be closer to real-world if there were more samples with a richer context. Next, a copy of the dataframe is created along with some time-based features:
funnel_model = funnel.copy(deep=True)
funnel_model['contact_day'] = funnel_model['first_contact_date'].dt.strftime('%d')
funnel_model['contact_month'] = funnel_model['first_contact_date'].dt.strftime('%m')
funnel_model['contact_year'] = funnel_model['first_contact_date'].dt.year
Intuitively, the contact date information should be a useful predictor of the length of deal closure. These features will be especially important if there is any seasonality present. Since we only have about a year's worth of data in the set, it would be tough to make a judgment on seasonality.
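Still, as a quick optional check, monthly lead counts can at least be eyeballed for obvious seasonal patterns:

# Count first contacts per month to eyeball seasonality
funnel_model.groupby('contact_month')['contact_day'].count().plot(kind='bar')
plt.ylabel('Number of leads')
plt.show()

The most important part of the data prep is addressing the NAs that exist: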
# Count missing values (NAs) per column
missing_count = pd.DataFrame(funnel_model.isna().sum(), columns=['Number'])
missing_count['Percentage'] = round(missing_count['Number'] / len(funnel_model), 2) * 100
missing_count
The next block of code addresses the NAs in the records for the dimensions above. Quite simply, a list of unique values is pulled from each dimension, and those values are then applied at random wherever the record is NA. This methodology was used to preserve as much of each column's native distribution as possible, while also providing more data to train on.
# Fill NAs in each categorical column by sampling at random from that
# column's observed (non-null) unique values
cat_cols = ['origin', 'sdr_id', 'sr_id', 'business_segment', 'lead_type',
            'lead_behaviour_profile', 'has_gtin', 'business_type']
for col in cat_cols:
    observed = [x for x in funnel_model[col].unique() if str(x) != 'nan']
    funnel_model[col] = funnel_model[col].fillna(
        pd.Series(np.random.choice(observed, size=len(funnel_model.index))))

# won_date gets the same treatment; any NaT values that are re-sampled
# are dropped in the cleanup step further down
dt_list = funnel_model['won_date'].unique()
funnel_model['won_date'] = funnel_model['won_date'].fillna(
    pd.Series(np.random.choice(dt_list, size=len(funnel_model.index))))
The next block of code finishes off the feature engineering and data quality improvements:
funnel_model['won_day'] = funnel_model['won_date'].dt.strftime('%d')
funnel_model['won_month'] = funnel_model['won_date'].dt.strftime('%m')
funnel_model['won_year'] = funnel_model['won_date'].dt.year
funnel_model['time_to_close'] = funnel_model['won_date'] - funnel_model['first_contact_date']
funnel_model['time_to_close'] = funnel_model['time_to_close'].dt.days

# We can now create a combined feature of SDR and SR
funnel_model['sdr_sr'] = funnel_model['sdr_id'] + funnel_model['sr_id']
This code should fill our NA fields with data that is representative of some reality—which should be sufficient to demonstrate the use case effectively. A few more lines of code to clean up the dataset:
# Drop rows that contain suspect data, such as a negative close time
indexNames = funnel_model[funnel_model['time_to_close'] < 0].index
funnel_model.drop(indexNames, inplace=True)
funnel_model = funnel_model.dropna(subset=['won_date'])
funnel_model = funnel_model.drop(['won_date','first_contact_date'], axis=1)
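A quick, optional sanity check confirms the fills and drops behaved as expected:

# Confirm no NAs remain and inspect the cleaned dataset's shape
print(funnel_model.isna().sum().sum())
print(funnel_model.shape)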
Model Development, Train/Test/Split, & Defining X,y
df = funnel_model.copy(deep=True)

# Define the X, y variables
X = pd.get_dummies(df.drop('time_to_close', axis=1), drop_first=True)
y = df['time_to_close']

# Always split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

from sklearn.preprocessing import OneHotEncoder

# Prepare input data; handle_unknown='ignore' guards against categories
# that appear in the test set but not in training
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc

X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
In the code above, we one-hot encode the categorical variables. This is necessary because the algorithms we are going to use require a numeric (here, binary) array as input. I have found that it is best to split the data and THEN one-hot encode 'X': fitting the encoder on the training split alone keeps information about the test set from leaking into training.
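As a quick check, we can see how many columns the encoding produced; this feature count drives the training times discussed further below:

# Inspect the encoded feature count (rows, columns)
print(X_train_enc.shape)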
Support Vector Regression Is Up First
First, support vector regression is going to be used to predict the sales cycle time. We had good accuracy with this algorithm in our classification exercise.
from sklearn.svm import SVR, LinearSVR

base_model = SVR()
base_model.fit(X_train_enc, y_train)
base_preds = base_model.predict(X_test_enc)

from sklearn.metrics import mean_absolute_error, mean_squared_error
mean_absolute_error(y_test, base_preds)
>> 73.15358191010293
np.sqrt(mean_squared_error(y_test, base_preds))
>> 93.54240378007681
y_test.mean()
>> 112.10120240480961
Some interesting things to note here in this code:
The scale of the measured error is the same as that of the target variable. So, in this case, the SVR model is about 73 days off. Not ALL the predictions are off by 73 days, but on average a prediction can be inaccurate by 73 days, which is not good!
Pay attention to the mean of y_test though: 112 days. The dataset itself has quite a bit of variation in the time to close. The fact that our average error is well below the mean of y_test is actually a positive sign; see the quick baseline check below.
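One way to ground that claim is to compare against a naive model that always predicts the training mean; if the SVR's MAE beats this baseline, it is at least learning something:

# Naive baseline: always predict the mean time_to_close from training
naive_preds = np.full(len(y_test), y_train.mean())
print(mean_absolute_error(y_test, naive_preds))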
Given the promise shown by this model, we can try to tune the hyperparameters to improve accuracy:
param_grid = {'C': [0.001, 0.01, 0.1, 0.5, 1],
              'kernel': ['linear', 'rbf', 'poly'],
              'gamma': ['scale', 'auto'],
              'degree': [2, 3, 4],
              'epsilon': [0, 0.01, 0.1, 0.5, 1, 2]}

from sklearn.model_selection import GridSearchCV
svr = SVR()
grid = GridSearchCV(svr, param_grid=param_grid, cv=3, n_jobs=-1)
grid.fit(X_train_enc, y_train)

grid.best_estimator_
>> SVR(C=1, degree=2, epsilon=2, kernel='linear')
grid.best_params_
>> {'C': 1, 'degree': 2, 'epsilon': 2, 'gamma': 'scale', 'kernel': 'linear'}
print(grid.best_score_)
>> 0.9000862164349251

# Predict via the fitted grid object, which uses the refit best estimator
preds = grid.predict(X_test_enc)

from sklearn.metrics import mean_absolute_error, mean_squared_error
MAE = mean_absolute_error(y_test, preds)
MSE = mean_squared_error(y_test, preds)
RMSE = np.sqrt(MSE)
print(MAE)
>> 9.492129595872589
print(MSE)
>> 853.6142931876532
print(RMSE)
>> 29.21667833939466
Note here:
GridSearchCV is very slow given the number of features in the dataset (there are over 1,500 after encoding). We should probably use some sort of dimensionality reduction to cut the training time and make the process more efficient; a sketch of one option follows this list. SVR with GridSearchCV takes several hours to run in this investigation, and that may not be tenable for some applications
GridSearchCV should be used smartly: the more values that get added to the parameter grid, the greater the training time
Setting n_jobs to -1 helps training time and makes better use of your machine's resources
Large outliers in the data create difficulty in making accurate predictions, hence the terrible mean squared error.
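Below is a minimal sketch of both speed-ups, assuming the X_train_enc/X_test_enc matrices from above. TruncatedSVD works directly on the sparse one-hot matrix, and RandomizedSearchCV (already imported at the top) samples the parameter space instead of exhausting it. The n_components and n_iter values are illustrative guesses, not tuned choices:

# Reduce the ~1,500 one-hot columns to a smaller dense representation
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=101)
X_train_red = svd.fit_transform(X_train_enc)
X_test_red = svd.transform(X_test_enc)
print(svd.explained_variance_ratio_.sum())  # how much signal survives

# Sample 20 parameter combinations instead of searching the full grid
random_search = RandomizedSearchCV(SVR(), param_distributions=param_grid,
                                   n_iter=20, cv=3, n_jobs=-1, random_state=101)
random_search.fit(X_train_red, y_train)
print(random_search.best_params_)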
Try a Simple Linear Regression:
After the long training time of the support vector regression and its less-than-wonderful performance, maybe a simple linear regression will be more effective:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_enc, y_train)

# We only pass in the test features;
# the model predicts its own y hat,
# which we can then compare to the true y_test label values
test_predictions = model.predict(X_test_enc)

MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)
print(MAE)
>> 0.057113830282917256
print(MSE)
>> 3.253882131975636
print(RMSE)
>> 1.803852026075209
The linear regression model seems to perform the best so far. An MAE under a day and an RMSE under two days is accurate enough to be serviceable for sales forecasting. Still, it is worth seeing whether we can further fine-tune the results via ridge regression.
Ridge Regression Model Training:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train_enc, y_train)
test_predictions = ridge_model.predict(X_test_enc)

MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)
print(MAE)
>> 7.6568575927885645
print(MSE)
>> 147.21235108694535
print(RMSE)
>> 12.133109703903008
The ridge model is not as performant! Using it as a baseline, we can use RidgeCV to see whether it is possible to improve the results. The RidgeCV model can be set up as such:
from sklearn.linear_model import RidgeCV

# Choosing a scoring metric: https://scikit-learn.org/stable/modules/model_evaluation.html
# Negative RMSE so all metrics follow the convention "higher is better"
# See all options: sklearn.metrics.SCORERS.keys()
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_root_mean_squared_error')

# The more alpha options you pass, the longer this will take;
# fortunately our dataset is still pretty small
ridge_cv_model.fit(X_train_enc, y_train)
ridge_cv_model.alpha_
>> 0.1

test_predictions = ridge_cv_model.predict(X_test_enc)
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)
print(MAE)
>> 0.22657748641706255
print(MSE)
>> 3.4295154423291936
print(RMSE)
>> 1.8518950948499198
The cross-validated RidgeCV model performs similarly to the more basic simple linear regression.
Attempt a LassoCV
Next, we try to understand how a LassoCV might do in terms of accuracy:
from sklearn.linear_model import LassoCV
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html

lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5)
lasso_cv_model.fit(X_train_enc, y_train)
lasso_cv_model.alpha_
>> 3.0222656025281607

test_predictions = lasso_cv_model.predict(X_test_enc)
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)
print(MAE)
>> 47.70224782216077
print(MSE)
>> 3692.4097816866115
print(RMSE)
>> 60.76520206241901
Clearly, the model above is not a worthy candidate for something like a sales forecast, especially compared to the other models we have tried.
Elastic Net Models:
Elastic net models attempt to keep the benefits of both the lasso and ridge models. A few lines of code will tell us what the performance result might be:
from sklearn.linear_model import ElasticNetCV

elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], tol=0.01)
elastic_model.fit(X_train_enc, y_train)
test_predictions = elastic_model.predict(X_test_enc)

MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)
print(MAE)
>> 4.714205461929477
print(MSE)
>> 66.5452147436821
print(RMSE)
>> 8.157525037882635
Conclusions:
ElasticNet is a pretty happy middle ground between Ridge and Lasso, but still doesn’t perform nearly as well as Linear Regression or RidgeCV
Model training was vastly quicker for linear regression and RidgeCV, which might be an important consideration in a production implementation; see the recap table below
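For reference, here are the error metrics reported above, side by side (in days):

Model               MAE     RMSE
SVR (default)       73.15   93.54
SVR (grid-tuned)    9.49    29.22
Linear Regression   0.06    1.80
Ridge (alpha=10)    7.66    12.13
RidgeCV             0.23    1.85
LassoCV             47.70   60.77
ElasticNetCV        4.71    8.16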