Sales Forecasting: Predict Your Sales Cycle Using Machine Learning

Business Context:

There is no shortage of methods to forecast sales. To demonstrate one of them, we look back at the Olist data from Kaggle. The methodology we will focus on analyzes the sales cycle to predict how long a sales lead might take to close. So, not only will we be able to predict whether a lead will close, but also how long closing the deal might take.

The benefits of sales forecasting are pretty straightforward:

  • Improved financial planning

  • More precise workload balancing at each level of the organization

  • Better insights into velocity or growth

It needs to be said that this type of sales forecasting might not work for every business model, or for every data model, for that matter.

In this post, the following will be covered:

  • Feature Engineering

  • Data Quality Improvements

  • Testing various models

Libraries & Reading in the Data:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

closed_deals = pd.read_csv('olist_closed_deals_dataset.csv')
olist_leads = pd.read_csv('olist_marketing_qualified_leads_dataset.csv')

Next, we merge the qualified-leads and closed-deals datasets into a single funnel dataframe:

funnel = pd.merge(olist_leads,closed_deals,how='left',on='mql_id')
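
Because this is a left join on mql_id, leads that never closed will simply have NAs in the closed-deals columns. As a quick sanity check (a minimal sketch, not in the original post), we can see what fraction of leads actually closed:

# fraction of marketing-qualified leads that became closed deals
print(funnel['won_date'].notna().mean())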

Data Cleaning and Feature Engineering

In the next section, we focus on some initial cleaning and feature engineering on the dataset. One thing to note about this investigation: relatively few of the leads actually closed, so we build some simulated data to increase the amount of data the model can be trained on.

# parse the date columns
funnel['won_date'] = pd.to_datetime(funnel['won_date'])
funnel['first_contact_date'] = pd.to_datetime(funnel['first_contact_date'])
# drop dimensions that a real business process is unlikely to collect reliably
funnel.drop(['declared_monthly_revenue','declared_product_catalog_size','average_stock','has_company','seller_id','mql_id'],axis=1,inplace=True)

An unfortunate deficiency of the Olist data is that there is no reliable source of revenue data per lead. While we can still complete the task, the case study would be closer to the real world if there were more samples with a richer context. Next, we create a copy of the dataframe and build some time-based features:

funnel_model = funnel.copy(deep=True)

# day and month are kept as zero-padded strings so they will be treated as
# categoricals when we one-hot encode; year stays numeric
funnel_model['contact_day'] = funnel_model['first_contact_date'].dt.strftime('%d')
funnel_model['contact_month'] = funnel_model['first_contact_date'].dt.strftime('%m')
funnel_model['contact_year'] = funnel_model['first_contact_date'].dt.year

Intuitively, the contact-date information should be predictive of how long a deal takes to close. These features will be especially important if any seasonality is present, although with only about a year's worth of data in the set it is tough to judge seasonality with confidence.
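
As a rough first look (a minimal sketch, not part of the original analysis, reusing the contact_month column created above), we can plot lead volume by contact month:

# plot lead volume by contact month as a rough seasonality check
sns.countplot(x='contact_month', data=funnel_model.sort_values('contact_month'))
plt.title('Leads by contact month')
plt.show()

The most important part of the data prep, though, is addressing the NAs that exist: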

#count missing values (NAs)
missing_count = pd.DataFrame(funnel_model.isna().sum(),columns=['Number'])
missing_count['Percentage'] = round(missing_count / len(funnel_model),2) * 100
missing_count
[Output: table of NA counts and percentages for each column]

The next block of code addresses the NAs in the dimensions above. Quite simply, we pull the list of unique non-null values for each dimension and fill each NA with a random draw from that list. The goal is to keep the filled values plausible while giving the model more rows to train on. (Strictly speaking, drawing from the unique values spreads the fills uniformly across categories; drawing from the observed values would preserve the empirical distribution more closely.)

# fill NAs in each categorical dimension with random draws from that
# column's observed (non-null) unique values
cat_cols = ['origin','sdr_id','sr_id','business_segment','lead_type',
            'lead_behaviour_profile','has_gtin','business_type']
for col in cat_cols:
    values = funnel_model[col].dropna().unique()
    funnel_model[col] = funnel_model[col].fillna(
        pd.Series(np.random.choice(values, size=len(funnel_model.index))))

# same idea for the won_date timestamp, excluding NaT from the candidates
dt_list = funnel_model['won_date'].dropna().unique()
funnel_model['won_date'] = funnel_model['won_date'].fillna(
    pd.Series(np.random.choice(dt_list, size=len(funnel_model.index))))
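
A quick re-check (a hypothetical verification step) confirms that the fills removed the NAs:

# confirm the NA counts are now zero
funnel_model.isna().sum()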

The next block of code finishes off the feature engineering and data quality improvements:

funnel_model['won_day'] = funnel_model['won_date'].dt.strftime('%d')
funnel_model['won_month'] = funnel_model['won_date'].dt.strftime('%m')
funnel_model['won_year'] = funnel_model['won_date'].dt.year

funnel_model['time_to_close'] = funnel_model['won_date'] - funnel_model['first_contact_date']
funnel_model['time_to_close'] = funnel_model['time_to_close'].dt.days

# combine the SDR and SR ids into a single interaction feature
funnel_model['sdr_sr'] = funnel_model['sdr_id'] + funnel_model['sr_id']

This code fills our NA fields with data that is representative of some plausible reality, which should be sufficient to demonstrate the use case. A few more lines of code clean up the dataset:

# drop rows with already-suspect data, e.g. a negative close time
indexNames = funnel_model[funnel_model['time_to_close'] < 0].index
funnel_model.drop(indexNames, inplace=True)

funnel_model = funnel_model.dropna(subset=['won_date'])

funnel_model = funnel_model.drop(['won_date','first_contact_date'],axis=1)

Model Development, Train/Test Split, & Defining X, y

df = funnel_model.copy(deep=True)

#define the X, y variables
X = pd.get_dummies(df.drop('time_to_close',axis=1),drop_first=True)
y = df['time_to_close']

#always split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

from sklearn.preprocessing import OneHotEncoder

# prepare input data: fit the encoder on the training split only, then
# apply it to both splits
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc

X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

In the code above, we one-hot encode the categorical variables. This is necessary because the algorithms we are going to use require numeric (binary) arrays as inputs. I have found that it is best to split the data first and THEN one-hot encode X: fitting the encoder on the training split only keeps information about the test set from leaking into training.
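
To see why handle_unknown='ignore' matters, here is a toy illustration (a sketch with made-up rows, borrowing a few origin values from the Olist data): a category that appears only in the test split gets encoded as all zeros instead of raising an error.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'origin': ['organic_search', 'paid_search']})
test = pd.DataFrame({'origin': ['social']})  # category unseen during fit

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)
print(ohe.transform(test).toarray())  # [[0. 0.]] -- all zeros, no error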

Support Vector Regression Is Up First:

First, support vector regression will be used to predict the sales-cycle time. We had good accuracy with this algorithm in our earlier classification exercise.

from sklearn.svm import SVR
base_model = SVR()
base_model.fit(X_train_enc,y_train)
base_preds = base_model.predict(X_test_enc)
from sklearn.metrics import mean_absolute_error,mean_squared_error
mean_absolute_error(y_test,base_preds)
>> 73.15358191010293
np.sqrt(mean_squared_error(y_test,base_preds))
>> 93.54240378007681
y_test.mean()
>> 112.10120240480961

Some interesting things to note here in this code:

  • The measured error is on the same scale as the target variable, so in this case the SVR model is off by about 73 days on average. Not ALL the predictions are off by 73 days, but a mean absolute error of 73 days is not good!

  • Pay attention to the mean of y_test, though: 112 days. The dataset has quite a bit of variation in time to close, and the fact that our average error is well below the mean close time means the model at least beats a trivial guess (a quick baseline check follows this list)
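
To make that baseline comparison concrete, here is a quick check (not in the original analysis) against a naive model that always predicts the training-set mean:

from sklearn.metrics import mean_absolute_error

# naive baseline: predict the mean close time from the training set for every lead
baseline_preds = np.full(len(y_test), y_train.mean())
print(mean_absolute_error(y_test, baseline_preds))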

Given that glimmer of promise, we can tune the hyperparameters to try to improve accuracy:

param_grid = {'C':[0.001,0.01,0.1,0.5,1],
             'kernel':['linear','rbf','poly'],
              'gamma':['scale','auto'],
              'degree':[2,3,4],
              'epsilon':[0,0.01,0.1,0.5,1,2]}

from sklearn.model_selection import GridSearchCV
svr = SVR()
grid = GridSearchCV(svr,param_grid=param_grid,cv=3, n_jobs=-1)
grid.fit(X_train_enc,y_train)
>> SVR(C=1, degree=2, epsilon=2, kernel='linear')
grid.best_params_
>> {'C': 1, 'degree': 2, 'epsilon': 2, 'gamma': 'scale', 'kernel': 'linear'}
print(grid.best_score_)
>> 0.9000862164349251
# GridSearchCV refits the best estimator on the full training set, so predict through it
preds = grid.predict(X_test_enc)

from sklearn.metrics import mean_absolute_error,mean_squared_error
MAE = mean_absolute_error(y_test,preds)
MSE = mean_squared_error(y_test,preds)
RMSE = np.sqrt(MSE)
print(MAE)
>> 9.492129595872589
print(MSE)
>> 853.6142931876532
print(RMSE)
>> 29.21667833939466

Note here:

  • GridSearchCV is very slow given the number of features in the dataset (there are over 1,500). Some form of dimensionality reduction would cut the training time and make the process more efficient (a sketch follows this list). SVR with GridSearchCV took several hours to run in this investigation, which may not be tenable for some applications

  • GridSearchCV should be used smartly: every value added to the parameter grid multiplies the number of fits and therefore the training time

  • Setting n_jobs to -1 parallelizes the search across all available cores, which helps the training time

  • Large outliers in the data make accurate predictions difficult, which is why the RMSE (about 29 days) is so much larger than the MAE (about 9.5 days).
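
One option for the dimensionality reduction mentioned above is truncated SVD, which works directly on the sparse one-hot matrix (a sketch only; n_components=100 is an arbitrary starting point that would need tuning):

from sklearn.decomposition import TruncatedSVD

# compress the ~1,500 sparse one-hot features before fitting the SVR
svd = TruncatedSVD(n_components=100, random_state=101)
X_train_red = svd.fit_transform(X_train_enc)
X_test_red = svd.transform(X_test_enc)
print(X_train_red.shape)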

Try a Simple Linear Regression:

After the long training time of the support vector regression and its less-than-wonderful performance, maybe a simple linear regression will be more effective:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_enc,y_train)
# We only pass in test features
# The model predicts its own y hat
# We can then compare these results to the true y test label value
test_predictions = model.predict(X_test_enc)

MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 0.057113830282917256
print(MSE)
>> 3.253882131975636
print(RMSE)
>> 1.803852026075209

The linear regression model performs the best so far; an RMSE of under two days is accurate enough to be serviceable for sales forecasting. Still, it is worth seeing whether ridge regression can fine-tune the results further.

Ridge Regression Model Training:
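
For context, ridge regression adds an L2 penalty on the coefficient sizes to the ordinary least-squares objective, with alpha controlling the penalty strength (this is the standard scikit-learn formulation):

minimize ||Xw - y||^2 + alpha * ||w||^2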

from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train_enc,y_train)
test_predictions = ridge_model.predict(X_test_enc)

MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 7.6568575927885645
print(MSE)
>> 147.21235108694535
print(RMSE)
>> 12.133109703903008

The ridge model is not as performant! Using it as a baseline, we can use RidgeCV to see whether cross-validating over several alphas improves the results. The RidgeCV model can be set up as such:

# Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html
# Negative RMSE so all metrics follow convention "Higher is better"

# See all options: sklearn.metrics.SCORERS.keys()
from sklearn.linear_model import RidgeCV
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0),scoring='neg_root_mean_squared_error')

# The more alpha options you pass, the longer this will take.
# Fortunately our data set is still pretty small
ridge_cv_model.fit(X_train_enc,y_train)

ridge_cv_model.alpha_
>> 0.1

test_predictions = ridge_cv_model.predict(X_test_enc)

MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 0.22657748641706255
print(MSE)
>> 3.4295154423291936
print(RMSE)
>> 1.8518950948499198

The cross-validated RidgeCV model performs similarly to the more basic simple linear regression.

Attempt a LassoCV:

Next, we try to understand how a LassoCV might do in terms of accuracy:

from sklearn.linear_model import LassoCV
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
lasso_cv_model = LassoCV(eps=0.1,n_alphas=100,cv=5)

lasso_cv_model.fit(X_train_enc,y_train)

lasso_cv_model.alpha_
>> 3.0222656025281607

test_predictions = lasso_cv_model.predict(X_test_enc)
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 47.70224782216077
print(MSE)
>> 3692.4097816866115
print(RMSE)
>> 60.76520206241901

Clearly, the lasso model above is untenable and not a worthy candidate for something like a sales forecast, especially compared to the other models we have tried.
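
One likely culprit (a quick diagnostic, not in the original post): lasso's L1 penalty drives coefficients to exactly zero, so an aggressive alpha can discard most of the one-hot features outright.

# count how many features the lasso actually kept
n_nonzero = (lasso_cv_model.coef_ != 0).sum()
print(n_nonzero, 'of', len(lasso_cv_model.coef_), 'coefficients are non-zero')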

Elastic Net Models:

Elastic net models attempt to combine the benefits of both the lasso and ridge penalties; the l1_ratio parameter controls the mix, with 1.0 being pure lasso. A few lines of code will tell us how it performs:

from sklearn.linear_model import ElasticNetCV
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7,.9, .95, .99, 1],tol=0.01)
elastic_model.fit(X_train_enc,y_train)
test_predictions = elastic_model.predict(X_test_enc)
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 4.714205461929477
print(MSE)
>> 66.5452147436821
print(RMSE)
>> 8.157525037882635

Conclusions:

  • ElasticNet is a pretty happy middle ground between Ridge and Lasso, but still doesn’t perform nearly as well as Linear Regression or RidgeCV

  • Model training time was vastly quicker on Linear Regression and RidgeCV—this might be an important consideration in a production implementation
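
For reference, here are the error metrics reported above, side by side:

Model                        MAE (days)   RMSE (days)
SVR (default)                73.15        93.54
SVR (GridSearchCV-tuned)      9.49        29.22
Linear Regression             0.06         1.80
Ridge (alpha=10)              7.66        12.13
RidgeCV                       0.23         1.85
LassoCV                      47.70        60.77
ElasticNetCV                  4.71         8.16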

Tyler Betthauser, Conaxon