Predicting Which Sales Leads Close, Part 1
Setting the Stage:
Consider for a moment that your business is doing quite well. Sales are climbing quickly, the sales funnel is full, and customer service is top notch. But, because you are a responsible leader and manager, the future still appears somewhat hazy. Growth is wonderful, to be sure. However, growth can only scale as well as the sales team, and by extension, the rest of your operations. At some point the sales funnel will become exceedingly top-heavy, and business leaders will have to decide: do we hire more team members to support the increased demand, or do we lean out somewhat to preserve margin, customer service, and specialization?
A tough, but highly personal choice.
I am willing to bet a great many businesses would choose to lean out, maintain margins, and continue to develop productive salespeople. There are a few pillars critical to projects where efficiency is the desired output, but perhaps none more critical than tools. Having a diverse toolbox is essential, and data analytics and machine learning are quickly becoming essential tools in that toolbox.
Proposed Solution:
All that said, I propose that machine learning could be used to predict which sales leads are likely to close, giving salespeople insight into which leads in the funnel should be prioritized first. Secondarily, the algorithm could be used as a tool for identifying sales opportunities NOT being closed that may be critical now or in the future.
Business Context:
In order to demonstrate the capability of machine learning to address the aforementioned use case, we searched for a public dataset on which to run tests. The team landed on a Kaggle dataset posted by a company called Olist, the largest department store in Brazilian marketplaces (link: https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv). This is a marketing funnel dataset from sellers that filled out a form requesting to sell their products on the Olist Store. Olist connects small businesses from all over Brazil to sales channels, without hassle and with a single contract. Merchants are able to sell their products through the Olist Store and ship them directly to customers using Olist’s supply chain partners.
The sales process is as follows:
Sign-up at a landing page
A Sales Development Representative (SDR) contacts the lead, collects some information, and schedules an additional consultancy
The consultancy is conducted by a Sales Representative (SR), who may or may not close the deal
Lead becomes a seller and starts building their catalog on Olist
The products are published on Olist marketplaces and ready to sell!
The Dataset:
The dataset has information on 8,000 Marketing Qualified Leads (MQLs) that requested contact. These MQLs were randomly sampled from a larger set of MQLs.
The algorithm will use the data from the qualified leads dataset and the closed deals dataset. A future project might be demand/sales forecasting using the sellers dataset and the order items dataset.
Jumping into the Data:
When testing, I like to use Jupyter Lab. I find it supremely easy to work with, and it lends itself to quick, agile iteration. First, we will import the libraries we will be using:
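The original cell is not reproduced here, but a minimal set of imports for this kind of walkthrough would look something like this:

```python
# Core data wrangling and plotting stack used throughout the analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```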
Now, we are going to read in the data for the analysis:
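Assuming the two CSVs from the Kaggle dataset sit next to the notebook, reading them in might look like:

```python
# File names as published in the Olist marketing funnel dataset on Kaggle
leads = pd.read_csv('olist_marketing_qualified_leads_dataset.csv')
closed = pd.read_csv('olist_closed_deals_dataset.csv')
```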
Next, we are going to combine the qualified leads and the closed deals to create a single dataset for generating predictions. The documentation on Kaggle is really good, so there is no mystery as to how the join needs to be performed:
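Per the Kaggle documentation, the two tables share the mql_id key, so a left join along these lines keeps every lead while attaching deal details where they exist:

```python
# Left join: every MQL is kept; leads that never closed get NaNs in the deal columns
funnel = leads.merge(closed, on='mql_id', how='left')
```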
The next lines of code add some potentially useful features and remove others that won’t be useful in the prediction model. The code is simple and self-explanatory. Initially, time-to-close was thought to be useful for prediction, but at the time of writing those features were not used in the model; time-to-close is likely more important as a business analysis task than as a prediction task. The code is left in this report for reference anyway:
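A sketch of that feature engineering, with the set of dropped columns being my best guess at the sparse dimensions in the closed deals table:

```python
# Parse the date columns so features can be derived from them
funnel['first_contact_date'] = pd.to_datetime(funnel['first_contact_date'])
funnel['won_date'] = pd.to_datetime(funnel['won_date'])

# Time-to-close in days (kept for reference; not used in the model)
funnel['time_to_close'] = (funnel['won_date'] - funnel['first_contact_date']).dt.days

# Drop sparsely populated dimensions (assumed set) that cannot be reliably imputed
funnel = funnel.drop(columns=['has_company', 'has_gtin', 'average_stock',
                              'declared_product_catalog_size', 'declared_monthly_revenue'])
```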
It is important to note that many of these dimensions were dropped because there was very little data to begin with. There are instances of imputation later in the project that could have been applied to these dropped dimensions; however, there was too little data to reliably impute from as a baseline.
The next line of code defines what we are going to end up trying to predict—a binary TRUE or FALSE classification:
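This is the line referenced again later in the report:

```python
# A lead counts as closed if and only if it has a won_date
funnel['closed_deal'] = funnel['won_date'].notnull()
```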
This particular project did not attempt to understand time-to-close, but that question could easily be revisited at a later time.
Each project typically starts with some basic exploratory data analysis. I want to have a decent understanding of the spread in time-to-close, which SRs and SDRs are closing most often, and which features might carry the most importance in a prediction. Let’s start with a basic understanding of which landing pages seem to close the most deals:
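Something along these lines (column names per the Kaggle schema) surfaces the top-converting pages:

```python
# Closed deals per landing page, highest first
print(funnel[funnel['closed_deal']]['landing_page_id'].value_counts().head(10))
```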
Next, it might be interesting to see who the most effective SRs are:
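The same value_counts pattern works here, assuming the sr_id column from the closed deals table:

```python
# Closed deals per Sales Representative
print(funnel[funnel['closed_deal']]['sr_id'].value_counts().head(10))
```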
Next, we look at the most effective SDRs:
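And once more for the SDRs, assuming the sdr_id column:

```python
# Closed deals per Sales Development Representative
print(funnel[funnel['closed_deal']]['sdr_id'].value_counts().head(10))
```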
We’ll finish off the exploratory data analysis with a cursory understanding of how long it takes to close a deal based on various features in the dataset. Again, this part of the business analysis is not strictly pertinent but potentially useful knowledge for further development. The business might find it useful in the future to have a prediction of when a deal might close—thereby allowing some ability to better understand potential revenue.
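A rough cut at that, assuming the time_to_close column computed earlier:

```python
# Distribution of days from first contact to close, among won deals only
won = funnel[funnel['closed_deal']]
won['time_to_close'].plot.hist(bins=30)
plt.xlabel('days to close')
plt.title('Time-to-close distribution')
plt.show()

# Median time-to-close by lead origin channel
print(won.groupby('origin')['time_to_close'].median().sort_values())
```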
Overall, the data analysis wasn’t too conclusive but gave decent exposure to some of the intricacies of the dataset. You’ll notice that in many areas the data is quite sparse, without many samples from which to develop a robust model. It would be advisable, as in most instances, to acquire more data for testing and for fine-tuning hyperparameters.
After the brief data analysis, we can begin to further clean and develop the features that are going to be used in the prediction:
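A sketch of that cleanup, mirroring the notes that follow (seller_id is my addition, on the assumption that identifiers were dropped together):

```python
# Break the first contact date into components; the raw date itself won't generalize
funnel['contact_day'] = funnel['first_contact_date'].dt.day
funnel['contact_month'] = funnel['first_contact_date'].dt.month
funnel['contact_year'] = funnel['first_contact_date'].dt.year

# Drop fields that will not serve as predictors
funnel = funnel.drop(columns=['time_to_close', 'won_date', 'first_contact_date',
                              'mql_id', 'seller_id'])

# Remove duplicate rows so bias is less likely
funnel = funnel.drop_duplicates()
```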
There are a few things to note with the preceding code:
Extract the contact day, month, and year because the date alone is not going to be a useful predictor
Drop the time to close (for the time being) as it will not be used in the initial model
Drop the date on which the contract was won. Neither the date itself nor its components will be useful
Drop the first contact date as it will not be a useful predictor itself
Drop the unique qualified lead id (mql_id) because it is not useful as a predictor
Drop any duplicates in the dataset so we ensure that bias is less likely
The following code addresses a particularly thorny problem from an architectural standpoint. Combining the closed deals and leads datasets produced a significant number of ‘na’ or ‘nan’ values. In order to properly demonstrate the use case, those values need to be addressed through imputation. In this investigation, we are going to assume that if the won date is null, then the contract has been lost, giving us a population of leads not won and leads that have been won (remember the line of code above: funnel['closed_deal'] = funnel['won_date'].notnull()). There is no other identifier for a lead in progress or otherwise.
A simple line of code to determine the number of ‘na’ and ‘nan’ records is the following:
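In pandas terms:

```python
# Missing values per column
print(funnel.isna().sum())
```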
Given that there were so many missing feature values, imputation should be sufficient to demonstrate how to properly fill the gaps in the data model. Conceptually, the imputation employed was simple: a function collects the unique values of a feature into a list, and another line of code randomly chooses values from that list to fill the ‘na’ or ‘nan’ entries within the dimension:
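A minimal sketch of that idea (random_impute and the column list are illustrative, not the author’s exact code):

```python
import numpy as np

def random_impute(df, column):
    # Gather the observed (non-null) values for this feature...
    observed = df[column].dropna().unique()
    # ...and fill each missing entry with a random draw from those values
    mask = df[column].isna()
    df.loc[mask, column] = np.random.choice(observed, size=mask.sum())
    return df

# Apply to the categorical dimensions that still carry NaNs (assumed list)
for col in ['origin', 'business_segment', 'lead_type', 'lead_behaviour_profile',
            'business_type', 'sdr_id', 'sr_id']:
    funnel = random_impute(funnel, col)
```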
After filling the ‘nan’ and ‘na’ values, the SDR id and SR id are combined to build a feature that uses the pairing to predict closure:
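For instance (the feature name sdr_sr_combo is illustrative):

```python
# Concatenate the two ids into a single categorical feature
funnel['sdr_sr_combo'] = funnel['sdr_id'].astype(str) + '_' + funnel['sr_id'].astype(str)
```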
If we re-run the code for determining the ‘na’ or ‘nan’ values, there should not be any left. You’ll notice there is a single record still left ‘nan’, so I drop it:
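```python
# Re-check for missing values, then drop the lone remaining record
print(funnel.isna().sum())
funnel = funnel.dropna()
```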
At this point, the data model should be set properly. First, we select the features to be used in the model. The next step is to encode the categorical data; for this implementation, one-hot encoding was used, as ordinal encoding did not seem appropriate:
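Roughly, with the selected feature lists being my assumption:

```python
# Select the model features, then one-hot encode the categoricals
categorical = ['origin', 'landing_page_id', 'business_segment', 'lead_type',
               'lead_behaviour_profile', 'business_type', 'sdr_sr_combo']
numeric = ['contact_day', 'contact_month', 'contact_year']

encoded = pd.get_dummies(funnel[categorical + numeric + ['closed_deal']],
                         columns=categorical)
```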
It was a long road to get here, but the following code applies to model development. A Support Vector Classifier and Decision Trees from scikit-learn will be used initially for the prediction. We first define X and y, ‘X’ being the features and ‘y’ being the values we are trying to predict:
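```python
# X holds the predictors; y is the closed-deal flag we are trying to predict
X = encoded.drop(columns=['closed_deal'])
y = encoded['closed_deal']
```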
After defining X and y, we complete the train/test split. I have set the test size to be a bit smaller and the training set to be larger; due to the imbalanced classes, the hope is that more closed deals will end up in the training set to learn from:
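```python
from sklearn.model_selection import train_test_split

# Exact split proportion and seed are assumptions; the point is a smaller test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```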
Define the model and fit the model to the training data:
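A sketch of the first baseline, assuming a plain decision tree with default hyperparameters:

```python
from sklearn.tree import DecisionTreeClassifier

# First baseline model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
```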
The model is insanely simple, almost comically so given the complex nature of the functions being performed. However, it is much easier to create baselines with simple models so that hyperparameters can be effectively tuned. Next, we will build our predictions:
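```python
# Predict on the held-out test set
y_pred = tree_model.predict(X_test)
```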
After the predictions have been made, it is easy enough to determine accuracy through a confusion matrix and classification report:
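```python
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```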
OOF! The results of this model reflect the terribly imbalanced classes that exist. To no surprise, the model is very accurate at predicting which leads won’t close and terrible at predicting the leads that will. It would be very unwise to rely on the overall accuracy score of 0.89 (89%), given the model’s biased preference for predicting that a lead will not close. All that said, we can try to see whether a very basic Support Vector Classifier will be a more balanced model.
Feature Importances:
Using Grid Search, an optimized Support Vector Classifier was built to determine the best parameters possible for the basic model:
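The parameter grid below is illustrative; the original grid is not reproduced in this report:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': ['scale', 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'linear']}

# 5-fold cross-validated search over the grid
grid = GridSearchCV(SVC(), param_grid, cv=5, verbose=1)
grid.fit(X_train, y_train)
```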
After the Grid Search is complete, we can find the best parameters for the model to baseline:
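```python
# Best parameter combination and its cross-validated accuracy
print(grid.best_params_)
print(grid.best_score_)
```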
Not a terrible score! But, as noted above, we need to break this accuracy number down into manageable components. We can do this by first building the simple model using the parameters found in the Grid Search:
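```python
# Second model: an SVC built from the grid search's best parameters
svc_model = SVC(**grid.best_params_)
svc_model.fit(X_train, y_train)
```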
After the model has been trained, we can now see the accuracy for the second model we built:
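```python
# Predictions and per-class accuracy for the tuned SVC
svc_pred = svc_model.predict(X_test)
print(classification_report(y_test, svc_pred))
```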
Based on the classification report above, the Support Vector Classifier performs significantly better than the tree-based method, especially at predicting the leads that will eventually close. There is nearly a 30% increase in prediction accuracy on True while maintaining a high level of accuracy on False. A confusion matrix helps further contextualize accuracy:
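```python
print(confusion_matrix(y_test, svc_pred))
```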
Conclusion:
Initial modeling seems to indicate that the Support Vector Classifiers are better predictors than the Tree Based methods
Accuracy, especially for this use case, needs to be balanced across the True and False classes, particularly if the data will continue to be imbalanced in the future
There are opportunities to leverage other features that were part of the original dataset, but had poor data quality. Assuming the data quality could be improved, there would be increased opportunities to improve business outcomes
Opportunities & Future Work:
The imbalanced nature of the dataset should be addressed through oversampling, class weighting, or gradient-boosted algorithms
Prediction of the time to close will likely be another worthwhile venture, especially when attempting to predict future sales, prioritizing lines of business, and even resource planning
Thorough discussion would need to occur between those who would use something like this in their process and those who designed the algorithm itself; industrialization would need to be done with care and monitored closely over time