
Predict which Sales Leads Close Part 2

Introduction

When we last left off on this project, the model we chose was not particularly good at predicting one of the classes: Closed Lead versus Open Lead. In this article, a few different methods are employed to overcome the challenges of imbalanced classes, encode all of the categorical variables, and tune the hyper-parameters.

Gradient-Boosted / Ensemble Algorithms Might Help

First, we will import GradientBoostingClassifier from sklearn. CatBoost and XGBoost were considered for this investigation, but they are more complicated to set up and tune; sklearn is more familiar and easier to understand. This is not to say that CatBoost and XGBoost are not good solutions; in the literature, they perform very well.

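A minimal sketch of the imports used throughout this part might look like this:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.metrics import classification_report
```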

Since the feature engineering and data cleaning have already been completed, we can create a dataframe containing all of the features we want in the model:

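As a sketch, with placeholder column names standing in for the real engineered features (the `cleaned_df` name and the columns below are illustrative, not the actual ones from the cleaning step):

```python
# Placeholder names; the real columns come from the feature engineering done previously.
feature_cols = ['lead_source', 'industry', 'region', 'lead_owner', 'product_interest']
target_col = 'status'  # 'Closed' vs. 'Open'

model_df = cleaned_df[feature_cols + [target_col]].copy()
```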

After creating the dataframe that serves as the backbone of the model, we are going to apply the first potential fix for the imbalanced classes: over-sampling. Over-sampling means duplicating records from the minority class. Because the algorithm treats each record as a unique instance, duplicated records are not a problem and help us synthetically enlarge the dataset. Since the minority class is closed leads, we will over-sample that class:

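One way to do this with pandas, using the placeholder 'status' column from the sketch above, is to append two extra copies of the closed-lead rows:

```python
# Grab the minority class (closed leads) and append two extra copies,
# so each closed lead appears three times in total.
closed = model_df[model_df['status'] == 'Closed']
model_df = pd.concat([model_df, closed, closed], ignore_index=True)
```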

This small piece of code triples the number of closed-deal records. Something to note: over- or under-sampling should not be the first line of defence. There are other process-based fixes that should be tried first, such as:

  • Find more data to build predictions on

  • Investigate whether bias is being introduced during data collection, and correct or reduce it

  • Gain domain knowledge about the who, what, where, why, and how of the operations that generate this data, and correct the prediction methodology where appropriate

Next, functions will be prepared to one-hot encode the features and label encode the labels:

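A sketch of those helpers, using sklearn's OneHotEncoder and LabelEncoder:

```python
def one_hot_encode(train_features, test_features):
    # Fit on the training set only, then transform both sets.
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # use sparse=False on older sklearn
    encoder.fit(train_features)
    return encoder.transform(train_features), encoder.transform(test_features)


def label_encode(train_labels, test_labels):
    # Same idea for the target: fit on the training labels, transform both sets.
    encoder = LabelEncoder()
    encoder.fit(train_labels)
    return encoder.transform(train_labels), encoder.transform(test_labels)
```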

Notice what is done inside the encoding functions: the encoders are fit on the training set only and then used to transform both the training and test sets. Next, define X and y:

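Something along these lines:

```python
X = model_df.drop(columns=[target_col])
y = model_df[target_col]
```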

X is everything but the target value; y is the target value. As always, perform your train/test split. You will notice that the ‘stratify’ parameter has been added to the split. The ‘stratify’ parameter is a great tool for imbalanced datasets because it preserves the class proportions of y in both the train and test sets, ensuring that one set does not monopolize the minority class:

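A sketch of the split; the test size and random state are illustrative choices:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```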

Sklearn has useful one-hot and label encoding functions. After splitting the data, we can actually use our data preparation functions:

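Using the helper functions sketched earlier:

```python
X_train_enc, X_test_enc = one_hot_encode(X_train, X_test)
y_train_enc, y_test_enc = label_encode(y_train, y_test)
```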

The very basic model can be instantiated and fitted to the encoded training data:

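For example:

```python
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train_enc, y_train_enc)

# Evaluate the baseline on the held-out test set.
print(classification_report(y_test_enc, gbc.predict(X_test_enc)))
```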

The basic model does OK, but not as well as the even more basic support vector classifier from the last article. With a baseline established, hyper-parameter tuning can be completed using GridSearchCV:

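A sketch of the search; the parameter grid shown here is illustrative, not the exact grid from the original run:

```python
# Illustrative grid of common GradientBoostingClassifier parameters.
param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1,
)
grid.fit(X_train_enc, y_train_enc)
```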

After running GridSearchCV, the best parameters and scores can be collected:

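For example:

```python
print(grid.best_params_)
print(grid.best_score_)
```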

Strangely enough, the outputs were worse than the baseline. If I am honest, I have no idea exactly why this is occurring. An initial hypothesis is that we trained the grid on the training set and not the whole set, or that the grid does not include all of the defaults of GradientBoostingClassifier. Somewhat discouraged, I moved on to a TensorFlow Sequential model (an artificial neural network), which I thought was more fun to play with anyway.

Artificial Neural Network Application—Much Better

Since we already defined how we were going to prep the data for the model, that code will not be restated below. Development of the model comes first:

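A sketch of the architecture described in the notes below; the dropout rates, L2 regularization strength, and early-stopping patience are illustrative values, not the exact ones from the original run:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    # Input layer: roughly one node per one-hot encoded feature, plus dropout.
    Dense(1350, activation='relu', input_shape=(X_train_enc.shape[1],)),
    Dropout(0.5),

    # First hidden layer: nodes halved, L2 regularization added, dropout halved.
    Dense(675, activation='relu', kernel_regularizer=l2(0.01)),
    Dropout(0.25),

    # Final hidden layer: everything halved again except the regularization.
    Dense(338, activation='relu', kernel_regularizer=l2(0.01)),
    Dropout(0.125),

    # Single sigmoid output node for the binary classification problem.
    Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
```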

A few notes here:

  • We are going to use four layers here since the model is fairly complex and has a lot of features (over 1,300, thanks to one-hot encoding)

    • The input layer will have 1,350 nodes; generally, this can be set to roughly the number of features or columns. A dropout layer has been added to randomly drop nodes during training and reduce overfitting

    • Two hidden layers have been added. I have halved the nodes and added a regularizer to manage overfitting. The dropout rate is halved as well

    • ReLU activation is used in the earlier layers because of a general consensus that it is flexible and works well in practice. If we wanted to, we could tune these choices later

    • In the final hidden layer, all of the values except the regularization are halved again to continue simplifying the model

    • Finally, the output layer is a single sigmoid node because this is a binary classification problem

  • We will add early stopping to ensure we do not overfit

  • The loss function is binary_crossentropy, which is appropriate for a binary classification problem

  • The optimizer is Adam, which is highly flexible and generally works quite well

  • This initial model was set up somewhat arbitrarily and should be tuned if it is to be used in some sort of production application

Next, we can fit the model:

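A sketch of the fit call, assuming the encoded test set doubles as the validation set here:

```python
history = model.fit(
    X_train_enc, y_train_enc,
    validation_data=(X_test_enc, y_test_enc),
    epochs=1000,
    callbacks=[early_stop],
)
```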

1,000 epochs is going to be overkill, but early stopping will ensure we never get anywhere close to that.

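One way to check how many epochs actually ran before early stopping kicked in:

```python
print(f"Training stopped after {len(history.history['loss'])} epochs")
```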

Learning in 15 epochs is pretty good! Next, we can show the losses:

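A sketch of the loss plot, using the history object returned by fit:

```python
import matplotlib.pyplot as plt

pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
```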

I am pretty happy with the losses in the chart. The validation loss is not perfect, and it is quite ‘chunky’, but the scale is quite small. I believe this behavior can occur when adding dropout layers. With additional hyper-parameter tuning, the gap between training and validation loss could be reduced further.

Finally, we can get the predictions and determine performance:

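A sketch of the prediction step; the sigmoid outputs are thresholded at 0.5 to get class labels:

```python
# Sigmoid outputs are probabilities; threshold at 0.5 to get class labels.
y_pred = (model.predict(X_test_enc) > 0.5).astype(int).ravel()

print(classification_report(y_test_enc, y_pred))
```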

Cool! The performance is quite good, and without a ton of time spent training. By far, the neural network model gives the bigger bang for the buck.

Next time, we will fine-tune the parameters for the neural network.