Predict which Sales Leads Close Part 2

Introduction

When we last left off on this project, the models we chose weren't particularly good at predicting one of the two classes: Closed Lead versus Open Lead. In this article, a few different methods are employed to tackle the challenges of imbalanced classes, encoding all of the categorical variables, and hyper-parameter tuning.

Gradient-Boosted / Ensemble Algorithms Might Help

Firstly, we will import the GradientBoostingClassifier from sklearn. CatBoost and XGBoost were considered for this investigation, but they are a little more complicated to implement; sklearn felt more familiar and easier to understand and tune. This is not to say that CatBoost and XGBoost are not good solutions—in the literature, they perform very well!

from sklearn.ensemble import GradientBoostingClassifier

Since the feature engineering and data cleaning have already been completed, we can create a dataframe that includes all of the features we want to include in the model:

df5 = funnel_model[['landing_page_id', 'origin', 'sdr_id','sr_id','business_segment',
                   'lead_type','lead_behaviour_profile','has_gtin','business_type',
                  'contact_day','contact_month','contact_year','sdr_sr','closed_deal']].copy()

After creating the dataframe as the backbone of the model, we are going to employ the first potential fix for the imbalanced classes. The option we chose to use is called over-sampling. Over-sampling is when you duplicate records from the minority class. Because the algorithm treats each record as a unique instance, duplicated records aren’t a problem and help us synthetically enhance our dataset. Since the minority class is a closed lead, we will over-sample this class:

closed_dup = df5['closed_deal'] == True
df_try = df5[closed_dup]
# DataFrame.append was removed in recent pandas; pd.concat does the same duplication
df5 = pd.concat([df5] + [df_try]*3, ignore_index=True)
df5.shape
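To see how much this duplication shifts the class balance, a quick check like the one below (my addition, not part of the original notebook) can be run before and after the duplication step:

# Sanity check: share of each class in the target column
print(df5['closed_deal'].value_counts(normalize=True))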

The over-sampling code above adds three extra copies of each closed-deal record. Something to note: over/under-sampling should not be the first line of defence. There are a number of other process-based fixes that should be tried first, such as:

  • Find more data to build predictions on

  • Investigate whether bias is being introduced into the data collection in some way, and correct or reduce that bias

  • Gain some domain knowledge on the who, what, where, why, and how of the operations that generate this data, and correct the prediction methodology where appropriate

Next, functions will be prepared to one-hot encode the features and label encode the labels:

# prepare input data
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc
 
# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

Notice what is done inside the encoding functions: the encoders are fit on the training set only, then used to transform both the training and test sets, which keeps information from the test set from leaking into the encoders. Next, define the X and y:

X = df5.drop('closed_deal',axis=1)
y = df5['closed_deal']

X is everything but the target value. y is the target value. As always, perform your train/test split. You will notice that the 'stratify' parameter has been added to the train/test split. The 'stratify' parameter is a great tool for imbalanced datasets because it preserves the class proportions of y in both the train and test splits, ensuring that one split does not monopolize the minority class:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101, stratify=y)
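To confirm the stratification behaves as intended, a quick check like this (my addition, not from the original post) compares the class balance in each split; the proportions of True and False should match closely:

# Class proportions in the train and test splits should be nearly identical
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))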

Sklearn has useful one-hot and label encoding functions. After splitting the data, we can actually use our data preparation functions:

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

The very basic model can be instantiated and fitted to the encoded X_train and y_train:

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()

model.fit(X_train_enc, y_train_enc)

y_pred = model.predict(X_test_enc)

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_enc, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc

>> 0.8874727398312781
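One small caveat worth noting (my observation, not part of the original analysis): roc_curve is being fed hard 0/1 predictions here. ROC AUC is usually computed from predicted probabilities so the decision threshold can be swept properly. A sketch of that variant, reusing the model and splits above:

# Sketch: AUC from the predicted probability of the positive class
y_prob = model.predict_proba(X_test_enc)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test_enc, y_prob)
print(auc(fpr, tpr))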

The basic model does OK, but not as well as the even more basic SupportVectorClassifier. With a baseline established, hyper-parameter tuning can be completed using GridSearchCV:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer
import numpy as np

# A sample parameter grid

parameters = {
    "loss":["deviance"],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.5],
    "min_samples_split": np.linspace(0.1, 1.0, 5),
    "min_samples_leaf": np.linspace(0.1, 0.5, 5,endpoint=True),
    "min_weight_fraction_leaf": np.linspace(0.1, 1.0, 10),
    "max_depth":[3,5,8],
    "max_features":["log2","sqrt"],
    "criterion": ["friedman_mse"],
    "subsample":[0.8, 0.9, 0.95, 1.0],
    "n_estimators":[10]
    }
# run the grid search with 3-fold cross-validation (default scoring: accuracy)
grid = GridSearchCV(GradientBoostingClassifier(verbose=2), parameters, cv=3, n_jobs=-1)

grid.fit(X_train_enc,y_train_enc)
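Incidentally, make_scorer, precision_score, and recall_score are imported above but never used, so the grid falls back to the default accuracy scorer. If we wanted the search to optimize for recall instead (a sketch of mine, not what the original run did), the scorer could be passed like this:

# Sketch: optimize the grid for recall rather than the default accuracy
recall_scorer = make_scorer(recall_score)
grid_recall = GridSearchCV(GradientBoostingClassifier(verbose=2), parameters,
                           cv=3, n_jobs=-1, scoring=recall_scorer)
# grid_recall.fit(X_train_enc, y_train_enc) would then run exactly as above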

After the grid search completes, the best parameters and score can be collected:

print(grid.best_score_)
print(grid.best_params_)

>> 0.7847310912445011
{'criterion': 'friedman_mse', 'learning_rate': 0.5, 'loss': 'deviance', 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 0.1, 'min_samples_split': 0.55, 'min_weight_fraction_leaf': 0.1, 'n_estimators': 10, 'subsample': 0.9}

Strangely enough, the output looks worse than the baseline. If I am honest, I am not certain why. One caveat is that grid.best_score_ is a mean cross-validation accuracy on the training folds, not the ROC AUC we computed on the test set, so the two numbers are not directly comparable. An initial hypothesis is that we trained the grid on the training set and not the whole set. Or, the grid does not cover all of the GradientBoostingClassifier defaults; n_estimators, for instance, is fixed at 10 here while the default is 100. Somewhat discouraged, I moved on to a TensorFlow Sequential model (an artificial neural network), which I thought was more fun to play with anyway.
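For anyone who wants to chase this down, one quick check (a sketch of mine, not something done in the original run) is to score the refitted best estimator on the same held-out test set and metric as the baseline:

# Sketch: evaluate grid.best_estimator_ with the same ROC AUC as the baseline
best_pred = grid.best_estimator_.predict(X_test_enc)
fpr, tpr, thresholds = roc_curve(y_test_enc, best_pred)
print(auc(fpr, tpr))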

Artificial Neural Network Application—Much Better

Since we already covered how the data is prepped for the model, that code will not be restated below. Development of the model comes first:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2

early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)

model = Sequential()

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

model.add(Dense(units=1350,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(units=676,activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.25))

model.add(Dense(units=338,activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.125))

model.add(Dense(units=1,activation='sigmoid'))

# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')

A few notes here:

  • We are going to use 4 layers here since the model is fairly complex and has a lot of features: over 1,300, thanks to one-hot encoding

    • The input layer will have 1,350 nodes; generally, this can be set to the number of features or columns (see the sketch after these notes). A dropout layer has been added, which randomly disables a fraction of nodes during training to reduce overfitting

    • Two hidden layers have been added. I have halved the number of nodes and added a regularizer to manage overfitting. The dropout rate is halved as well

    • ReLU activation is used in the first layers because it is a flexible, general-purpose default. If we wanted to tune these choices we could later

    • In the final hidden layer all the values, except for regularization, are halved to continue to simplify the model

    • Finally, the output layer will be a single sigmoid node due to our problem being a binary classification problem

  • We will be adding early stopping to ensure we do not overfit

  • Our loss function is going to use binary_crossentropy since this is a binary classification problem and this loss function should be appropriate

  • The optimizer to be used is the adam optimizer—highly flexible and generally works quite well

  • This initial model was set up somewhat arbitrarily and should be tuned if it is to be used in some sort of production application
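As the sketch below shows (my addition, not code from the original notebook), the width of that first layer does not have to be hard-coded; it can be read straight off the encoded training data:

# Sketch: tie the first layer's width to the one-hot encoded feature count
n_features = X_train_enc.shape[1]
print(n_features)  # a bit over 1,300 columns after one-hot encoding
# model.add(Dense(units=n_features, activation='relu')) would then track the data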

Next, we can fit the model:

model.fit(x=X_train_enc, 
          y=y_train_enc, 
          epochs=1000,
          validation_data=(X_test_enc, y_test_enc), verbose=1,
          callbacks=[early_stop]
          )

1000 epochs is going to be overkill, but the early stopping will ensure we never get close to 1000 epochs.

Epoch 1/1000
305/305 [==============================] - 8s 26ms/step - loss: 0.5662 - val_loss: 0.2128
Epoch 2/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.1535 - val_loss: 0.1206
Epoch 3/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0809 - val_loss: 0.1024
Epoch 4/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0564 - val_loss: 0.1522
Epoch 5/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0412 - val_loss: 0.0698
Epoch 6/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0316 - val_loss: 0.0833
Epoch 7/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0317 - val_loss: 0.0997
Epoch 8/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0246 - val_loss: 0.1217
Epoch 9/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0187 - val_loss: 0.0735
Epoch 10/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0160 - val_loss: 0.1116
Epoch 11/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0166 - val_loss: 0.0773
Epoch 12/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0307 - val_loss: 0.1051
Epoch 13/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0242 - val_loss: 0.1281
Epoch 14/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0201 - val_loss: 0.1222
Epoch 15/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0155 - val_loss: 0.0852
Epoch 00015: early stopping

Learning in 15 epochs is pretty good! Next, we can show the losses:

model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
[Figure: training and validation loss by epoch]

I am pretty happy with the losses in the chart. The scale is quite small, even though the validation loss is not perfect; it is quite 'chunky'. I believe this kind of noisy validation loss is common when dropout layers are added. With additional hyper-parameter tuning, the gap between training and validation loss could be reduced further.
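One tweak that would likely help here (my suggestion; it is not in the callback defined earlier): EarlyStopping can restore the weights from the epoch with the lowest validation loss instead of keeping the weights from the final epoch before stopping:

# Sketch: keep the best-epoch weights when early stopping fires
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1,
                           patience=10, restore_best_weights=True)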

Finally, we can get the predictions and determine performance:

# predict_classes was removed in newer TensorFlow; threshold the sigmoid output instead
predictions = (model.predict(X_test_enc) > 0.5).astype("int32")

# https://en.wikipedia.org/wiki/Precision_and_recall
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

       False       1.00      0.97      0.98      1432
        True       0.96      1.00      0.98      1008

    accuracy                           0.98      2440
   macro avg       0.98      0.98      0.98      2440
weighted avg       0.98      0.98      0.98      2440

Cool! The performance is quite good, and without a ton of time spent training. By far, the neural network gives the bigger bang for the buck.
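For a bit more detail than the classification report (a small addition of mine, not in the original write-up), a confusion matrix shows exactly where the few misclassifications land:

# Rows are the true classes (False, True); columns are the predicted classes
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))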

Next time, we will fine-tune the parameters for the neural network.