Predict which Sales Leads Close Part 2
Introduction
When we left off on this project, the models we chose weren't particularly good at predicting one of the two classes: closed lead versus open lead. In this article, a few different methods are employed to overcome the challenge of imbalanced classes, encode all of the categorical variables, and tune the hyper-parameters.
Gradient-Boosted / Ensemble Algorithms Might Help
First, we will import the GradientBoostingClassifier from sklearn. CatBoost and XGBoost were considered for this investigation, but those algorithms are a little more complicated to implement and tune; sklearn felt more familiar and easier to understand. This is not to say that CatBoost and XGBoost are not good solutions; in the literature, they perform very well!
from sklearn.ensemble import GradientBoostingClassifier
Since the feature engineering and data cleaning have already been completed, we can create a dataframe containing all of the features we want in the model:
df5 = funnel_model[['landing_page_id', 'origin', 'sdr_id', 'sr_id',
                    'business_segment', 'lead_type', 'lead_behaviour_profile',
                    'has_gtin', 'business_type', 'contact_day', 'contact_month',
                    'contact_year', 'sdr_sr', 'closed_deal']].copy()
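Before touching the imbalance, it is worth confirming how skewed the target actually is. A minimal check (an illustrative snippet, assuming the df5 dataframe defined above) might look like this:

# rough sketch: inspect how imbalanced the target is before any over-sampling
print(df5['closed_deal'].value_counts())
print(df5['closed_deal'].value_counts(normalize=True))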
After creating the dataframe as the backbone of the model, we are going to employ the first potential fix for the imbalanced classes. The option we chose is called over-sampling: duplicating records from the minority class. Because the algorithm treats each record as a unique instance, the duplicated records aren't a problem and synthetically enlarge the minority class. Since the minority class is the closed lead, we will over-sample that class:
closed_dup = df5['closed_deal'] == True
df_try = df5[closed_dup]
df5 = df5.append([df_try]*3, ignore_index=True)
df5.shape
This small piece of code appends three extra copies of every closed-deal record. Something to note: over/under-sampling should not be the first line of defence. There are a number of process-based fixes that should be tried first, such as the following (an updated version of the over-sampling snippet appears after this list):
Finding more data to build predictions
Investigating whether bias is being introduced somewhere in the data collection, and correcting or reducing that bias
Gaining some domain knowledge on the who, what, where, why, and how of the operations that generate this data, and correcting the prediction methodology where appropriate
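As a side note on the over-sampling step above: DataFrame.append has been deprecated in recent pandas releases. A small sketch of the same step using pd.concat (an equivalent alternative, not the code used in this article) would be:

import pandas as pd

# same over-sampling step without DataFrame.append:
# concatenate three extra copies of the closed-deal records onto df5
minority = df5[df5['closed_deal'] == True]
df5 = pd.concat([df5] + [minority] * 3, ignore_index=True)
df5.shape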
Next, functions will be prepared to one-hot encode the features and label encode the labels:
# prepare input data
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc
Notice what is done inside the encoding functions: the encoders are fit on the training set only and then used to transform both the training and test sets. Next, define X and y:
X = df5.drop('closed_deal', axis=1)
y = df5['closed_deal']
X is everything but the target value; y is the target value. As always, perform your train/test split. You will notice that the stratify parameter has been added to the split. stratify is a great tool for imbalanced datasets because it preserves the class proportions of y in both the training and test sets, ensuring that one set does not monopolize the minority class:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101, stratify = y)
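To see what stratification buys you, a quick sanity check (an illustrative snippet, not from the original write-up) should show nearly identical class ratios on each side of the split:

# the class ratios should be (almost) identical in the training and
# test sets thanks to stratify=y
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))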
Sklearn has useful one-hot and label encoding functions. After splitting the data, we can actually use our data preparation functions:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
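The one-hot encoding blows the feature count up considerably, which matters when we size the neural network later. A quick check of the encoded shape (an illustrative snippet, not from the original article) makes that visible:

# the encoded matrix has far more columns than the original dataframe;
# this should line up with the "over 1,300 features" figure mentioned later
print(X_train_enc.shape)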
The very basic model can be instantiated and fitted to X_train_enc and y_train_enc:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train_enc, y_train_enc)
y_pred = model.predict(X_test_enc)

from sklearn.metrics import roc_curve, auc

# note: roc_curve is fed the hard 0/1 predictions here rather than predicted probabilities
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_enc, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc

>> 0.8874727398312781
The basic model does OK, but not as well as the even more basic support vector classifier from last time. With a baseline established, hyper-parameter tuning can be attempted using GridSearchCV:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

# a sample parameter grid
parameters = {
    "loss": ["deviance"],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.5],
    "min_samples_split": np.linspace(0.1, 1.0, 5),
    "min_samples_leaf": np.linspace(0.1, 0.5, 5, endpoint=True),
    "min_weight_fraction_leaf": np.linspace(0.1, 1.0, 10),
    "max_depth": [3, 5, 8],
    "max_features": ["log2", "sqrt"],
    "criterion": ["friedman_mse"],
    "subsample": [0.8, 0.9, 0.95, 1.0],
    "n_estimators": [10]
}

# run the grid search (no custom scorer is passed, so GridSearchCV falls back to
# the classifier's default score, i.e. accuracy)
grid = GridSearchCV(GradientBoostingClassifier(verbose=2), parameters, cv=3, n_jobs=-1)
grid.fit(X_train_enc, y_train_enc)
After the GridSearchCV finishes, the best parameters and score can be collected:
print(grid.best_score_)
print(grid.best_params_)

>> 0.7847310912445011
>> {'criterion': 'friedman_mse', 'learning_rate': 0.5, 'loss': 'deviance',
    'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 0.1,
    'min_samples_split': 0.55, 'min_weight_fraction_leaf': 0.1,
    'n_estimators': 10, 'subsample': 0.9}
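One way to make the comparison with the baseline more direct (best_score_ is a mean cross-validation accuracy, while the baseline number above is a test-set ROC AUC) is to score the tuned estimator on the held-out test set. A hypothetical sketch, not part of the original notebook:

from sklearn.metrics import roc_curve, auc

# score the tuned model on the same held-out test set that the baseline
# ROC AUC was computed on, so the two numbers are comparable
best_model = grid.best_estimator_
y_pred_tuned = best_model.predict(X_test_enc)
fpr, tpr, _ = roc_curve(y_test_enc, y_pred_tuned)
print(auc(fpr, tpr))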
Strangely enough, the tuned results were worse than the baseline. If I am honest, I do not know exactly why this occurred. One initial hypothesis is that the grid was trained on the training set rather than the whole set; another is that the grid does not cover all of the GradientBoostingClassifier defaults (it fixes n_estimators at 10, for instance, while the default is 100). It is also worth remembering that best_score_ is a cross-validation accuracy rather than a test-set ROC AUC, so the two numbers are not strictly comparable. Somewhat discouraged, I moved on to a TensorFlow Sequential model (an artificial neural network), which I thought was more fun to play with anyway.
Artificial Neural Network Application—Much Better
Since we already defined how we were going to prep the data for the model, the code will not be restated below. Development of the model will come first:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2

early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)

model = Sequential()

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.add(Dense(units=1350, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(units=676, activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.25))

model.add(Dense(units=338, activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.125))

model.add(Dense(units=1, activation='sigmoid'))

# for a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')
A few notes here:
We are going to use four layers since the model is fairly complex and there are lots of features: over 1,300, thanks to the one-hot encoding (a quick sanity check after this list confirms the sizes)
The input layer has 1,350 nodes; generally, this can be set to roughly the number of features or columns. A dropout layer is added to randomly drop nodes during training and curb overfitting
Two hidden layers have been added. In each, the number of nodes is halved and an L2 regularizer is added to manage overfitting; the dropout rate is halved as well
ReLU activation is used in the hidden layers because of a general consensus that ReLU is a flexible default; these choices could be tuned later if we wanted
In the final hidden layer, everything except the regularization is halved again to continue simplifying the model
Finally, the output layer is a single sigmoid node, since this is a binary classification problem
Early stopping is added to help ensure we do not overfit
Our loss function is going to use binary_crossentropy since this is a binary classification problem and this loss function should be appropriate
The optimizer to be used is the adam optimizer—highly flexible and generally works quite well
This initial model was set up somewhat arbitrarily and should be tuned if it is to be used in some sort of production application
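To confirm the layer sizes described above before training, one option (not in the original write-up) is to build the model against the encoded feature count and print a summary:

# hypothetical sanity check: build the model with the encoded feature count
# so summary() can report layer shapes and parameter counts before training
model.build(input_shape=(None, X_train_enc.shape[1]))
model.summary()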
Next, we can fit the model:
model.fit(x=X_train_enc,
          y=y_train_enc,
          epochs=1000,
          validation_data=(X_test_enc, y_test_enc),
          verbose=1,
          callbacks=[early_stop])
1000 epochs is going to be overkill, but the early stopping will ensure we never get close to 1000 epochs.
Epoch 1/1000
305/305 [==============================] - 8s 26ms/step - loss: 0.5662 - val_loss: 0.2128
Epoch 2/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.1535 - val_loss: 0.1206
Epoch 3/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0809 - val_loss: 0.1024
Epoch 4/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0564 - val_loss: 0.1522
Epoch 5/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0412 - val_loss: 0.0698
Epoch 6/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0316 - val_loss: 0.0833
Epoch 7/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0317 - val_loss: 0.0997
Epoch 8/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0246 - val_loss: 0.1217
Epoch 9/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0187 - val_loss: 0.0735
Epoch 10/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0160 - val_loss: 0.1116
Epoch 11/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0166 - val_loss: 0.0773
Epoch 12/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0307 - val_loss: 0.1051
Epoch 13/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0242 - val_loss: 0.1281
Epoch 14/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0201 - val_loss: 0.1222
Epoch 15/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0155 - val_loss: 0.0852
Epoch 00015: early stopping
Learning in 15 epochs is pretty good! Next, we can show the losses:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
I am pretty happy with the losses in the chart. Even though the validation loss is not perfect and is quite 'chunky', the scale of the losses is quite small. I believe this chunky behaviour is common when dropout layers are added. With additional hyper-parameter tuning, the gap between training and validation loss could be reduced further.
Finally, we can get the predictions and determine performance:
from sklearn.metrics import classification_report

predictions = model.predict_classes(X_test_enc)

# https://en.wikipedia.org/wiki/Precision_and_recall
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

       False       1.00      0.97      0.98      1432
        True       0.96      1.00      0.98      1008

    accuracy                           0.98      2440
   macro avg       0.98      0.98      0.98      2440
weighted avg       0.98      0.98      0.98      2440
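A small caveat: Sequential.predict_classes was deprecated and later removed in newer TensorFlow releases. If you hit an AttributeError, an equivalent sketch is to threshold the sigmoid output yourself:

# equivalent to the removed predict_classes for a single sigmoid output node:
# threshold the predicted probabilities at 0.5 and flatten to a 1-D label array
predictions = (model.predict(X_test_enc) > 0.5).astype(int).ravel()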
Cool! The performance is quite good without a ton of time spent training. By far, the neural network model gives the bigger bang for the buck.
Next time, we will fine-tune the parameters for the neural network.