A Standard Custom Interface for Common sklearn Classification Models


Background, initial motivation, and development data set

I wanted to apply to classification models the same methodology I used to create a customized interface for heatmap plots. As with the heatmaps, the process of instantiating and training models tends to repeat the same steps (with some modifications) over and over, and after training there is a similarly stereotyped set of things one tends to do with the model. The big difference is that instead of a one-time-use function for outputting a plot, it made more sense to me to implement these classifiers as custom Python objects—each essentially a wrapper for a particular scikit-learn classifier, with a more streamlined initialization procedure and some convenient methods added. For development purposes I initially focused on the RandomForestClassifier from sklearn.ensemble.

The MNIST data set of images of handwritten digits is a standard benchmark for basic classification algorithms. It consists of 60,000 training images and 10,000 test images, each of a handwritten digit from 0 to 9. Here are the first three hundred thirty-six images in the training set, stitched together for display:



Each image is 28x28 pixels, encoded as an array of grayscale values, one per pixel, ranging from 0 (white) to 255 (black). These properties of the data set allow it to serve as a paradigmatic example of a broad type of machine learning classification problem: given an input data point with a potentially very large number of numerical variables, output a single categorical label, chosen from one of a relatively small number of possibilities. For these data, a functional model will have 'learned', based on the training set, patterns of relationships among the seven hundred eighty-four pixel values that allow it to choose the most likely digit to assign to a given test image.
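
For readers who want to see that layout concretely, here's a minimal sketch of pulling MNIST into a dataframe with 784 pixel columns plus one label column. It uses the OpenML copy of the data set via sklearn.datasets.fetch_openml, which is just one convenient source, not necessarily the one used elsewhere in this post:

```python
from sklearn.datasets import fetch_openml

# as_frame=True returns a DataFrame of 784 pixel columns and a Series of labels
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=True)

df = X.copy()
df['digit'] = y              # append the category column as the last column
train_df = df.iloc[:60000]   # first 60,000 rows are the training split
test_df = df.iloc[60000:]    # remaining 10,000 rows are the test split

print(train_df.shape)        # (60000, 785): 784 pixel values plus the digit label
```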

Data preparation

In principle, a classifier can be trained on a data set with any number of scalar columns containing numerical values, and exactly one category column indicating class membership. Similarly to the behavior of my custom heatmap interface, calling the classifier requires a dataframe input, and allows for optional categ_col and scalar_cols arguments. The latter must be a list of column names in the dataframe, while the former is either a single column name, or a separate series or 1D array of categories, with an entry for each row of the dataframe. If categ_col is None, it's assumed that the last column of the dataframe is the category column. Unless scalar_cols has been specified, the names of all columns that haven't been designated the category column are then considered to be scalar columns.
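
A minimal sketch of how that column-resolution logic might look (illustrative only; the actual implementation lives in the ml_utils library linked below and may differ in detail):

```python
import pandas as pd

def resolve_columns(df, categ_col=None, scalar_cols=None):
    """Illustrative sketch of the column-resolution rules described above."""
    if categ_col is None:
        categ_col = df.columns[-1]          # default: last column holds the class labels
    if isinstance(categ_col, str):
        y = df[categ_col]                   # a single column name within the dataframe
        label_name = categ_col
    else:
        y = pd.Series(categ_col, index=df.index)   # a separate series/array, one entry per row
        label_name = None
    if scalar_cols is None:
        scalar_cols = [c for c in df.columns if c != label_name]   # everything else is a scalar input
    return df[scalar_cols], y

# Example: the last column is taken as the category column by default
df = pd.DataFrame({'x1': [0.1, 0.5], 'x2': [1.2, 0.7], 'label': ['a', 'b']})
X, y = resolve_columns(df)
print(list(X.columns), list(y))   # ['x1', 'x2'] ['a', 'b']
```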

In practice, for algorithm performance it's often necessary to condition the data to make sure there aren't columns whose values are significantly larger or smaller in magnitude than the others (this is more likely to be an issue when combining columns that have different units and different inherent numerical ranges, unlike this pixel data, but it's worth keeping in mind). The default behavior is to rescale each column independently: find its minimum and maximum values in the training data, subtract the minimum from each value in the column, and divide each value by the range between maximum and minimum, forcing each column to run between 0 and 1. For this example, I've explicitly shown the same operations the classifier would normally do invisibly. This behavior can be suppressed when initializing the classifier by setting the rescale parameter to False, which simply sets the subtracted minimum and the divisor for each column to 0 and 1.
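
The rescaling itself is ordinary min-max scaling. A sketch of the operation, with 'offset' and 'divisor' as purely illustrative names for the two stored values:

```python
import pandas as pd

# Fit the per-column parameters on the training data, then reapply the same
# parameters to any later data.
def fit_rescale_factors(X_train):
    offsets = X_train.min()
    divisors = (X_train.max() - offsets).replace(0, 1)   # guard against constant columns
    return {'offset': offsets, 'divisor': divisors}

def apply_rescale(X, factors):
    return (X - factors['offset']) / factors['divisor']

X_train = pd.DataFrame({'pixel_a': [0, 128, 255], 'pixel_b': [10, 20, 30]})
factors = fit_rescale_factors(X_train)
print(apply_rescale(X_train, factors))   # every column now runs from 0 to 1
# With rescale=False, the equivalent factors would be 0 and 1, leaving the data unchanged.
```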



Initial classifier attributes and methods

The rescale values for each column are saved in an attribute of the classifier, classifier.rescale_factors, so that when a new data set is fed to it, each column will be rescaled with the same parameters as the training set. The set of scalar input column names (classifier.scalar_cols) and the list of possible class labels (classifier.class_labels) are also saved. The column names must all be present in any test data so they can be fed into the model, and the class labels are used to determine the possible range of predictions. The training data and labels used to create the classifier are likewise stored in classifier.X_train and classifier.y_train, while the underlying trained sklearn model can be accessed as classifier.model.

Each classifier object also comes with nine explicitly defined methods that interact with this set of attributes and may modify them or create others. The classifier.predict method takes a dataframe of test data as its argument and uses the trained model to return a series containing the most likely label for each data point. When predict is called, the test set and the predictions are saved as attributes of the classifier, classifier.X_test and classifier.y_pred. If the method is called without an argument, it uses the saved value of X_test as input.

The classifier.predict_proba method has similar behavior. It returns a dataframe showing the probability of assigning each of the possible class labels to each of the test data points, and stores that result in classifier.y_pred_proba. Then classifier.y_pred is updated based on these probabilities—each test data point is assigned the label with the highest probability—so that the two attributes always agree with the latest input and with each other. For the same reason, if predict is called directly after y_pred_proba has been defined, that attribute is deleted, since it no longer corresponds to the new predictions.
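
The agreement between the two attributes comes down to a single row-wise argmax over the probability columns; a toy sketch of that step (not the library's own code):

```python
import pandas as pd

# y_pred_proba as a dataframe with one column per class label and one row per test point;
# y_pred is kept consistent with it by taking the most probable label in each row.
y_pred_proba = pd.DataFrame({'0': [0.7, 0.1, 0.2],
                             '1': [0.2, 0.8, 0.3],
                             '2': [0.1, 0.1, 0.5]})
y_pred = y_pred_proba.idxmax(axis=1)   # column name of the highest probability in each row
print(y_pred.tolist())                 # ['0', '1', '2']
```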



Confusion matrices

There are three methods that calculate a test data set's confusion matrix, the grid of counts of actual class assignments for each predicted assignment. Each saves the result as a dataframe in classifier.cm, and each will retrieve the confusion matrix from this attribute if called without input test data. If test data is supplied, it must include the true categories, either as a column of the test dataframe with the same name as the training category column or as a separate input parameter; these are then stored as classifier.y_test. Then classifier.predict is called with the test dataframe, updating classifier.X_test and classifier.y_pred in the process. It's only after constructing the confusion matrix from y_test and y_pred that the three methods diverge.

classifier.confusion_matrix simply returns the dataframe, while classifier.confusion_matrix_plot returns nothing but automatically creates a heatmap plot from the matrix. A filepath can be provided to save the image, or, in the absence of this savepath, plt.show is called. Finally, classifier.confusion_metrics calculates the test accuracy as the sum of the on-diagonal elements of the matrix divided by the total sum, and the error rate by subtracting the accuracy from 1. These two metrics are returned together as a dictionary, though not saved, since they can be so easily recalculated from classifier.cm.
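
Both metrics are simple functions of the matrix; a toy sketch of the calculation (the row/column orientation here follows the usual sklearn convention and doesn't affect the result):

```python
import numpy as np
import pandas as pd

# Toy 3x3 confusion matrix: true classes as rows, predicted classes as columns
cm = pd.DataFrame([[50, 2, 1],
                   [3, 45, 4],
                   [0, 5, 40]],
                  index=['0', '1', '2'], columns=['0', '1', '2'])

accuracy = np.trace(cm.values) / cm.values.sum()   # on-diagonal counts over total counts
error_rate = 1 - accuracy
print({'accuracy': accuracy, 'error rate': error_rate})   # 0.9 and 0.1
```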





Binary classifiers

Binary classifiers, those with only two class labels, have some slightly different behavior with respect to classifier.confusion_metrics. First of all, when a classifier is initialized with a data set containing only two classes, a classifier.pos_label attribute is created. The category label to consider positive can be specified by the user, or it will be automatically chosen as the second label found in the training category column. The positive label assignment is needed for the calculations specific to binary classifiers. These include four additional metrics—sensitivity, specificity, positive predictive value, and false positive rate—that are derived from the 2x2 confusion matrix and returned along with the accuracy and error rate.
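
For reference, a sketch of how those four metrics, plus the accuracy and error rate, fall out of the four cells of a 2x2 matrix (toy counts, not the library's internal code):

```python
# Toy 2x2 counts, with the positive class taken to be the second label
# (as in the default pos_label choice described above).
tn, fp = 80, 5    # true negatives, false positives
fn, tp = 10, 55   # false negatives, true positives

sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
ppv = tp / (tp + fp)                         # positive predictive value
fpr = fp / (fp + tn)                         # false positive rate = 1 - specificity
accuracy = (tp + tn) / (tn + fp + fn + tp)
error_rate = 1 - accuracy
print({'sensitivity': sensitivity, 'specificity': specificity,
       'positive predictive value': ppv, 'false positive rate': fpr,
       'accuracy': accuracy, 'error rate': error_rate})
```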



The classifier.roc_auc method calculates and returns the receiver operating characteristic (ROC) area under the curve. The ROC curve plots sensitivity against false positive rate for varying probability thresholds, so classifier.predict_proba is internally called on the supplied test data set, creating or updating the usual attributes discussed above. At the same time, a plot of the ROC curve is generated, either saved as an image to a specified destination savepath or presented via plt.show. The area is saved in the attribute classifier.roc_area, while the calculated points on the curve used for the plot are saved as a dataframe, classifier.roc_points. This dataframe has three columns: roc_points.fpr and roc_points.tpr are the horizontal and vertical coordinates of each point on the curve, while roc_points.threshold holds the probability threshold at which each pair of false positive rate and true positive rate was calculated. If called without a test data set, classifier.roc_auc doesn't make any new calculations, but returns the saved classifier.roc_area and uses the saved classifier.roc_points to make a plot. The method is available only for binary classifiers; if there are more or fewer than two classes it will do nothing and return nothing.
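
The same points and area can be reproduced with scikit-learn's own utilities; a sketch of the underlying calculation using sklearn.metrics.roc_curve and auc (shown here in place of the wrapper's internal code):

```python
import pandas as pd
from sklearn.metrics import roc_curve, auc

# True binary labels and predicted probabilities of the positive class for a toy test set
y_test = [0, 0, 1, 1, 0, 1, 1, 0]
pos_probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.5]

fpr, tpr, thresholds = roc_curve(y_test, pos_probs, pos_label=1)
roc_points = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'threshold': thresholds})
roc_area = auc(fpr, tpr)   # area under the curve traced out by the (fpr, tpr) points
print(roc_points)
print(roc_area)
```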



Saving and recalling a classifier

To take advantage of the portability gained by wrapping all the data and metrics associated with a model in a single object, I also added a save method, which leverages the joblib library. The method takes a destination filepath as its argument. After saving, a classifier can be recalled later in a new environment by calling random_forest_classifier with that filepath: when the argument is a string instead of a data set, the initialization routine interprets it as a filepath and reloads the saved classifier.
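
A sketch of the joblib round trip the save method relies on, demonstrated with a bare sklearn model rather than the wrapper itself:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Minimal illustration of serializing and reloading a whole Python object with joblib;
# the wrapper object, attributes and trained model included, is saved the same way.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit([[0, 0], [1, 1]], [0, 1])

joblib.dump(model, 'saved_model.joblib')       # write the whole object to disk
restored = joblib.load('saved_model.joblib')   # reload it later, even in a new session
print(restored.predict([[1, 1]]))              # [1]
```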

Because training and test data will likely have been saved separately, it may not be desirable for them to needlessly take up space by also being saved as part of the classifier. For convenience, the classifier.clear_data method will remove classifier.X_train, classifier.y_train, classifier.X_test, classifier.y_test, classifier.y_pred, and classifier.y_pred_proba. Any attribute can also be removed individually by name using the classifier.del_attr method.



Conclusion and other notes

After going through it once with sklearn.ensemble.RandomForestClassifier, it was easy to apply the same procedure to create wrappers for sklearn.linear_model.LogisticRegression, sklearn.ensemble.GradientBoostingClassifier, sklearn.svm.LinearSVC/sklearn.svm.SVC, and sklearn.linear_model.RidgeClassifier. As with random_forest_classifier, these classes—log_reg_classifier, grad_boost_classifier, svm_classifier, and ridge_reg_classifier—take training data as a mandatory argument when called. They also have optional arguments and defaults corresponding to each of the parameters explicitly listed in the docstrings of the sklearn models, to which they are passed, as well as the sample_weight parameter, which is passed to the fit method of the model along with the training data. This interface therefore makes it easy to loop through many model iterations, comparing performance as parameters are systematically changed within a single model type as well as across multiple types.
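
A hypothetical sketch of such a loop, reusing the train_df and test_df dataframes from the MNIST sketch near the top of this post (the import path, keyword names, and the 'accuracy' dictionary key are assumptions based on the descriptions above rather than verified against the library):

```python
# Hypothetical usage sketch of a parameter sweep within and across model types.
from ml_utils import grad_boost_classifier, random_forest_classifier

results = {}
for n in [50, 100, 200]:
    # n_estimators is one of the sklearn parameters passed straight through to the model
    clf = random_forest_classifier(train_df, categ_col='digit', n_estimators=n)
    results[('random forest', n)] = clf.confusion_metrics(test_df)['accuracy']

clf = grad_boost_classifier(train_df, categ_col='digit')
results[('gradient boosting', 'defaults')] = clf.confusion_metrics(test_df)['accuracy']

for settings, acc in results.items():
    print(settings, acc)
```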

The svm_classifier has a special Boolean linear argument, True by default, to choose whether to use sklearn.svm.LinearSVC or sklearn.svm.SVC as the model. The differences are explained in more detail in the scikit-learn documentation, but the upshot is that the linear version allows only hyperplanes as boundaries for the hyperspace regions corresponding to each category, while the general model allows for arbitrary boundary shapes. This makes the latter significantly more computationally intensive than the former, taking over seventy minutes to train on the MNIST set on my MacBook Pro with all other settings at their defaults. Of the others, only grad_boost_classifier takes anywhere close to as long, at around twenty-two minutes. The rest, including svm_classifier with linear set to True, take a minute or two at most.

Neither sklearn.linear_model.RidgeClassifier nor sklearn.svm.LinearSVC has a predict_proba method, so ridge_reg_classifier has its predict_proba method removed as well, while calling predict_proba on an svm_classifier using a linear model will raise an error. Since the roc_auc method is no longer able to leverage predict_proba to generate the ROC curve, in these special cases the decision_function method of the sklearn model is called instead, generating confidence scores that play the thresholding role that would otherwise have been filled by prediction probabilities in the ROC calculation process.
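
A sketch of that substitution using plain sklearn objects, just to show that decision_function scores can stand in for probabilities when building an ROC curve (this illustrates the mechanism, not the wrapper's exact code):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.svm import LinearSVC

# A small synthetic binary problem: LinearSVC has no predict_proba, so signed
# confidence scores from decision_function are used for thresholding instead.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LinearSVC(max_iter=10000).fit(X, y)

scores = model.decision_function(X)          # signed distance of each point from the hyperplane
fpr, tpr, thresholds = roc_curve(y, scores)  # roc_curve accepts any monotonic score, not just probabilities
print(auc(fpr, tpr))
```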

Links

ml_utils library including these and other useful objects and methods on github

A pared-down version of the previous, with only the classifiers

Documentation for each of the underlying sklearn classifiers: random forest, gradient boosted decision trees, logistic regression, ridge regression, linear support vector machine, general support vector machine

An overview of the random forest algorithm

An explanation of the math behind logistic regression

Simple guide to confusion matrix terminology

ROC curves and Area Under the Curve explained (video)