Note: This is the first in a series of posts that take a deeper dive into data drift detection. We explore why it is an important part of model monitoring and discuss the regimes and approaches to keep in mind. In this first part, we discuss drift in the context of tabular data and describe univariate and multivariate techniques for detecting it. In follow-on posts, we’ll dive into unstructured data, such as images and documents, and discuss how to build data drift detection systems in these more challenging regimes.
Part I: Multivariate Data Drift with Tabular Data
Monitoring the ongoing success of a machine learning model requires monitoring the data coming into that model. This means ensuring the data coming through today looks the way you expect it to look. Ultimately, you want to make sure the data looks typical: Does it look the same way it did when the model was first trained? If the data has changed significantly, your trained model is likely stale and producing inaccurate predictions. Whether you’re talking about tabular numeric data, image data, or NLP data, the data monitoring problem remains the same. In all cases, we have some sense of what the data ought to look like, and we want to alert when things go astray. In technical terminology, this is often referred to as out-of-distribution detection: We want to find when the data no longer adheres to the shape and distribution it had when the model was trained. There are many ways of thinking about data drift detection, and in this post, we’ll describe the benefits of a high-dimensional, multivariate approach.
A handy way to begin thinking about data drift detection is to measure the distributional similarity between the data coming through a model now and a reference for how the data is supposed to look, such as the training set. A great starting point is to look separately at each input variable to a model (and at the outputs as well). This so-called univariate drift approach can be implemented in many ways. Common choices include hypothesis tests, such as the Kolmogorov-Smirnov (KS) test and the chi-squared test, and the so-called f-divergences, such as the KL divergence and the JS divergence. All of these are typically applied in a univariate way, one input feature at a time (see Figure 1).
Figure 1: Comparing distributions. In this diagram, we examine a single input feature (Age) and look at the distribution of this variable at two time points: in the training data (green distribution), and in today’s production data (purple distribution). It is clear that the general shape of this distribution has changed quite a bit. This could lead to model inaccuracy.
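The univariate checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the "Age" samples are synthetic, the category counts are made up, and the helper names (ks_drift, js_divergence, chi2_drift) and the choice of 30 histogram bins are our own assumptions. Only the SciPy calls themselves are real library functions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2_contingency, ks_2samp

rng = np.random.default_rng(0)

# Hypothetical "Age" feature at two time points (synthetic, for illustration only).
age_train = rng.normal(loc=35.0, scale=8.0, size=5000)
age_prod = rng.normal(loc=45.0, scale=12.0, size=5000)


def ks_drift(reference, current):
    """Two-sample KS test for one continuous feature: (statistic, p-value)."""
    result = ks_2samp(reference, current)
    return result.statistic, result.pvalue


def js_divergence(reference, current, bins=30):
    """JS divergence (base 2, so between 0 and 1) over shared histogram bins."""
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    # SciPy returns the JS distance (a square root); square it for the divergence.
    return jensenshannon(p, q, base=2) ** 2


def chi2_drift(reference_counts, current_counts):
    """Chi-squared test of homogeneity for one categorical feature."""
    table = np.array([reference_counts, current_counts])
    chi2, pvalue, _, _ = chi2_contingency(table)
    return chi2, pvalue


ks_stat, ks_p = ks_drift(age_train, age_prod)
js = js_divergence(age_train, age_prod)
# Made-up category counts for a three-level categorical feature.
chi2_stat, chi2_p = chi2_drift([500, 300, 200], [350, 300, 350])

print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.3g}")
print(f"JS divergence={js:.3f}")
print(f"Chi-squared statistic={chi2_stat:.1f}, p-value={chi2_p:.3g}")
```

In practice, each check would be run once per feature, and the alerting thresholds would need to be tuned to the dataset and the batch size.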
A higher-dimensional variant of these measures can be calculated in theory, but such methods are ineffective in high-dimensional applications due to data sparsity. More importantly, by using univariate measures for drift detection, we make an implicit assumption of feature independence. While this might be approximately true in some cases, more generally our dataset is likely to have complex interactions between features and other significant structure. This can lead to missed events when we consider only one feature at a time. Therefore, we must consider the high-dimensional joint distribution of the data.
In a multivariate approach, we fit a multi-dimensional ancillary model to the full joint distribution of the training set. This ancillary model acts as a density model and learns the patterns and structure in the dataset. It can then quantify how typical or atypical any given datapoint is relative to the reference dataset. There are many potential implementations of this density model; examples include variational autoencoders, normalizing flows, isolation forests, and other density estimators. Any sufficiently flexible technique should be able to work effectively. We can then use this learned density model to evaluate how similar future datapoints are to the training data.
This approach is explained further in the sketch below, which shows a simplified view of the process. On the left, imagine we have a training dataset; in this case, it consists of only a couple of continuous variables (X1 and X2). We have a small number of sample datapoints scattered around this space of X1 and X2. The dataset has some particular patterns and structure to it (the curved relationship between X1 and X2), perhaps unknown to us. In step 2, we fit a density model to this dataset. For brevity, we omit implementation details here, but the overall goal is to quantify where in the X1-X2 space we saw lots of data and where we saw no data. That is indicated in this sketch by the shaded contours: the darkest shading marks areas where we were very likely to see data, and blank areas show where we didn’t see any data at all. In step 3, we use this trained density model to score any new datapoint in terms of how likely it would have been to appear in the training data. Another way to think about this is that we score a new datapoint on whether it adheres to the typical shapes and patterns in the training set or is abnormal. As an example, one of the datapoints is green because it falls right in line with the “typical” regions of the density model. This datapoint is very similar to other data in the training set. In contrast, the red datapoint is found in a region where none of the training data was ever seen. In this way, this datapoint is an anomaly and is unlike anything in the training set. In technical terminology, this point is said to be “out of distribution” relative to the training set.
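One possible way to make the three steps concrete is sketched here. It is only an illustration under stated assumptions: the curved X1-X2 data is synthetic, a Gaussian kernel density estimate from scikit-learn stands in for the density model (any of the techniques mentioned earlier could take its place), and the bandwidth of 0.2 and the first-percentile cutoff are arbitrary choices of ours, not recommendations.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Step 1: a synthetic training set with a curved relationship between X1 and X2.
x1 = rng.uniform(-2.0, 2.0, size=2000)
x2 = x1 ** 2 + rng.normal(scale=0.2, size=x1.shape)
X_train = np.column_stack([x1, x2])

# Step 2: fit a density model to the joint distribution of (X1, X2).
# Features are standardized so a single bandwidth is reasonable for both axes.
scaler = StandardScaler().fit(X_train)
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(scaler.transform(X_train))


def log_density(points):
    """Log-density of each point under the model fitted to the training set."""
    return kde.score_samples(scaler.transform(np.atleast_2d(points)))


# Assumed cutoff: anything less likely than 99% of the training data is flagged.
threshold = np.percentile(log_density(X_train), 1)

# Step 3: score new datapoints against the learned density.
green_point = np.array([1.0, 1.0])  # lies on the curve
red_point = np.array([0.0, 3.0])    # each coordinate is in range, but it is off the curve

for name, point in [("green", green_point), ("red", red_point)]:
    score = log_density(point)[0]
    status = "out of distribution" if score < threshold else "typical"
    print(f"{name}: log-density={score:.2f} ({status})")
```

Kernel density estimation is chosen here only because it is short to write; it degrades quickly as the number of features grows, which is one reason more flexible models are typically preferred for wide tabular datasets.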
In the example sketched here, note that the univariate drift measures would likely fail to notice the anomalous datapoint. When viewed in a univariate sense (against either X1 or X2), this anomalous datapoint is quite typical. However, because X1 and X2 have a complex structure, we find that the red datapoint is quite different from the training data. When we fail to consider the multivariate case, we can miss many subtle shifts where the production data falls off the data manifold.
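The contrast between the univariate and multivariate views can be illustrated with a small experiment. The setup is entirely hypothetical: we generate synthetic curved data, then build a "drifted" batch by shuffling X2 independently of X1, which preserves each marginal distribution while destroying the joint structure. A Gaussian mixture model is used as the density model purely as a stand-in, and the eight components and first-percentile cutoff are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)


def sample_curved(n):
    """Draw n synthetic points with a curved X1-X2 relationship."""
    x1 = rng.uniform(-2.0, 2.0, size=n)
    x2 = x1 ** 2 + rng.normal(scale=0.2, size=n)
    return np.column_stack([x1, x2])


X_train = sample_curved(2000)
X_ok = sample_curved(2000)  # fresh batch from the same joint distribution

# Drifted batch: same marginals, but X2 is shuffled so it no longer follows X1.
X_drift = sample_curved(2000)
X_drift[:, 1] = rng.permutation(X_drift[:, 1])

# Univariate view: compare each feature separately against the training set.
ks_x1 = ks_2samp(X_train[:, 0], X_drift[:, 0]).statistic
ks_x2 = ks_2samp(X_train[:, 1], X_drift[:, 1]).statistic

# Multivariate view: score each row under a density model of the joint distribution.
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(X_train)
threshold = np.percentile(gmm.score_samples(X_train), 1)


def flagged_fraction(batch):
    """Share of rows whose log-likelihood falls below the training cutoff."""
    return float(np.mean(gmm.score_samples(batch) < threshold))


print(f"KS statistic, X1: {ks_x1:.3f}  X2: {ks_x2:.3f}")
print(f"Flagged in fresh batch:   {flagged_fraction(X_ok):.1%}")
print(f"Flagged in drifted batch: {flagged_fraction(X_drift):.1%}")
```

Because the shuffled batch has the same per-feature distributions as the training set, the per-feature KS statistics stay small, while the share of rows that the joint density model considers unlikely rises sharply.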
This form of out-of-distribution detection is an important part of monitoring the health of a machine learning system. It is becoming increasingly important that ML models have an understanding of their own uncertainty and confidence. In many cases, this amounts to a model’s uncertainty over its predictions, given an input. With out-of-distribution detection, however, we can understand what the model thinks the world looks like and flag when things are quite a bit different. This is useful because complex ML models are often overconfident in their predictions, especially for data that is unlike what they were trained on. By considering whether each input is in-sample or out-of-sample, we can better judge when to trust a model’s prediction and when to be more wary.
In this post, we introduced ideas about out-of-distribution detection in the context of tabular data. But this problem is pervasive across all types of machine learning. In future posts, we will dive into these ideas for computer vision models and NLP models.