Many classification and regression models are capable of capturing complex relationships; they are highly adaptable. However, all of them can very easily overemphasize patterns that are not reproducible (much like seeing a face hidden in a cup of coffee). This is a serious problem because the modeler does not find out until the next set of samples is predicted. In fact, over-fitting is a concern for any predictive model, regardless of the field of research, so a methodological approach to evaluating models is required.
I’m going to describe the main strategies that give us confidence that the model will predict new samples with roughly the same accuracy it achieved on the data used to evaluate it. As you probably know, without this confidence the model’s predictions are useless.
On a practical note, all model building efforts are constrained by the existing data. For many problems, the data may have a limited number of samples, may be of less-than-desirable quality, and/or may be unrepresentative of future samples. While there are ways to build predictive models on small data sets, we must assume that the data quality is sufficient and that the data are representative of the entire sample population.
Under these assumptions, we use the data at hand to find the best predictive model. Almost all predictive modeling techniques have tuning parameters that give the model the flexibility to find structure in the data. Hence, we use the existing data to identify settings for the model’s parameters that yield the best and most realistic predictive performance (a process known as model tuning).
Traditionally, this has been achieved by splitting the existing data into training and test sets.
- The training set is used to build and tune the model and
- the test set is used to estimate the model’s predictive performance.
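As a minimal sketch of this split, here is how it might look with scikit-learn’s `train_test_split` (the iris data set and the 25% test fraction are illustrative choices, not from the text):

```python
# Hypothetical example: a single training/test split with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes

# Hold out 25% of the samples as a test set. The split is randomized,
# so fixing random_state makes it reproducible; stratify=y keeps the
# class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 112 training samples, 38 test samples
```

The model would then be built and tuned on `X_train`/`y_train` only, and `X_test`/`y_test` would be touched a single time, to estimate predictive performance.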
Modern approaches to model building split the data into multiple training and testing sets, which has been shown to often find better tuning parameter values and to give a more accurate picture of the model’s predictive performance.
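One common way to produce multiple training/testing sets is k-fold cross-validation. A hedged sketch, again using scikit-learn and an illustrative data set and model:

```python
# Hypothetical example: five training/testing splits via 5-fold
# cross-validation, instead of a single train/test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Each of the 5 folds is held out once while the model is fit on the
# remaining 4 folds, yielding five performance estimates.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)

print(scores)         # one accuracy per held-out fold
print(scores.mean())  # averaged estimate of predictive performance
```

Averaging over several held-out folds typically gives a less noisy performance estimate than any single split.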
To avoid over-fitting, we propose a general model building approach that encompasses model tuning and model evaluation with the ultimate goal of finding the reproducible structure in the data. This approach entails splitting existing data into distinct sets for the purposes of tuning model parameters and evaluating model performance. The choice of data splitting method depends on characteristics of the existing data such as its size and structure.
There are many techniques that can learn the structure of a set of data so well that, when the model is applied to the data on which it was built, it “correctly” predicts every sample of the training data set. In addition to learning the general patterns in the data, the model has also learned the characteristics of each training sample’s unique noise. This type of model is said to be over-fit and will usually have poor accuracy when predicting a new sample (that is, a sample from the test set).
Many models have important parameters which cannot be directly estimated from the data. For example, in the K-nearest neighbor (KNN) classification model, a new sample is predicted based on the K closest data points in the training set.
In the previous image, two new samples (denoted by the solid dot and filled triangle) are being predicted.
- One sample (•) is near a mixture of the two classes; three of its five nearest neighbors indicate that it should be predicted as the first class.
- The other sample (▲) has all five of its nearest neighbors indicating that the second class should be predicted.
The question is: how many neighbors should be used? Too few neighbors may over-fit the individual points of the training set, while too many may not be sensitive enough to yield reasonable performance.
This type of model parameter is referred to as a tuning parameter because there is no analytical formula available to calculate an appropriate value.
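Since K cannot be computed analytically, it is typically chosen by trying several candidate values and keeping the one with the best cross-validated performance. A sketch of this grid search with scikit-learn (the candidate values of K are illustrative):

```python
# Hypothetical example: tuning K for a KNN classifier via grid search
# with cross-validation, since no formula gives the "right" K.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate each candidate K with 5-fold cross-validation and keep the
# value with the best average held-out accuracy.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the selected number of neighbors
```

The same pattern applies to any tuning parameter: propose candidates, resample, and pick the value with the best held-out performance.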
Do you know whether your model is over-fit? How can you tell? And how can you fix the problem without giving up too much of the model’s performance?