There are a lot of decisions to make when designing and configuring a deep learning model. Most of these decisions must be resolved empirically through trial and error and evaluating them on real data.
This includes high-level decisions like the number, size, and type of layers in your network. It also includes lower-level decisions like the choice of loss function, activation functions, optimizer, batch size, and number of epochs.
As such, it is critically important to have a robust way to evaluate and diagnose the performance of your neural networks and deep learning models.
Often, large amounts of data and the complexity of the models require very long training times. As such, it is typical to use a simple separation of data into training and test datasets, or training and validation datasets.
Keras provides $2$ convenient ways of evaluating your deep learning models in this way: using an automatic validation dataset, or using a manually specified validation dataset.
Keras can separate a portion of your training data into a validation dataset and evaluate the performance of your model on that validation dataset at the end of each epoch. You can do this by setting the validation_split argument on the fit() function to the fraction of your training data to hold back for validation.
For example, a reasonable value might be $0.2$ or $0.33$ for $20\%$ or $33\%$ of your training data held back for validation.
The example below demonstrates the use of an automatic validation dataset on a small binary classification problem. For this, we use the Pima Indians diabetes dataset.
import numpy as np
# Fix random seed for reproducibility
np.random.seed(0)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
dataset = np.loadtxt(url, delimiter=",")
print(dataset.shape)
print(dataset[:5])
# Split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]
# Import Keras
from keras.models import Sequential
from keras.layers import Dense
# Create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, verbose=1)
You can see that the verbose output on each epoch shows the loss and accuracy on both the training dataset and the validation dataset.
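The same per-epoch values are also available programmatically: fit() returns a History object whose history attribute is a dictionary of the recorded metrics. A minimal sketch, assuming the fit() call above is assigned to a variable:
# Capture the History object returned by fit()
history = model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, verbose=0)
# Keys include 'loss', 'val_loss' and the accuracy metric
# (the accuracy key name varies by Keras version, e.g. 'acc' vs 'accuracy')
print(history.history.keys())
print(history.history['val_loss'][-1])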
Keras also allows you to manually specify the dataset to use for validation during training.
In this example, we use the train_test_split() function from the scikit-learn Python library to separate our data into a training and a test dataset. We use $67\%$ for training and the remaining $33\%$ of the data for validation.
The validation dataset can be specified to the fit() function in Keras via the validation_data argument, which takes a tuple of the input and output datasets.
from sklearn.model_selection import train_test_split
# Split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# Re-create the model so training starts from fresh weights
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test,y_test), epochs=150, batch_size=10, verbose=1)
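After training, you can also score the model on the held-out data directly with the evaluate() function; a short sketch using the split created above:
# Evaluate the trained model on the held-out test set
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f'Test loss: {loss:.3f}, test accuracy: {acc*100:.2f}%')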
The gold standard for machine learning model evaluation is k-fold cross validation. It provides a robust estimate of the performance of a model on unseen data. It does this by splitting the training dataset into $k$ subsets, taking turns training models on all subsets except one, which is held out, and evaluating model performance on the held-out validation subset. The process is repeated until every subset has had an opportunity to be the held-out validation set. The performance measure is then averaged across all of the models that are created.
Cross validation is often not used for evaluating deep learning models because of the greater computational expense. For example, k-fold cross validation is often used with $5$ or $10$ folds. As such, $5$ or $10$ models must be constructed and evaluated, greatly adding to the evaluation time of a model.
Nevertheless, when the problem is small enough or if you have sufficient compute resources, k-fold cross validation can give you a less biased estimate of the performance of your model.
In the example below, we use the StratifiedKFold class from the scikit-learn library to split the training dataset into 10 folds. The folds are stratified, meaning that each fold aims to preserve the proportion of instances of each class found in the whole dataset.
The example creates and evaluates $10$ models using the $10$ splits of the data and collects all of the scores. The verbose output for each epoch is turned off by passing verbose=0 to the fit() and evaluate() functions on the model.
The performance is printed and stored for each model. The average and standard deviation of the model performance is then printed at the end of the run to provide a robust estimate of model accuracy.
from sklearn.model_selection import StratifiedKFold
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
cnt = 0
cvscores = []
for train, test in kfold.split(X, y):
    cnt += 1
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X[train], y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], y[test], verbose=0)
    print(f'Fold {cnt} {model.metrics_names[1]}: {round(scores[1]*100, 3)}%')
    cvscores.append(scores[1] * 100)
print(f'\n{model.metrics_names[1]}: {round(np.mean(cvscores), 3)}%, (+/- {round(np.std(cvscores), 3)})')
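If you prefer to let scikit-learn drive the cross-validation loop, older versions of Keras ship a scikit-learn wrapper (in newer setups the equivalent KerasClassifier lives in the separate scikeras package). A sketch under that assumption:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
def create_model():
    # Same architecture as used in the manual loop above
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# Wrap the model so scikit-learn can create and fit a fresh copy per fold
estimator = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)
results = cross_val_score(estimator, X, y, cv=kfold)
print(f'{round(results.mean()*100, 3)}% (+/- {round(results.std()*100, 3)})')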
The Keras library provides a way to calculate and report on a suite of standard metrics when training deep learning models.
In addition to offering standard metrics for classification and regression problems, Keras also allows you to define and report on your own custom metrics when training deep learning models. This is particularly useful if you want to keep track of a performance measure that better captures the skill of your model during training.
Keras allows you to list the metrics to monitor during the training of your model.
You can do this by specifying the metrics argument and providing a list of function names (or function name aliases) to the compile() function on your model.
The specific metrics that you list can be the names of Keras functions (like mean_squared_error) or string aliases for those functions (like mse).
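For example, assuming a model variable as defined earlier, the following two compile() calls request the same metric:
# Full function name
model.compile(loss='mse', optimizer='adam', metrics=['mean_squared_error'])
# Equivalent string alias
model.compile(loss='mse', optimizer='adam', metrics=['mse'])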
Metric values are recorded at the end of each epoch on the training dataset. If a validation dataset is also provided, then the metric recorded is also calculated for the validation dataset.
All metrics are reported in the verbose output and in the history object returned from calling the fit() function. In both cases, the name of the metric function is used as the key for the metric values. In the case of metrics for the validation dataset, the "val_" prefix is added to the key.
Both loss functions and explicitly defined Keras metrics can be used as training metrics.
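For example, you can report a loss function such as binary cross entropy as an additional metric while also optimizing it as the loss; a minimal sketch assuming the binary classification model defined earlier:
# Track binary cross entropy both as the loss and as a reported metric
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['binary_crossentropy', 'accuracy'])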
Keras supports a number of metrics that you can use on regression problems, including Mean Squared Error (mse), Mean Absolute Error (mae), Mean Absolute Percentage Error (mape), and Cosine Proximity (cosine).
The example below demonstrates these 4 built-in regression metrics on a simple contrived regression problem.
import matplotlib.pyplot as plt
# prepare sequence
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
# create model
model = Sequential()
model.add(Dense(2, input_dim=1))
model.add(Dense(1))
# 'cosine' is the cosine proximity metric (the alias name varies by Keras version)
model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae', 'mape', 'cosine'])
# train model
history = model.fit(X, X, epochs=500, batch_size=len(X), verbose=0)
# Collect the recorded metric keys: 'loss' followed by the four metrics
# (exact key names vary by Keras version, e.g. 'mean_squared_error' vs 'mse')
keys = list(history.history.keys())
# plot metrics
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(13,4))
ax[0].plot(history.history[keys[1]], label=keys[1])
ax[0].plot(history.history[keys[2]], label=keys[2])
ax[1].plot(history.history[keys[3]], label=keys[3], color='g')
ax[2].plot(history.history[keys[4]], label=keys[4], color='r')
ax[0].legend()
ax[1].legend()
ax[2].legend()
plt.tight_layout()
plt.show()
Keras also provides a suite of metrics that you can use for classification problems, including binary accuracy, categorical accuracy, sparse categorical accuracy, and top-k categorical accuracy.
Regardless of whether your problem is a binary or multi-class classification problem, you can specify the accuracy metric to report on accuracy.
Below is an example of a binary classification problem with the built-in accuracy metric demonstrated.
# Prepare sequence
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# create model
model = Sequential()
model.add(Dense(2, input_dim=1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# train model
history = model.fit(X, y, epochs=400, batch_size=len(X), verbose=0)
# plot the accuracy metric (the history key may be 'acc' in older Keras versions)
plt.plot(history.history['accuracy'])
plt.show()
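The same accuracy metric also works for multi-class problems. The sketch below is a contrived three-class example (not part of the original tutorial; the arrays X_mc and y_mc are made up for illustration) that uses integer class labels and sparse categorical cross entropy, in which case 'accuracy' resolves to sparse categorical accuracy:
# Contrived three-class problem with integer labels
X_mc = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
y_mc = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
model = Sequential()
model.add(Dense(8, input_dim=1, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_mc, y_mc, epochs=400, batch_size=len(X_mc), verbose=0)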
You can also define your own metrics and specify the function name in the list of functions for the metrics argument when calling the compile() function.
A useful regression metric to keep track of that is not available as a built-in string alias in older versions of the Keras API is Root Mean Squared Error, or RMSE.
You can get an idea of how to write a custom metric by examining the code for an existing metric.
import keras.backend as K
# Example of the mean_squared_error loss function and metric in Keras
def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)
Here, $K$ is the Keras backend module, which exposes the low-level math functions of the backend used by Keras.
From this example and other examples of loss functions and metrics, the approach is to use standard math functions on the backend to calculate the metric of interest.
For example, we can write a custom metric to calculate RMSE as follows:
def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))
You can see that the function is the same code as MSE with the addition of the sqrt() function wrapping the result.
We can test this in our regression example as follows. Note that we simply list the function name directly rather than providing it as a string or alias for Keras to resolve.
# prepare sequence
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
# create model
model = Sequential()
model.add(Dense(2, input_dim=1, activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam', metrics=[rmse])
# train model
history = model.fit(X, X, epochs=500, batch_size=len(X), verbose=0)
# plot metrics
plt.plot(history.history['rmse'], label='RMSE')
plt.legend()
plt.show()
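One caveat worth noting: if you save a model that was compiled with a custom metric, the function must be supplied again via the custom_objects argument when the model is loaded. A minimal sketch (the file name is arbitrary):
from keras.models import load_model
# Save the trained model to disk
model.save('model_with_rmse.h5')
# Pass the custom metric when loading, otherwise deserialization fails
loaded = load_model('model_with_rmse.h5', custom_objects={'rmse': rmse})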
A learning curve is a plot of model learning performance over experience or time.
Learning curves are a widely used diagnostic tool in machine learning for algorithms that learn from a training dataset incrementally. The model can be evaluated on the training dataset and on a hold-out validation dataset after each update during training, and plots of the measured performance can be created to show learning curves.
Reviewing learning curves of models during training can be used to diagnose problems with learning, such as an underfit or overfit model, as well as whether the training and validation datasets are suitably representative.
Generally, a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis.
Train Learning Curve: Learning curve calculated from the training dataset that gives an idea of how well the model is learning.
Validation Learning Curve: Learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing.
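In Keras, both curves can be produced from the History object returned by fit() when a validation dataset is used. A minimal sketch, assuming a compiled model and training arrays X and y:
# Fit with a validation split so both train and validation loss are recorded
history = model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, verbose=0)
# Plot the train and validation learning curves
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()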
The shape and dynamics of a learning curve can be used to diagnose the behavior of a machine learning model and, in turn, perhaps suggest the type of configuration changes that may be made to improve learning and/or performance.
The $3$ common dynamics that you are likely to observe in learning curves are: underfit, overfit, and good fit.
We will take a closer look at each with examples. The examples will assume that we are looking at a minimizing metric, meaning that smaller relative scores on the y-axis indicate better learning.
Underfitting refers to a model that cannot learn the training dataset.
An underfit model can be identified from the learning curve of the training loss only.
It may show a flat line or noisy values of relatively high loss, indicating that the model was unable to learn the training dataset at all.
An example of this is provided below and is common when the model does not have a suitable capacity (nodes and layers) for the complexity of the dataset.
An underfit model may also be identified by a training loss that is decreasing and continues to decrease at the end of the plot.
This indicates that the model is capable of further learning and possible further improvements and that the training process was halted prematurely.
A plot of learning curves shows underfitting if the training loss remains flat regardless of training, or if the training loss continues to decrease until the end of training.
Overfitting refers to a model that has learned the training dataset too well, including the statistical noise or random fluctuations in the training dataset.
The problem with overfitting is that the more specialized the model becomes to the training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. This increase in generalization error can be measured by the performance of the model on the validation dataset.
This often occurs if the model has more capacity (layers and neurons) than is required for the problem, and, in turn, too much flexibility. It can also occur if the model is trained for too long.
A plot of learning curves shows overfitting if the training loss continues to decrease with experience while the validation loss decreases to a point and then begins increasing again.
The inflection point in validation loss may be the point at which training could be halted as experience after that point shows the dynamics of overfitting.
The example plot below demonstrates a case of overfitting.
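One practical way to halt training near that inflection point is the EarlyStopping callback in Keras, which stops training when the monitored validation loss stops improving. A minimal sketch, assuming a compiled model and training arrays X and y (restore_best_weights requires a reasonably recent Keras version):
from keras.callbacks import EarlyStopping
# Stop training when validation loss has not improved for 10 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X, y, validation_split=0.33, epochs=500, batch_size=10,
          callbacks=[early_stop], verbose=0)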
A good fit is the goal of the learning algorithm and exists between an overfit and underfit model.
A good fit is identified by a training and validation loss that decreases to a point of stability with a minimal gap between the two final loss values.
The loss of the model will almost always be lower on the training dataset than the validation dataset. This means that we should expect some gap between the train and validation loss learning curves. This gap is referred to as the generalization gap.
A plot of learning curves shows a good fit if the training loss and the validation loss both decrease to a point of stability, with only a minimal gap between the two final loss values.
Continued training of a good fit will likely lead to an overfit.
The example plot below demonstrates a case of a good fit.
Learning curves can also be used to diagnose properties of a dataset and whether it is relatively representative.
An unrepresentative dataset means a dataset that may not capture the statistical characteristics relative to another dataset drawn from the same domain, such as between a train and a validation dataset. This can commonly occur if the number of samples in a dataset is too small, relative to another dataset.
There are two common cases that could be observed: a training dataset that is relatively unrepresentative, and a validation dataset that is relatively unrepresentative.
An unrepresentative training dataset means that the training dataset does not provide sufficient information to learn the problem, relative to the validation dataset used to evaluate it.
This may occur if the training dataset has too few examples as compared to the validation dataset.
This situation can be identified by a learning curve for training loss that shows improvement and similarly a learning curve for validation loss that shows improvement, but a large gap remains between both curves.
Example of Train and Validation Learning Curves Showing a Training Dataset That May Be too Small Relative to the Validation Dataset
An unrepresentative validation dataset means that the validation dataset does not provide sufficient information to evaluate the ability of the model to generalize.
This may occur if the validation dataset has too few examples as compared to the training dataset.
This case can be identified by a learning curve for training loss that looks like a good fit (or other fits) and a learning curve for validation loss that shows noisy movements around the training loss.
Example of Train and Validation Learning Curves Showing a Validation Dataset That May Be too Small Relative to the Training Dataset
It may also be identified by a validation loss that is lower than the training loss. In this case, it indicates that the validation dataset may be easier for the model to predict than the training dataset.
Example of Train and Validation Learning Curves Showing a Validation Dataset That Is Easier to Predict Than the Training Dataset