The longer we train the network, the more specialized the weights become to the training data, overfitting it. The weights grow in size in order to handle the specifics of the examples seen in the training data.
Large weights make the network unstable. Although the weights will be specialized to the training dataset, minor variation or statistical noise in the expected inputs will result in large differences in the output.
The learning algorithm can be updated to encourage the network to use small weights. One way to do this is to change the calculation of loss used in the optimization of the network to also consider the size of the weights.
In calculating the loss between the predicted and expected values in a batch, we can add the current size of all weights in the network, or in a given layer, to this calculation. This is called a penalty because we are penalizing the model in proportion to the size of its weights.
Larger weights result in a larger penalty, in the form of a larger loss score. The optimization algorithm will then push the model to have smaller weights, i.e. weights no larger than needed to perform well on the training dataset.
Smaller weights are considered more regular or less specialized and as such, we refer to this penalty as weight regularization.
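As a rough sketch of the idea (not how Keras computes it internally), the penalized loss is the data loss plus a weighted sum of the weight sizes; penalized_loss and alpha below are illustrative names:
import numpy as np
# minimal sketch: data loss plus a weight-size penalty scaled by alpha,
# using the sum of absolute weight values as the measure of size
def penalized_loss(data_loss, weights, alpha=0.01):
    penalty = sum(np.abs(w).sum() for w in weights)
    return data_loss + alpha * penalty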
There are two main types of weight regularization: L1, which penalizes the sum of the absolute weights, and L2, which penalizes the sum of the squared weights; the two can also be combined.
Each requires a hyperparameter that must be configured.
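For concreteness, with weights $w_i$ and a penalty strength hyperparameter $\alpha$, the L1 penalty adds $\alpha \sum_i |w_i|$ to the loss, while the L2 penalty adds $\alpha \sum_i w_i^2$.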
Keras provides a weight regularization API that allows you to add a penalty for weight size to the loss function.
By default, no regularizer is used in any layers.
A weight regularizer can be added to each layer when the layer is defined in a Keras model.
This is achieved by setting the kernel_regularizer argument on each layer. A separate regularizer can also be used for the bias via the bias_regularizer argument, although this is less often used.
The regularizers are provided under keras.regularizers and have the names l1, l2, and l1_l2. Each takes the regularizer hyperparameter as an argument. For example:
import keras
keras.regularizers.l1(0.01)
keras.regularizers.l2(0.01)
keras.regularizers.l1_l2(l1=0.01, l2=0.01)
# example of l2 on a dense layer
from keras.regularizers import l1, l2, l1_l2
from keras.layers import Dense
Dense(32, kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01))
Like the Dense layer, the convolutional layers (e.g. Conv1D and Conv2D) also use the kernel_regularizer and bias_regularizer arguments to define a regularizer.
The example below sets an l2 regularizer on a Conv2D convolutional layer:
# example of l2 on a convolutional layer
from keras.layers import Conv2D
Conv2D(32, (3,3), kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01))
Recurrent layers like the LSTM offer more flexibility in regularizing the weights.
The input, recurrent, and bias weights can all be regularized separately via the kernel_regularizer, recurrent_regularizer, and bias_regularizer arguments.
The example below sets an l2 regularizer on an LSTM recurrent layer:
# example of l2 on an lstm layer
from keras.layers import LSTM
LSTM(32, kernel_regularizer=l2(0.01), recurrent_regularizer=l2(0.01), bias_regularizer=l2(0.01))
Embedding layers use the embeddings_regularizer argument to define a regularizer.
The example below sets an l2 regularizer on an Embedding layer:
# example of l2 on an embedding layer
from keras.layers import Embedding
Embedding(input_dim=10, output_dim=5, embeddings_regularizer=l2(0.01))
It can be helpful to look at some examples of weight regularization configurations reported in the literature.
It is important to select and tune a regularization technique specific to your network and dataset, although real examples can also give an idea of common configurations that may be a useful starting point.
MLP Weight Regularization
The classic configuration is an L2 penalty, i.e. weight decay, with values often on a logarithmic scale between $0$ and $0.1$, such as $0.1$, $0.001$, $0.0001$, etc.
CNN Weight Regularization
Weight regularization does not seem widely used in CNN models, or if it is used, its use is not widely reported.
L2 weight regularization with a very small regularization hyperparameter (e.g. $0.0005$ or $5 \times 10^{-4}$) may be a good starting point.
LSTM Weight Regularization
It is common to use weight regularization with LSTM models.
An often-used configuration is L2 (weight decay) with very small hyperparameters (e.g. $10^{-6}$). It is often not reported which weights are regularized (input, recurrent, and/or bias), although one would assume that only the input and recurrent weights are regularized.
We will use a standard binary classification problem that defines two semi-circles of observations: one semi-circle for each class.
Each observation has two input variables with the same scale and a class output value of either $0$ or $1$. This dataset is called the “moons” dataset because of the shape of the observations in each class.
We will use the sklearn make_moons() function to generate observations for this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
# generate two moons dataset
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import pandas as pd
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# scatter plot, dots colored by class value
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
grouped = df.groupby('label')
fig, ax = plt.subplots(figsize=(10,5))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key], s=50, alpha=0.5)
plt.show()
This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.
We have only generated $100$ samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.
The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required, to ensure the model overfits.
Before we define the model, we will split the dataset into train and test sets, using $30$ examples to train the model and $70$ to evaluate the fit model’s performance.
# overfit mlp for the moons dataset
from keras.models import Sequential
from keras.layers import Dense
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
# plot history
plt.plot(history.history['accuracy'], label=f'train, accuracy {round(train_acc, 3)}')
plt.plot(history.history['val_accuracy'], label=f'test, accuracy {round(test_acc, 3)}')
plt.legend()
plt.show()
We can see an expected shape of an overfit model where test accuracy increases to a point and then begins to decrease again.
We can add weight regularization to the hidden layer to reduce the overfitting of the model to the training dataset and improve the performance on the holdout set.
We will use the L2 vector norm, also called weight decay, with a regularization parameter (called alpha or lambda) of $0.001$, chosen arbitrarily.
This can be done by adding the kernel_regularizer argument to the layer and setting it to an instance of l2.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
# plot history
plt.plot(history.history['accuracy'], label=f'train, accuracy {round(train_acc, 3)}')
plt.plot(history.history['val_accuracy'], label=f'test, accuracy {round(test_acc, 3)}')
plt.legend()
plt.show()
As expected, we see the learning curve on the test dataset rise and then plateau, indicating that the model may not have overfit the training dataset.
Once you can confirm that weight regularization may improve your overfit model, you can test different values of the regularization parameter.
It is a good practice to first grid search through some orders of magnitude between $0.0$ and $0.1$, then once a level is found, to grid search on that level.
We can grid search through the orders of magnitude by defining the values to test, looping through each and recording the train and test performance.
# grid search values
values = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
all_train, all_test = list(), list()
for param in values:
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(param)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    model.fit(trainX, trainy, epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Param: %f, Train: %.3f, Test: %.3f' % (param, train_acc, test_acc))
    all_train.append(train_acc)
    all_test.append(test_acc)
# plot train and test means
plt.semilogx(values, all_train, label='train', marker='o')
plt.semilogx(values, all_test, label='test', marker='o')
plt.xlabel('regularization strength')
plt.ylabel('accuracy')
plt.legend()
plt.show()
The results suggest that $0.01$ or $0.001$ may be sufficient and may provide good bounds for further grid searching.
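For example, a finer follow-up grid between those bounds might be defined as follows (the specific values are illustrative), reusing the same model-fitting loop as above:
# finer grid search between 0.001 and 0.01 (illustrative values)
values = [0.001, 0.002, 0.003, 0.005, 0.007, 0.01]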
Deep learning models are capable of automatically learning a rich internal representation from raw input data.
This is called feature or representation learning. Better learned representations, in turn, can lead to better insights into the domain, e.g. via visualization of learned features, and to better predictive models that make use of the learned features.
A problem with learned features is that they can be too specialized to the training data, or overfit, and not generalize well to new examples. Large values in the learned representation can be a sign of the representation being overfit. Activity or representation regularization provides a technique to encourage the learned representations, the output or activation of the hidden layer or layers of the network, to stay small and sparse.
There is a field of study focused on the efficient and effective automatic learning of features, often investigated by having a network reduce an input to a small learned feature before using a second network to reconstruct the original input from the learned feature. Models of this type are called auto-encoders, or encoder-decoders, and their learned features can be useful to learn more about the domain (e.g. via visualization) and in predictive models.
The learned features, or encoded inputs, must be large enough to capture the salient features of the input but also focused enough not to overfit the specific examples in the training dataset. As such, there is a tension between the expressiveness and the generalization of the learned features.
The loss function of the network can be updated to penalize models in proportion to the magnitude of their activations.
This is similar to weight regularization, where the loss function is updated to penalize the model in proportion to the magnitude of the weights. The output of a layer is referred to as its activation; as such, this form of penalty or regularization is referred to as activation regularization or activity regularization.
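For concreteness, with layer activations $a_i$ and strength hyperparameter $\alpha$, an L1 activity penalty adds $\alpha \sum_i |a_i|$ to the loss, encouraging many activations to be exactly or nearly zero (i.e. sparse).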
Activity regularization may be best suited to those model types that explicitly seek an efficient learned representation.
These include models such as autoencoders (i.e. sparse autoencoders) and encoder-decoder models, such as encoder-decoder LSTMs used for sequence-to-sequence prediction problems.
Configure the layer chosen to provide the learned features, e.g. the output of the encoder or the bottleneck in the autoencoder, to have more nodes than may be required.
This is called an overcomplete representation and will encourage the network to overfit the training examples. This can be countered with strong activation regularization, encouraging a rich learned representation that is also sparse, as in the sketch below.
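A minimal sketch of this configuration, assuming illustrative layer sizes (an 8-feature input with a 16-node, i.e. overcomplete, bottleneck):
# sketch: autoencoder with an overcomplete bottleneck kept sparse via l1
# (layer sizes are illustrative assumptions)
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1
model = Sequential()
model.add(Dense(32, input_dim=8, activation='relu'))
# bottleneck with more nodes (16) than inputs (8), i.e. overcomplete
model.add(Dense(16, activation='relu', activity_regularizer=l1(0.001)))
model.add(Dense(8, activation='linear'))
model.compile(loss='mse', optimizer='adam')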
Keras supports activity regularization.
Just like weight regularization, it accepts l1, l2, and l1_l2 regularizers.
Activity regularization is specified on a layer in Keras.
This can be achieved by setting the activity_regularizer argument on the layer to an instantiated and configured regularizer class.
The regularizer is applied to the output of the layer, but you have control over what the “output” of the layer actually means. Specifically, you have flexibility as to whether the regularization is applied before or after the activation function.
For example, you can specify the activation function and the regularization on the same layer, in which case activity regularization is applied to the output of the activation function, in this case the rectified linear activation function (ReLU).
Dense(32, activation='relu', activity_regularizer=l1(0.001))
Alternately, you can specify a linear activation function (the default, which does not perform any transform), meaning that the activity regularization is applied to the raw outputs; the activation function can then be added as a subsequent layer.
from keras.layers import Activation
Dense(32, activation='linear', activity_regularizer=l1(0.001))
Activation('relu')
The latter is the preferred usage of activation regularization as described in “Deep Sparse Rectifier Neural Networks” in order to allow the model to learn to take activations to a true zero value in conjunction with the rectified linear activation function. Nevertheless, the two possible uses of activation regularization may be explored in order to discover what works best for your specific model and dataset.
The example below sets l1 norm activity regularization on a Dense fully connected layer.
Dense(32, activity_regularizer=l1(0.001))
The example below sets l1 norm activity regularization on a Conv2D convolutional layer.
Conv2D(32, (3,3), activity_regularizer=l1(0.001))
The example below sets l1 norm activity regularization on an LSTM recurrent layer.
LSTM(32, activity_regularizer=l1(0.001))
The example below sets l1 norm activity regularization on an embedding layer.
Embedding(10, 5, activity_regularizer=l1(0.001))
In this section, we will demonstrate how to use activity regularization to reduce overfitting of an MLP on the same binary classification problem used in the previous example.
Although activity regularization is most often used to encourage sparse learned representations in autoencoder and encoder-decoder models, it can also be used directly within normal neural networks to achieve the same effect and improve the generalization of the model.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='linear', activity_regularizer=l1(0.0001)))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
# plot history
plt.plot(history.history['accuracy'], label=f'train, accuracy {round(train_acc, 3)}')
plt.plot(history.history['val_accuracy'], label=f'test, accuracy {round(test_acc, 3)}')
plt.legend()
plt.show()
Model accuracy on both the train and test sets continues to increase to a plateau.
Weight constraints provide another approach to reduce overfitting. A weight constraint is an update to the network that checks the size of the weights and, if the size exceeds a predefined limit, rescales the weights so that their size is below the limit or within a range.
You can think of a weight constraint as an if-then rule checking the size of the weights while the network is being trained and only coming into effect and making weights small when required. Note, for efficiency, it does not have to be implemented as an if-then rule and often is not.
Unlike adding a penalty to the loss function, a weight constraint ensures the weights of the network are small, instead of merely encouraging them to be small.
There are multiple types of weight constraints, such as maximum and unit vector norms, and some require a hyperparameter that must be configured.
Vector Norm:
Calculating the size or length of a vector is often required either directly or as part of a broader vector or vector-matrix operation.
The length of the vector is referred to as the vector norm or the vector’s magnitude.
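For example, the common norms, and the rescaling rule described above, can be sketched in numpy (max_norm_constraint and c are illustrative names, not the Keras implementation):
import numpy as np
# compute common vector norms
v = np.array([1.0, -2.0, 3.0])
l1_norm = np.abs(v).sum()    # sum of absolute values: 6.0
l2_norm = np.linalg.norm(v)  # square root of the sum of squares: ~3.742
# sketch of a maximum norm constraint: rescale w if its L2 norm exceeds c
def max_norm_constraint(w, c=3.0):
    norm = np.linalg.norm(w)
    if norm > c:
        w = w * (c / norm)
    return w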
The Keras API supports weight constraints.
The constraints are specified per-layer, but applied and enforced per-node within the layer.
Using a constraint generally involves setting the kernel_constraint argument on the layer for the input weights and the bias_constraint argument for the bias weights.
Generally, weight constraints are not used on the bias weights.
A suite of different vector norms can be used as constraints, provided as classes in the keras.constraints module: max_norm, non_neg, unit_norm, and min_max_norm.
For example, a constraint can be imported and instantiated:
# import norm
from keras.constraints import max_norm
# instantiate norm
norm = max_norm(3.0)
# set a max norm constraint on a dense layer
Dense(32, kernel_constraint=max_norm(3), bias_constraint=max_norm(3))
The example below sets a maximum norm weight constraint on a convolutional layer.
Conv2D(32, (3,3), kernel_constraint=max_norm(3), bias_constraint=max_norm(3))
Unlike other layer types, recurrent neural networks allow you to set a weight constraint on both the input weights and bias, as well as the recurrent input weights.
The constraint for the recurrent weights is set via the recurrent_constraint argument to the layer.
The example below sets a maximum norm weight constraint on an LSTM layer.
LSTM(32, kernel_constraint=max_norm(3), recurrent_constraint=max_norm(3), bias_constraint=max_norm(3))
The example below sets a maximum norm weight constraint on an Embedding layer.
Embedding(10, 5, embeddings_constraint=max_norm(3))
In this section, we will demonstrate how to use weight constraints to reduce overfitting of an MLP on the same binary classification problem used in the previous example.
There are a few different weight constraints to choose from. A good simple constraint for this model is to normalize the weights so that the norm is equal to $1.0$.
This constraint has the effect of forcing all incoming weights to be small.
We can do this by using the unit_norm constraint in Keras. This constraint can be added to the first hidden layer.
from keras.constraints import unit_norm
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
# plot history
plt.plot(history.history['accuracy'], label=f'train, accuracy {round(train_acc, 3)}')
plt.plot(history.history['val_accuracy'], label=f'test, accuracy {round(test_acc, 3)}')
plt.legend()
plt.show()
Model accuracy on both the train and test sets continues to increase to a plateau.
A problem with training neural networks is in the choice of the number of training epochs to use.
Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping is a method that allows you to specify an arbitrarily large number of training epochs and stop training once the model’s performance stops improving on a holdout validation dataset.
Keras supports the early stopping of training via a callback called EarlyStopping.
This callback allows you to specify the performance measure to monitor and the trigger; once triggered, it will stop the training process.
The EarlyStopping callback is configured via a number of arguments when it is instantiated.
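For example, common arguments include the measure to monitor, the direction of improvement, a minimum change that counts as an improvement, and a patience; the values below are illustrative:
from keras.callbacks import EarlyStopping
# stop when validation loss fails to improve by at least min_delta
# for 'patience' consecutive epochs
es = EarlyStopping(monitor='val_loss', mode='min', min_delta=0.0, patience=10, verbose=1)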
First, we build a model without early stopping, using a large number of epochs to encourage overfitting, on the same dataset used in the previous examples.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
# plot history
plt.plot(history.history['loss'], label=f'train loss, accuracy {round(train_acc, 3)}')
plt.plot(history.history['val_loss'], label=f'test loss, accuracy {round(test_acc, 3)}')
plt.legend()
plt.show()
Reviewing the figure, we can also see flat spots in the ups and downs of the validation loss. Any early stopping will have to account for these behaviors. We would also expect that a good time to stop training might be around epoch $800$.
from keras.callbacks import EarlyStopping
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# simple early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
# plot history
plt.plot(history.history['loss'], label=f'train loss, accuracy {round(train_acc, 3)}')
plt.plot(history.history['val_loss'], label=f'test loss, accuracy {round(test_acc, 3)}')
plt.legend()
plt.show()
Reviewing the line plot of train and test loss, we can indeed see that training was stopped at the point when validation loss began to plateau for the first time.
We can improve the trigger for early stopping by waiting a while before stopping.
This can be achieved by setting the “patience” argument.
In this case, we will wait $200$ epochs before training is stopped. Specifically, this means that we will allow training to continue for up to an additional $200$ epochs after the point that validation loss started to degrade, giving the training process an opportunity to get across flat spots or find some additional improvement.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
# plot training history
plt.plot(history.history['loss'], label=f'train loss, accuracy {round(train_acc, 3)}')
plt.plot(history.history['val_loss'], label=f'test loss, accuracy {round(test_acc, 3)}')
plt.legend()
plt.show()
We can also see that test loss started to increase again in the last approximately $100$ epochs.
Although the performance of the model has improved, we may not have the best-performing or most stable model at the end of training. We can address this by using a ModelCheckpoint callback.
There are a number of parameters that can be specified to the ModelCheckpoint object.
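For example (the filepath and argument values below are illustrative):
from keras.callbacks import ModelCheckpoint
# keep only the single best model seen so far, judged by validation accuracy
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', save_best_only=True, verbose=1)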
In this case, we are interested in saving the model with the best accuracy on the test dataset. We could also seek the model with the best loss on the test dataset, but this may or may not correspond to the model with the best accuracy.
This highlights an important concept in model selection. The notion of the “best” model during training may conflict when evaluated using different performance measures. Try to choose models based on the metric by which they will be evaluated and presented in the domain. In a balanced binary classification problem, this will most likely be classification accuracy. Therefore, we will use accuracy on the validation dataset in the ModelCheckpoint callback to save the best model observed during training.
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# simple early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es, mc])
# load the saved model
saved_model = load_model('best_model.h5')
# evaluate the model
_, train_acc = saved_model.evaluate(trainX, trainy, verbose=0)
_, test_acc = saved_model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
In this case, we don’t see any further improvement in model accuracy on the test dataset. Nevertheless, we have followed a good practice.
Why not monitor validation accuracy for early stopping?
The main reason is that accuracy is a coarse measure of model performance during training; loss provides more nuance when using early stopping with classification problems. In the case of regression, the same measure, such as mean squared error, may be used for both early stopping and model checkpointing.
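For instance, a regression model might drive both callbacks with the validation loss (mean squared error). A minimal sketch, assuming illustrative layer sizes and the same callback API shown above:
# sketch: for regression, the same measure (val_loss, here MSE) can drive
# both early stopping and model checkpointing (layer sizes are illustrative)
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint
model = Sequential()
model.add(Dense(32, input_dim=4, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')
es = EarlyStopping(monitor='val_loss', mode='min', patience=50)
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True)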