While training a model, a few important mathematical tricks have to be employed. Normalization and regularization techniques are among them. They are essential for training models successfully, yet they are often confusing and their details can be hard to grasp. Additionally, learning rate scheduling also matters a great deal when training models, especially large ones.

Normalization vs Regularization

To put it simply, normalization is applied to data. Before training a model, the data should be normalized. There are a few compelling benefits: it stabilizes the training process, speeds up model convergence significantly, and improves the model's generalization capability.

Regularization is a model tuning technique whose main purpose is to prevent overfitting during training. It can be applied to the training data, or it can add constraints to the learnable parameters. It also has the effect of improving the model's generalization. Regularization techniques are very flexible.

(Figure: Normalization and Regularization)

Both normalization and regularization can be employed in the model training process, and their intensities can partly cancel each other out: strengthening one may reduce the need for the other.

Generalization

Classical generalization theory suggests that to close the gap between train and test performance, we should aim for a simple model. In what sense is a model simple?

  • fewer parameters (lower dimensions)
  • smoothness (not sensitive to small changes)

However, modern deep learning models are becoming larger and larger, which challenges this classical theory. Some network architectures and regularization techniques could arguably justify over-parameterized models. E.g. a skip connection might bypass a whole block, effectively leaving the parameters inside the skipped block unused. E.g. dropout does not use all parameters during training.

Underfit and Overfit

  • overfit: a model fits the training data perfectly, but fails to predict unseen data.
  • underfit: the opposite of overfitting; the model is not trained enough, or it is too small to learn enough.

In deep learning, a few common causes of overfitting are:

  • insufficient training data
  • low-quality training data (containing errors)
  • a model that is too large or too complex
  • lack of regularization

Curse of Dimensionality

The curse of dimensionality refers to a set of problems that arise when dealing with data in high-dimensional spaces (i.e., datasets with a large number of features or dimensions, such as image data). Typically, the volume of the data space grows exponentially with the number of dimensions. With a fixed number of data points, these points become increasingly sparse and spread out in a high-dimensional space. The curse of dimensionality significantly increases the risk of overfitting. In a high-dimensional space with sparse data, a complex model has more “degrees of freedom” to fit the training data perfectly, including the noise. Because there are fewer data points relative to the number of dimensions, the model can easily find spurious patterns that don’t generalize. (The amount of data required therefore grows with the data’s dimensionality.)

A side note about high-dimensional space: I recently realized that its density is still uniform everywhere, just like 2D or 3D space; it is the data, not the space, that becomes sparse.

Benign Overfitting

Interestingly, modern deep learning has challenged traditional views on overfitting. Very large models trained on large datasets can sometimes generalize well even when they have enough capacity to memorize the entire training set. This phenomenon, sometimes called benign overfitting, suggests the relationship between model size, data, and overfitting is more complex than previously thought.

Benign overfitting is a phenomenon in machine learning where a model perfectly fits (interpolates) the training data, including its noise or random fluctuations, yet still generalizes well to unseen test data. This concept challenges the traditional view in statistical learning theory, which posits that fitting the noise in the training data (overfitting) should lead to a significant deterioration in performance on new data.

Benign overfitting is typically observed in overparameterized models, such as modern deep neural networks, where the number of parameters is much larger than the number of data points. It is closely related to the double descent phenomenon.

Normalization

Normalization is applied to data. Other terms for data normalization are feature scaling and standardization. There are two main approaches:

  • Min-Max scaling. Scale feature values to the $[0,1]$ range. $\left(x_s=\cfrac{x-x_{min}}{x_{max}-x_{min}}\right)$
  • Standardization (Z-score). Rescale feature values to have mean $0$ and standard deviation $s$ of $1$; the result is not confined to a fixed range such as $[-1,1]$, but most values end up within a few standard deviations of $0$. $\left(z = \cfrac{x-\overline x}{s}\right)$ (A small sketch of both follows.)
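
Below is a minimal sketch of both approaches in PyTorch; the tensor is just a toy example.

import torch

x = torch.tensor([1., 2., 3., 4., 10.])      # toy feature values

# Min-Max scaling into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (Z-score): mean 0, standard deviation 1
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # tensor([0.0000, 0.1111, 0.2222, 0.3333, 1.0000])
print(x_zscore)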

After normalization, each feature loses its real-world unit and is forced into the same small range. The benefits of data normalization:

  • Ensure each feature contributes equally. (Different features might have different ranges and units.)
  • Stabilize the training process.
  • Speed up the model’s convergence.
  • Improve generalization and robustness.

Data Leakage

The most common pitfall of data normalization is data leakage. This happens when data normalization is done before the data is split (into train, validation and test sets).

There are two “correct” sequences to do data normalization:

  1. (very common) First split the data into train and test parts, then normalize the whole train part. Once you have the mean and standard deviation (std) of the whole train set, apply them to the test set, and split the train set into train and validation.
  2. (more rigorous) First split the data into train, validation and test parts. Then normalize the train set and calculate its mean and std. Finally, apply that mean and std to the validation and test sets.

Approach 1 is acceptable because the difference is minimal and both the train and validation sets are used for model training and tuning; performance on the validation set is not the final interest. Both approaches keep the test set out of the mean and std calculation. That’s important! I think another reason approach 1 is so common is that it’s more convenient and requires less code. E.g. the mean and std values for the CIFAR-10 dataset found on the Internet are calculated on the whole train part, as in approach 1.

# They are calculated from the whole train part.
# RGB channels, each channel is treated as a feature.
# Rescale [0,255] to [0,1] and then calculate the z-score!
cifar10_mean = (0.4914, 0.4822, 0.4465)
cifar10_std  = (0.2470, 0.2435, 0.2616)
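
As a rough sketch, these per-channel statistics could be computed with torchvision like this (assuming the dataset is downloaded to ./data):

import torch
from torchvision import datasets, transforms

# ToTensor() already rescales pixel values from [0,255] to [0,1].
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
# Stack all images into one tensor of shape (N, 3, 32, 32).
data = torch.stack([img for img, _ in train_set])
# Per-channel mean and std, treating each RGB channel as one feature.
print(data.mean(dim=(0, 2, 3)))   # roughly (0.4914, 0.4822, 0.4465)
print(data.std(dim=(0, 2, 3)))    # roughly (0.2470, 0.2435, 0.2616)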

Validation Dataset

How do we decide how many data points to put into the validation set? Generally speaking, the smaller the dataset, the larger the percentage should be. The key point is that the validation set must be representative of the whole dataset. The same goes for the test set.

For the ImageNet-1K dataset, people normally use the validation part as test data, since the labels of the test part are not publicly available. Training can also be done without carving a separate validation set out of the train part, as in the original ResNet paper, because the training plan is fixed ahead of time.

Batch Normalization

Batch normalization is not a data pre-processing step. It’s employed as a learnable layer in neural networks, especially CNNs. To stabilize and speed up model training, batch normalization is commonly applied right after convolutional layers.

Deep neural networks have multiple layers, and after each layer’s computation the input to the next layer is shifted. Since we know the benefits of data normalization and apply it to the input data, why not apply it to the inputs of the hidden layers as well? That’s the intuition behind batch normalization.

$$y=\gamma\cdot\cfrac{x-\overline x}{s+\epsilon}+\beta$$

The input is $x$, and $y$ is the output. Batch normalization calculates $\overline x$ and $s$ over the mini-batch. $\epsilon$ is a small fixed number used to prevent division by zero. $\gamma$ and $\beta$ are two learnable parameters.

Batch normalization requires a mini-batch during model training. Some say the best mini-batch size is between 32 and 128; it depends! Batch normalization is usually placed before the activation layer, since experiments show better performance with this ordering.

Conv → BatchNorm → Activation (ReLU/GELU/SiLU etc.)

In CNNs on image tasks, batch normalization is applied per channel (feature): one pair of $\gamma$ and $\beta$ for each channel. In fully connected layers, batch normalization is likewise applied per feature (neuron): one pair of $\gamma$ and $\beta$ for each neuron.

Why is there a pair of learnable parameters for each feature? Because we don’t know which input distribution is best for which layer. These two parameters let the model learn the best input distribution, and can even undo the batch normalization.

Another key point in batch normalization is the running mean and std. They exist because of the difference between model training and prediction. Training is executed in mini-batches, but at prediction time there is most likely only one input at a time, and we cannot calculate a mean and std from a single input. So the batch normalization layers continuously update a running mean and std during training, and use them at prediction time.
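
A minimal sketch of this behavior with PyTorch’s nn.BatchNorm2d (the shapes and values are just for illustration):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)           # one pair of gamma/beta per channel
x = torch.randn(8, 3, 32, 32)    # a mini-batch of 8 images

bn.train()                       # training mode: normalize with batch statistics
y = bn(x)                        # and update running_mean / running_var
print(bn.running_mean, bn.running_var)

bn.eval()                        # eval mode: use the stored running statistics,
y = bn(x[:1])                    # so even a single sample can be normalized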

Set bias=False in Conv layer when BatchNorm follows

The constant bias $b$ in the filters of a convolutional layer is completely canceled out when the batch mean $\overline x$ is calculated and subtracted during the batch normalization step (convolutional filters share the same weights and bias across positions). The subsequent learnable shift parameter $\beta$ in the BatchNorm layer serves exactly the same purpose as the original bias $b$: to shift the distribution.
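
For example, a typical Conv-BN-Activation block could be written as follows (the channel sizes here are arbitrary):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant here
    nn.BatchNorm2d(128),    # beta plays the role of the bias
    nn.ReLU(inplace=True),
)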

Layer Normalization

Layer normalization normalizes the inputs across all features of each individual data sample. It’s very common in the transformer architecture. Basically, layer normalization is the standardization of embedding vectors. (The embedding is the feature vector of each token; each dimension is treated as a feature.)

In the transformer architecture, the sequence length is not fixed, so batch normalization is not applicable. Layer normalization is independent of the varying batch and sequence lengths. The math formula for layer normalization is the same as for batch normalization, and each feature gets a pair of $\gamma$ and $\beta$.
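
A small sketch with PyTorch’s nn.LayerNorm (the embedding size of 512 is just an assumption):

import torch
import torch.nn as nn

d_model = 512                      # embedding size
ln = nn.LayerNorm(d_model)         # one pair of gamma/beta per embedding dimension
x = torch.randn(2, 10, d_model)    # (batch, sequence length, embedding)
y = ln(x)                          # each token's embedding vector is normalized on its own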

In the original transformer paper, the placement of layer normalization is called Post-LN. Modern implementations use Pre-LN, which provides a clearer gradient path.

# Post-LN
x = LN(x + Attention(x))
x = LN(x + FeedForward(x))

# Pre-LN, preferred!
# normalized input for each sub-layer in attention block
x = x + Attention(LN(x))
x = x + FeedForward(LN(x))

Root Mean Square Layer Normalization (RMSNorm)

Regularization

Regularization techniques help prevent overfitting by constraining or adding noise to the learning process, encouraging models to learn more generalizable patterns rather than memorizing the training data.

L1 & L2

L1 regularization adds a penalty, the sum of the absolute values of all weights in the model, to the loss function.

$$loss=\textit{data loss} + \lambda\cdot\sum_i{\left|w_i\right|}$$

L2 regularization adds a penalty, the sum of the squared values of all weights in the model, to the loss function. L2 regularization is also called weight decay.

$$loss=\textit{data loss} + \cfrac{\lambda}{2}\cdot\sum_i{w_i^2}$$

$\lambda$ is the penalty strength of regularization, which is a hyperparameter as well.

L1 regularization can drive weights exactly to zero; L2 regularization generally cannot. Both L1 and L2 regularization push the model to learn smaller weights, which lowers its sensitivity and keeps the network’s response smooth. Imagine that there are infinitely many sets of weights with which the model performs well, and with L1 or L2 regularization we choose the one with smaller weights.

PyTorch does not have built-in L1 regularization in the same way it has L2 regularization, but you can implement it easily.
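
A minimal sketch of adding the L1 penalty to the loss by hand; the tiny model, data, and the l1_lambda value are all placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # a tiny stand-in model
x, y = torch.randn(16, 10), torch.randn(16, 1)

data_loss = nn.functional.mse_loss(model(x), y)
l1_lambda = 1e-5                               # penalty strength (hyperparameter)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + l1_lambda * l1_penalty      # loss = data loss + lambda * sum|w|
loss.backward()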

Weight Decay in SGD, Adam, AdamW Optimizers

In PyTorch, weight decay (L2 regularization, default 0) matches its mathematical definition in the SGD and AdamW optimizers, but not exactly in the Adam optimizer (even though the parameter is still called weight decay).

SGD is a non-adaptive gradient optimizer: each parameter is updated by its current gradient. Adam and AdamW are adaptive gradient (or adaptive learning rate) optimizers: the current gradient of each parameter is rescaled individually based on its gradient history.

In the Adam optimizer, weight decay is first applied to the current gradient, and the decayed gradient then goes through the adaptive mechanism before it is used to update the parameters. The decoupled version, AdamW (which decouples weight decay from the adaptive gradient calculation), computes the adaptive gradient from the current gradient only, and then updates the parameters using the adaptive gradient plus the weight decay term.
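
In PyTorch this is just a constructor argument; the learning rates and decay strengths below are placeholder values:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # a tiny stand-in model

# SGD: weight_decay here is exactly the classic L2 regularization.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Adam: weight decay is coupled, i.e. folded into the gradient before the adaptive step.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# AdamW: weight decay is decoupled and applied directly to the weights at the update.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)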

Dropout

Dropout is an extremely effective, simple regularization technique. During training, dropout is implemented by setting each neuron to zero with some probability $p$ (a hyperparameter) and keeping it active otherwise. Dropout can be interpreted as sampling a neural network within the full neural network, and only updating the parameters of the sampled network based on the input data. During testing no dropout is applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks. (However, the exponential number of possible sampled networks are not independent because they share the parameters.)

The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.

Dropout might be one of the easiest-to-use regularization techniques for preventing overfitting in neural networks. However, where to put the dropout layer is a bit tricky. The most common place is after the activation. In CNNs dropout is less common, in order to preserve spatial information; if you want to apply dropout in a CNN, do it after the flatten layer. In attention blocks, dropout can be applied to the attention weights and/or the output.
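
As an illustration, here is a sketch of a CNN classifier head with dropout placed after the flatten layer and after the activation (the sizes are arbitrary):

import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(512, 256),   # assumes 512 flattened features
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 10),
)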

Dropout is applied element-wise to tensors, regardless of their shape; each element is treated independently, so dropout can be applied to tensors of any shape. Dropout not only randomly sets elements to zero, it also rescales the remaining elements by $\frac{1}{1-p}$ in order to keep the expectation unchanged ($p$ is the dropout probability).

>>> a = torch.tensor((1,2,3,4),dtype=torch.float32)
>>> dropout = nn.Dropout(0.5)  # p=0.5
>>> dropout(a)
tensor([0., 4., 6., 0.])

We can think of dropout as a way to add noise to the input of some hidden layers, but keep the whole expectation unchanged.

Dropout and BatchNorm

Modern CNN architectures, such as ResNet, don’t use dropout at all. When dropout is used in conjunction with BatchNorm, it can sometimes lead to degraded performance or slower convergence. Dropout introduces its own noise by randomly setting activations to zero, which disrupts the distribution of the activations, exactly what BN is trying to stabilize. Furthermore, the BatchNorm layer calculates its mean and variance statistics based on the layer input after dropout. If many inputs are zeroed out, the batch statistics become less reliable estimators of the true population statistics, which can hurt the model’s ability to generalize during inference (when BatchNorm uses its fixed running mean and variance). Put simply, element-wise dropout is not well suited to convolutional layers, which also make intensive use of BatchNorm.

Spatial Dropout

Conventional dropout is element-wise, while spatial dropout is channel-wise: the zeroing and the rescaling of what remains happen on whole channels.

>>> a = torch.ones(1,4,2,2)      # all 1
>>> dropout = nn.Dropout2d(0.5)  # spatial dropout
>>> dropout(a)
tensor([[[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[2., 2.],
          [2., 2.]],

         [[2., 2.],
          [2., 2.]]]])

In convolutional layers, spatial dropout can be more effective than element-wise dropout. Element-wise dropout is not well suited to convolutions; one reason is the under-dropping problem (information leakage). Because nearby pixels are highly correlated, the removed information can be recovered in the following layers, much like interpolation. Spatial dropout avoids this issue by dropping whole channels. Like element-wise dropout, the best place for spatial dropout is still after the activation.

Conv2d → BatchNorm → Activation → Dropout2d

However, modern SOTA CNN architectures, such as ResNet, don’t use spatial dropout at all. This might be because batch normalization has reduced the need for dropout thanks to its own regularization effect.

DropBlock

DropPath (Stochastic Depth)

Data Augmentation

By injecting noise into the training data, data augmentation techniques force models to learn more general features and consequently improve their generalization capability. Data augmentation also helps prevent overfitting by enlarging the training data in a targeted way. It is a very useful and helpful technique, especially when training data is not sufficient.

There are lots of data augmentation tricks, but you don’t need to apply all of them. The basic idea is to pick the augmentations that are helpful for your application. E.g. when working on the CIFAR-10 dataset, there are no upside-down images in the test set, so random vertical flips are completely unnecessary: the model is not expected to see any upside-down images. If you are working on improving robustness, you might need to apply some natural corruptions to the training images.

cifar10_mean = (0.4914, 0.4822, 0.4465)
cifar10_std  = (0.2470, 0.2435, 0.2616)

# This is the setting in the original paper of ResNet
aug_transform = transforms.Compose(
    [
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(cifar10_mean, cifar10_std)
    ]
)

Early Stop

Early stopping is a classic technique, and is generally considered a form of regularization for training deep learning models, though it’s somewhat different from traditional regularization techniques: it regularizes implicitly by limiting the number of optimization iterations. Early stopping is also a technique for potentially saving time and money.

By evaluating the loss on the validation set, we can stop the training process when there has been no improvement for a fixed number of epochs or iterations. Besides the validation loss, the gap between training and validation loss can also be a useful signal. Plateau learning rate decay is often employed together with early stopping.
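
A minimal early-stopping sketch based on a patience counter; the routines train_one_epoch and evaluate, and the values of patience and max_epochs, are all hypothetical:

best_val_loss = float("inf")
patience, bad_epochs = 10, 0

for epoch in range(max_epochs):
    train_one_epoch(model)            # assumed training routine
    val_loss = evaluate(model)        # assumed validation routine
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0                # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:    # no improvement for `patience` epochs
            print(f"early stop at epoch {epoch}")
            break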

Learning Rate Scheduling

Learning Rate Warm-up

When training models, especially large ones like ResNet-50 with more than 25M parameters, it is common to run into numerical instability in the early stage of training due to the randomly initialized weights. We can often overcome this issue by using a very small learning rate for the first few epochs and gradually increasing it until it reaches the target value. The most common strategy is linear learning rate warm-up.

Learning rate warm-up is not a regularization method. Its main purpose is to stabilize the training process, not to prevent overfitting. It addresses issues like noisy gradients from random weight initialization, allowing the model to converge more reliably without exploding or vanishing updates. This ramp-up phase helps the optimizer “settle” before applying the full learning rate, but it doesn’t directly constrain the loss function or model complexity. While not regularization per se, learning rate warm-up can contribute to better generalization in practice. By enabling higher peak learning rates without instability, it allows the model to escape poor local minima more effectively, which might lead to flatter minima (associated with better generalization).
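
One way to express a linear warm-up in PyTorch is a LambdaLR scheduler; the warm-up length and the stand-in model below are just placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                    # a tiny stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

warmup_epochs = 5
def warmup(epoch):
    # scale the target lr linearly from 1/warmup_epochs up to 1.0
    return min(1.0, (epoch + 1) / warmup_epochs)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup)

for epoch in range(10):
    # ... one epoch of training would go here ...
    scheduler.step()   # lr over the first epochs: 0.02, 0.04, 0.06, 0.08, 0.1, then constant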

Gradient Max Norm

It’s the Euclidean norm of the gradient; think of it as the length of the gradient vector. Training a model is like going down a hill. Imagine we are very close to a flat minimum area: the length of the gradient should not be very large, because there are no very steep slopes. A large gradient combined with a high learning rate is more likely to cause instability in training, so setting a max norm for the gradient during training is a very useful safeguard.
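
In PyTorch this is usually done with clip_grad_norm_ right before the optimizer step; the toy model, data, and the max_norm value are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # a tiny stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 10), torch.randn(16, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale the gradient if its overall Euclidean norm exceeds max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
optimizer.step()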

Another perspective is the average loss value: the higher the loss, the higher the chance of large gradients. A high loss means you are still far away from a flat minimum area, and there might be very steep slopes ahead of you, so be careful when your step is relatively large (high learning rate). E.g. when training ResNet-18 on CIFAR-10, the loss drops below 2 after the first epoch; when training ResNet-50 on ImageNet-1K, the loss might still be above 2 after the first 10 epochs.

Reproduce ResNet-50 on ImageNet-1K

The training recipe in the original paper for ResNet-50 (a sketch of the corresponding PyTorch setup follows the list):

  • batch size: 256
  • weight decay: 1e-4
  • optimizer: SGD with momentum 0.9
  • learning rate 0.1 for first 30 epochs
  • learning rate 0.01 for second 30 epochs
  • learning rate 0.001 for last 30 epochs
  • top-1 accuracy on validation dataset: 76.13%
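
One way this recipe could be expressed in PyTorch, using a MultiStepLR schedule for the 10x drops at epochs 30 and 60:

import torch
from torchvision.models import resnet50

model = resnet50()
# SGD with momentum and weight decay, as in the recipe above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Learning rate: 0.1 -> 0.01 at epoch 30 -> 0.001 at epoch 60.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)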

The first time I tried to train ResNet-50 on ImageNet-1K, the loss became NaN at epoch 4 every time. After applying Kaiming initialization, the loss became NaN at epoch 1, which was better in the sense that I saved a lot of time before getting the NaN result. I then set a gradient max norm of 2 to avoid the issue. However, the final accuracy on the validation set was 74.5%, a bit lower than the roughly 76% reported. It might be because the max norm of 2 is a bit too aggressive in constraining the gradients.

Using stabilizers like learning rate warm-up or gradient clipping (max norm) to fix NaN losses does not inherently mean it’s no longer a reproduction. The purpose should not be strictly bit-for-bit identical runs, but equivalent results. The original paper focused on the architecture, and it didn’t report any details about debugging the training process. Furthermore, today’s hardware is different, and the randomness can never be made identical anyway. We should focus on the results and the high-level process. Fixing NaN is more like “debugging your environment”, not altering the model’s learning capability.

Learning Rate Decay

Learning rate decay is a technique in training neural networks where you gradually reduce the learning rate over time during the optimization process.

At the start of training, a larger learning rate helps you make rapid progress toward a good solution. But as you get closer to the optimal parameters, that same large learning rate can cause the optimizer to overshoot or oscillate around the minimum rather than settling into it. By reducing the learning rate, you allow the model to make finer adjustments and converge more smoothly.

The choice of decay schedule can significantly impact final model performance. Too aggressive a decay might cause premature convergence to a suboptimal solution, while too slow a decay wastes training time. With adaptive learning rate optimizers such as Adam or AdamW, the exact decayed learning rate values are less sensitive than with SGD, but they still matter!
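
Two common decay schedules in PyTorch, as a sketch (the model, optimizer and schedule lengths are placeholders; you would normally pick one scheduler per training run):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                     # a tiny stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# 1) Step decay: multiply the learning rate by 0.1 every 30 epochs.
step = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# 2) Cosine decay: smoothly anneal the learning rate towards zero over 90 epochs.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)

for epoch in range(90):
    # ... training for one epoch would go here ...
    cosine.step()                            # call the chosen scheduler once per epoch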

Reference

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456). PMLR.
  • Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
  • Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.
  • Lee, S., & Lee, C. (2020). Revisiting spatial dropout for regularizing convolutional neural networks. Multimedia Tools and Applications, 79(45), 34195-34207.