While training a model, a few important mathematical tricks have to be employed. Normalization and regularization techniques are among them. They are essential for training models successfully, yet they are often confusing and their details can be hard to pin down. Learning rate scheduling is also very important when training models, especially large ones.
Normalization vs Regularization
To put it simply, normalization is applied to data. Before training a model, the data should be normalized. The benefits are compelling: it stabilizes the training process, speeds up model convergence significantly, and improves the model's generalization capability.
Regularization is a model tuning technique whose main purpose is to prevent overfitting during training. It can be applied to the training data or add constraints to the learnable parameters. It also has the effect of improving the model's generalization. Regularization techniques are very flexible.
Cola and Tom
Both normalization and regularization can be employed in the model training process, and sometimes their effects partially cancel each other out.
Generalization
Classical generalization theory suggests that to close the gap between train and test performance, we should aim for a simple model. In what senses can a model be simple:
- fewer parameters (lower dimensions)
- smoothness (not sensitive to small changes)
However, modern deep learning models keep getting larger, which challenges this classical theory. Some network architectures and regularization techniques can arguably justify over-parameterized models. For example, skip connections might bypass a whole block, making the parameters inside the skipped block effectively unused; and dropout does not use all parameters during training.
Underfit and Overfit
- overfit: the model fits the training data perfectly, but fails to predict unseen data.
- underfit: the opposite of overfitting; the model is not trained enough, or is too small to learn enough.
In deep learning, a few common reasons could cause overfitting:
- insufficient training data
- low quality of training data (contain errors)
- a model that is too large or complex
- lack of regularizations
When you change the task, e.g. turn a binary segmentation task into a multi-class segmentation task with finer labels on the training data, you might observe overfitting even though the quantity of training data and the model size stay exactly the same.
Curse of Dimensionality
The curse of dimensionality refers to a set of problems that arise when dealing with data in high-dimensional spaces (i.e., datasets with a large number of features or dimensions, such as image data). Typically, the volume of the data space grows exponentially with the number of dimensions. With a fixed number of data points, these points become increasingly sparse and spread out in a high-dimensional space. The curse of dimensionality significantly increases the risk of overfitting. In a high-dimensional space with sparse data, a complex model has more “degrees of freedom” to fit the training data perfectly, including the noise. Because there are fewer data points relative to the number of dimensions, the model can easily find spurious patterns that don’t generalize. (The size of the dataset is related to the data’s dimension.)
About high-dimensional space, I recently had an idea: its density might still be uniform everywhere, just like in 2D or 3D spaces.
Benign Overfitting
Interestingly, modern deep learning has challenged traditional views on overfitting. Very large models trained on large datasets can sometimes generalize well even when they have enough capacity to memorize the entire training set. This phenomenon, sometimes called benign overfitting, suggests the relationship between model size, data, and overfitting is more complex than previously thought.
Benign overfitting is a phenomenon in machine learning where a model perfectly fits (interpolates) the training data, including its noise or random fluctuations, yet still generalizes well to unseen test data. This concept challenges the traditional view in statistical learning theory, which posits that fitting the noise in the training data (overfitting) should lead to a significant deterioration in performance on new data.
Benign overfitting is typically observed in over-parameterized models, such as modern deep neural networks, where the number of parameters is much larger than the number of data points. It is closely related to the double descent phenomenon.
I reckon it's related to the very high-dimensional space the model creates. If there are enough dimensions and the training data is representative of the general data, then even with noise in the training data, the high-dimensional space can absorb that noise and still exhibit decent generalization.
Normalization
Normalization is applied to data.
Data Standardization
Other terms are feature scaling and static normalization. There are two main ways:
- Min-Max scaling. Scale feature values to $[0,1]$ range. $\left(x_s=\cfrac{x-x_{min}}{x_{max}-x_{min}}\right)$
- Standardization (Z-score). Rescale each feature to mean $0$ and standard deviation $1$. $\left(z = \cfrac{x-\overline x}{s}\right)$, where $\overline x$ is the feature mean and $s$ its standard deviation. Note this does not force the values into a fixed range or a normal distribution; most values simply end up within a few units of zero. Both scalings are sketched in code below.
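A minimal sketch of both scalings in PyTorch, applied per feature (column) to a toy 2D tensor (the values are made up for illustration):

import torch

x = torch.tensor([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0]])  # 3 samples, 2 features with very different scales

# Min-Max scaling per feature (column): maps each feature into [0, 1]
x_minmax = (x - x.min(dim=0).values) / (x.max(dim=0).values - x.min(dim=0).values)

# Z-score standardization per feature: mean 0, standard deviation 1
x_zscore = (x - x.mean(dim=0)) / x.std(dim=0, unbiased=False)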
After standardization, each feature loses its real-world unit and is squeezed into a similar small range. The benefits of data standardization:
- Ensures each feature contributes equally. (Different features might have different ranges and units.)
- Stabilizes the training process.
- Speeds up the model's convergence.
- Improves generalization and robustness.
It's generally a good idea to integrate the data standardization step into the model: create the model with the mean and standard deviation pre-computed on the training data stored as non-learnable registered buffers (PyTorch). This simplifies the code and makes the model easier to deploy, e.g. when exporting to the ONNX format.
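A minimal sketch of baking the statistics into the model as registered buffers; the backbone here is just a placeholder:

import torch
from torch import nn

class ModelWithNorm(nn.Module):
    def __init__(self, mean, std):
        super().__init__()
        # pre-computed on the training split; saved with state_dict, never trained
        self.register_buffer("mean", torch.tensor(mean).view(1, -1, 1, 1))
        self.register_buffer("std", torch.tensor(std).view(1, -1, 1, 1))
        self.backbone = nn.Sequential(       # placeholder backbone
            nn.Conv2d(3, 16, 3, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(),
        )

    def forward(self, x):                    # x: raw images scaled to [0, 1]
        x = (x - self.mean) / self.std       # standardization happens inside the model
        return self.backbone(x)

# e.g. model = ModelWithNorm((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))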
Data Leakage
The most common pitfall of data normalization is data leakage. This happens when the data normalization is done before data splitting (split into train, validation and test).
There are two “correct” sequences to do data normalization:
- (very common) First split the data into train and test parts, then compute the mean and standard deviation (std) on the whole train part, apply them to the test part, and finally split the train part into train and validation.
- (more rigorous) First split the data into train, validation and test parts. Compute the mean and std on the train part only, then apply them to the validation and test parts.
Approach 1 is acceptable because the difference is minimal and both the train and validation datasets are used for model training and tuning; performance on the validation dataset is not the final interest. Both approaches keep the test dataset out of the mean and std calculation. That's important! I think another reason why approach 1 is so common is that it's more convenient and requires less coding. E.g. the CIFAR-10 mean and std values found on the Internet are calculated on the whole train part, i.e. approach 1.
# They are calculated from the whole train part.
# RGB channels, each channel is treated as a feature.
# Rescale [0,255] to [0,1] and then calculate the z-score!
cifar10_mean = (0.4914, 0.4822, 0.4465)
cifar10_std = (0.2470, 0.2435, 0.2616)
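A rough sketch of how such per-channel statistics could be computed on the train split with torchvision (this loads the whole train set into memory, which is fine for CIFAR-10):

import torch
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())  # rescales to [0, 1]
# stack all images into one tensor of shape (N, 3, 32, 32)
data = torch.stack([img for img, _ in train_set])
mean = data.mean(dim=(0, 2, 3))  # per-channel mean over all pixels of all images
std = data.std(dim=(0, 2, 3))    # per-channel std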
Validation Dataset
How do you decide how many data points to put into the validation dataset? Generally speaking, the smaller the dataset, the larger the percentage should be. The key point is that your validation dataset must be representative of the whole dataset. The same holds for the test dataset.
For the ImageNet-1K dataset, people normally use the validation part as test data since the labels for the test part are not publicly available. Training can also be done without carving a validation set out of the train part, as in the original ResNet paper, because the training schedule was fixed in advance.
Batch Normalization (BN)
Batch normalization [1] is not for data pre-processing. It's employed as a learnable layer in neural networks, especially CNNs. To stabilize and speed up model training, batch normalization is commonly applied right after convolutional layers.
Deep neural networks have multiple layers, and after each layer's computation the distribution of the next layer's input shifts. Since we know the benefits of data standardization and apply it to the input data, why not apply it to the inputs of the hidden layers too? That's the intuition behind batch normalization.
$$y=\gamma\cdot\cfrac{x-\overline x}{\sqrt{s^2+\epsilon}}+\beta$$
The input is $x$, and $y$ is the output. Batch normalization calculates $\overline x$ and $s$ on each batch. $\epsilon$ is a small fixed number that prevents division by zero. $\gamma$ and $\beta$ are two learnable parameters.
Batch normalization is usually placed before the activation layer; experiments show better performance with this ordering.
Conv → BatchNorm → Activation (ReLU/SiLU/GELU etc.)
In CNNs on image tasks, batch normalization is applied on each channel (feature). One pair of $\gamma$ and $\beta$ for each channel. In fully connected layers, batch normalization is also applied on each feature (neuron). One pair of $\gamma$ and $\beta$ for each feature neuron. Batch norm calculates the mean and std on each feature across the batch.
Why is there a pair of learnable parameters for each feature? Because we don't know which input distribution is best for each layer. These two parameters let the model learn the best input distribution, or even undo the batch normalization entirely. Another explanation is that they compensate for the possible loss of representational ability.
Another key point in batch normalization is the running mean and std. They exist because of the difference between training and prediction: training is executed in batches, but at prediction time there is most likely only one input at a time, and we cannot calculate a mean and std from a single input. So the batch normalization layers continuously update running estimates of the mean and std and use them at prediction time. This is also the reason why the batch size can't be too small for BN; this dependence on batch size can be a problem in some cases.
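A small sketch illustrating the train/eval difference: batch statistics (and running-stat updates) in train mode, fixed running statistics in eval mode:

import torch
from torch import nn

bn = nn.BatchNorm1d(num_features=4)

bn.train()
for _ in range(100):
    bn(torch.randn(32, 4) * 2.0 + 5.0)   # batch stats used; running stats updated
print(bn.running_mean, bn.running_var)    # converge towards roughly 5.0 and 4.0

bn.eval()
y = bn(torch.randn(1, 4))                 # a single sample works: running stats are used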
Set bias=False in Conv layer when BatchNorm follows
The constant bias $b$ in the filters of a convolutional layer is completely cancelled out when the batch mean $\overline x$ is subtracted during the batch normalization step (convolutional filters share the same weights and biases across positions). The subsequent learnable shift parameter $\beta$ in the BatchNorm layer serves exactly the same purpose as the original bias $b$: shifting the distribution.
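A typical Conv → BN → Activation block with the redundant convolution bias disabled:

from torch import nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias would be cancelled by BN
    nn.BatchNorm2d(128),   # its learnable beta provides the shift instead
    nn.ReLU(inplace=True),
)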
Minimum Batch Size (16)
Batch normalization requires choosing a batch size for training. Some say the best batch size ranges from 32 to 128. It depends! But there is a minimum: once the batch size falls below it, BN becomes unstable ($\overline x$ and $s$ are noisy on very small batches). Imagine a batch size of 1: BN essentially stops doing its job during training. When this happens, replace BN with LN or GN, which are independent of batch size. I suggest taking 16 as the minimum batch size, or going for GN.
Performance Comparison between BN and GN on Small Batch Size [6]
Segmentation tasks are more likely to run into this issue. When training models with multiple GPUs, pay attention to the effective batch size. PyTorch supports synchronized batch norm.
BN and DataParallel (DP)
When using DataParallel (DP) in PyTorch, the statistics calculated on each GPU are not synchronized. This is the fundamental reason why Batch Normalization (BN) often fails to work correctly with DP, and why researchers developed SyncBatchNorm or moved to DistributedDataParallel (DDP).
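A minimal sketch of using synchronized BN under DDP; the `wrap_for_ddp` helper is my own, and it assumes the distributed process group has already been initialized:

from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model: nn.Module, local_rank: int) -> nn.Module:
    # replace every BatchNorm layer with SyncBatchNorm so statistics are
    # computed across all GPUs, then wrap the model with DDP
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])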
Layer Normalization (LN)
Layer normalization [2] normalizes the inputs across all features for each individual data sample. It’s very common in transformer-based architecture. Basically, layer normalization is the standardization of embedding vectors. (Embedding is the feature vector for each token. Each dimension is treated as a different feature.)
In transformer-based architectures, the sequence length is not fixed, so batch normalization is not applicable. Layer normalization is independent of the sequence length. The math formula for layer normalization is the same as for batch normalization, and each feature gets its own pair of $\gamma$ and $\beta$.
In the original transformer paper, the placement of layer normalization is called Post-LN. The modern placement is called Pre-LN, which provides a cleaner gradient path.
# Post-LN
x = LN(x + MHSA(x)) # Multi-Head Self-Attention
x = LN(x + FFNN(x)) # Feed Forward Neural Network
# Pre-LN, preferred!
# normalized input for each sub-layer in attention block
x = x + MHSA(LN(x))
x = x + FFNN(LN(x))
BN normalizes data along the columns (axis=0, the first, Batch dimension). LN normalizes data along the rows (axis=1, the second, Channel/Feature dimension).
Applying LN on CNN
When applying LN to CNNs, it is just group normalization (GN) with a single group: all elements of every feature map of a sample are included in the calculation. Think of each feature not as a single number but as a whole group of numbers, with LN treating them all the same.
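In PyTorch this "LN over the whole sample" behaviour on feature maps can be expressed as GroupNorm with a single group:

import torch
from torch import nn

ln_on_cnn = nn.GroupNorm(num_groups=1, num_channels=64)  # equivalent to LN over C, H, W
x = torch.randn(8, 64, 32, 32)
y = ln_on_cnn(x)  # each sample is normalized over all 64*32*32 elements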
Position-wise LN
Position-wise LN treats each spatial position independently and calculates statistics only across channel dimension. Position-wise LN is used in ConvNeXt architecture which replaces BN with this channel-wise LN (applied after depthwise conv), achieving strong performance on ImageNet while being batch-independent. However, the learnable parameters $\gamma$ (scale) and $\beta$ (shift) are shared across all spatial positions to control the number of learnable parameters and keep the translation invariance of CNN.
class PositionwiseLN(nn.Module):
    def __init__(self, n_channel):
        super().__init__()
        self.ln = nn.LayerNorm(n_channel)

    def forward(self, x):
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.ln(x)
        return x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
Below is a faster implementation of position-wise LN:
class FastPositionwiseLN(nn.Module):
    def __init__(self, n_channel):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, n_channel, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, n_channel, 1, 1))

    def forward(self, x):
        # calculate mean and variance across the channel dimension (dim=1)
        mean = x.mean(1, keepdim=True)
        var = x.var(1, keepdim=True, unbiased=False)  # biased, as in the original paper
        # standardization
        x = (x - mean) / torch.sqrt(var + 1e-8)
        return self.gamma * x + self.beta
Instance Normalization (IN)
Instance normalization [7] was designed to improve feed-forward neural style transfer by normalizing features in a way that removes instance-specific style information (like contrast and brightness) while preserving content structure. IN normalizes across the spatial dimensions (Height and Width) for each channel and each sample independently. This makes it completely batch-independent and focuses on per-instance, per-channel statistics. It is particularly effective when the goal is to decouple style from content in images.
The math formula for IN is the same as for BN, but IN calculates the mean and std for each sample and each channel separately. Finally it applies an affine (linear) transformation (learnable $\gamma$ and $\beta$) per channel; all samples in a batch share the same learnable parameters for each channel. It is exactly a channel-wise normalization.
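A minimal usage sketch; note that `nn.InstanceNorm2d` defaults to `affine=False`, so the learnable $\gamma$ and $\beta$ must be enabled explicitly:

import torch
from torch import nn

inorm = nn.InstanceNorm2d(num_features=64, affine=True)  # learnable gamma/beta per channel
x = torch.randn(8, 64, 32, 32)
y = inorm(x)  # each (sample, channel) pair is normalized over its own 32x32 spatial map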
Group Normalization (GN)
Group normalization [6] divides channels into groups and computes normalization statistics within each group independently for each sample. Like LN and IN, it also doesn’t depend on batch size at all. Instead of normalizing across the batch dimension, it normalizes across groups of channels within each individual sample. When group number equals channel number, GN becomes IN. When group number is one, GN is LN (on CNN).
$C$ is channel number, and $G$ is group number. So we could have $K=C/G$ channels for each group. The calculation of mean and std is like:
$$\overline x_g=\cfrac{1}{K\cdot H\cdot W}\sum_{c\in G_g}\sum_{h=1}^{H}\sum_{w=1}^{W}x_{c,h,w}$$
$$s_g^2=\cfrac{1}{K\cdot H\cdot W}\sum_{c\in G_g}\sum_{h=1}^{H}\sum_{w=1}^{W}\big(x_{c,h,w}-\overline x_g\big)^2$$
Researchers usually fix a constant G (group number) and use it in every layer, which means the group size differs from layer to layer because the channel count changes; G=32 is most common. The alternative is to fix a constant group size, giving a different group number per layer. The minimum group size should be 8 or 16 for stable statistics. A small helper for the second option is sketched below.
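A hedged helper sketch (the function name and defaults are my own) that picks the group number from a desired minimum group size:

from torch import nn

def make_gn(num_channels: int, min_group_size: int = 16) -> nn.GroupNorm:
    # hypothetical helper: cap the group count so each group has at least
    # `min_group_size` channels, and make sure it divides the channel count
    num_groups = max(1, num_channels // min_group_size)
    while num_channels % num_groups != 0:
        num_groups -= 1
    return nn.GroupNorm(num_groups, num_channels)

gn = make_gn(256)  # 16 groups of 16 channels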
BN, LN, IN or GN
What’s the same for them:
- calculate mean and std for each feature
- two shared learnable parameters for each feature
- set bias=False in preceding conv layer if affine=True (default)
What’s the differences:
- BN normalizes across the batch dimension
- LN normalizes across the feature dimension
- IN normalizes each channel of each sample on its own
- GN sits between LN and IN
4 different Normalization Layers [6]
Performance comparison on ImageNet with batch size of 32:
Performance Comparison with G=32 [6]
Switch to GN and reduce the batch size to make deeper and wider experiments possible when your VRAM is limited! However, a small batch size might introduce unstable gradients; when that happens, gradient accumulation is super useful! GN removes the dependency on batch size, and gradient accumulation removes the VRAM limitation. I love this combination! :)
Gradient Accumulation
batch_size = 16                   # used to initialize the data loader
virtual_batch_size = 64           # effective (virtual) batch size
accumulation_steps = virtual_batch_size // batch_size

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(training_loader, 1):
    # forward pass
    outputs = model(inputs.to(DEV))
    loss = lossfunc(outputs, labels.to(DEV))
    # scale loss to account for accumulation
    loss = loss / accumulation_steps
    loss.backward()
    # only update weights after enough steps
    if i % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# handle the leftover batches after the loop
if i % accumulation_steps != 0:
    optimizer.step()  # effect: an update with a slightly smaller learning rate
This is an industry-level hack for small VRAM. It doesn't really reduce the real batch size, but it makes training possible when the batch size would otherwise exhaust your VRAM. The last two lines matter when len(training_loader) % accumulation_steps != 0; their effect is an update with a smaller learning rate, because the loss was divided by the accumulation steps. This works perfectly together with drop_last=True in the data loader.
Regularization
Regularization techniques help prevent overfitting by constraining or adding noise to the learning process, encouraging models to learn more generalizable patterns rather than memorizing the training data.
L1 & L2
L1 regularization adds a penalty which is the sum of the absolute value of all weights in a model to the loss function.
$$loss=\textit{data loss} + \lambda\cdot\sum_i{\left|w_i\right|}$$
L2 regularization adds a penalty which is the sum of the squared value of all weights in a model to the loss function. L2 regularization is also called weight decay.
$$loss=\textit{data loss} + \cfrac{\lambda}{2}\cdot\sum_i{w_i^2}$$
$\lambda$ is the penalty strength of regularization, which is a hyperparameter as well.
L1 regularization can drive weights exactly to zero, whereas L2 regularization cannot. Both try to make the model learn smaller weights, which lowers sensitivity and keeps the network's response smooth. Imagine there are infinitely many sets of weights that let the model perform well; L1 or L2 regularization picks the one with smaller weights.
PyTorch does not have built-in L1 regularization in the same way it has L2 regularization, but you can implement it easily.
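A minimal sketch of adding the L1 penalty to the loss by hand; the model, data, and penalty strength are placeholders:

import torch
from torch import nn

model = nn.Linear(10, 2)            # placeholder model
criterion = nn.CrossEntropyLoss()
l1_lambda = 1e-5                    # penalty strength (hyperparameter)

inputs, labels = torch.randn(8, 10), torch.randint(0, 2, (8,))
data_loss = criterion(model(inputs), labels)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + l1_lambda * l1_penalty
loss.backward()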
Weight Decay in SGD, Adam, and AdamW Optimizers
In PyTorch, weight decay (L2 regularization, default 0) matches its mathematical definition in the SGD and AdamW optimizers, but not exactly in the Adam optimizer (even though the parameter is still called weight decay).
SGD is a non-adaptive optimizer: each parameter is updated directly by its current gradient. Adam and AdamW are adaptive gradient (or adaptive learning rate) optimizers: the current gradient of each parameter is rescaled using its gradient history.
In the Adam optimizer, weight decay is first added to the current gradient, and the decayed gradient then goes through the adaptive mechanism before being used to update the parameters. The decoupled version, AdamW [3], separates weight decay from the adaptive gradient computation: it computes the adaptive gradient from the raw gradient only, then updates the parameters with the adaptive gradient and the weight decay applied separately.
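How the weight decay parameter is typically passed in PyTorch; the model and hyperparameter values are placeholders:

import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder model

sgd   = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
adam  = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)   # coupled (L2-style)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # decoupled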
Dropout
Dropout [4] is an extremely effective, simple regularization technique. While training, dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise. During training, Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. During testing there is no dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks. (However, the exponential number of possible sampled networks are not independent because they share the parameters.)
The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.
Dropout might be one of the easiest regularization techniques to use in neural networks to prevent overfitting. However, where to put the dropout layer is a bit tricky. The most common place is after the activation. In CNNs, dropout is less common because it destroys spatial information; if you want to use it there, apply it after the flatten layer. In attention blocks, dropout can be applied to the attention weights and/or the output.
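A small sketch of the usual placement, after the activation (here in a CNN classification head after flattening; the layer sizes are arbitrary):

from torch import nn

mlp_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # after the activation
    nn.Linear(256, 10),
)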
Dropout is applied element-wise to tensors, regardless of their shape; each element is dropped independently, so dropout can be applied to tensor data of any shape. Dropout not only randomly sets elements to zero but also rescales the remaining elements by $\frac{1}{1-p}$ to keep the expectation unchanged ($p$ is the dropout probability).
>>> a = torch.tensor((1,2,3,4),dtype=torch.float32)
>>> dropout = nn.Dropout(0.5) # p=0.5
>>> dropout(a)
tensor([0., 4., 6., 0.])
We can think of dropout as a way to add noise to the input of some hidden layers, but keep the whole expectation unchanged.
Dropout and BatchNorm
Modern CNN architectures, such as ResNet, don't use dropout at all. When Dropout is used in conjunction with BatchNorm, it can sometimes lead to degraded performance or slower convergence. Dropout introduces its own noise by randomly setting activations to zero, which disrupts the distribution of the activations, exactly what BN is trying to stabilize. Furthermore, the BatchNorm layer computes its mean and variance statistics from the layer input after Dropout. If many inputs are zeroed out, the batch statistics become less reliable estimators of the true population statistics, which can hurt the model's ability to generalize at inference time (when BatchNorm uses its fixed running mean and variance). Put simply, element-wise dropout is not a good fit for convolutional layers, which also rely heavily on BatchNorm.
Spatial Dropout
Conventional dropout is element-wise, while spatial dropout [5] is channel-wise: the zeroing and the rescaling of what remains happen on whole channels.
>>> a = torch.ones(1, 4, 2, 2)  # all 1
>>> dropout = nn.Dropout2d(0.5)  # spatial dropout
>>> dropout(a)
tensor([[[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[2., 2.],
          [2., 2.]],

         [[2., 2.],
          [2., 2.]]]])
In convolutional layers, spatial dropout can be more effective than element-wise dropout. Element-wise dropout is not well suited to convolutions; one reason is the under-dropping problem (information leakage): because nearby pixels are highly correlated, removed information can be recovered in the following layer, much like interpolation. Spatial dropout avoids this issue by dropping whole channels. As with element-wise dropout, the best place for spatial dropout is still after the activation.
Conv2d → BatchNorm → Activation → Dropout2d
However, modern SOTA CNN architectures, such as ResNet, don't use spatial dropout at all, probably because batch normalization's own regularization effect reduced the need for dropout.
DropBlock
DropPath (Stochastic Depth)
Data Augmentation
By injecting noise into the training data, data augmentation techniques force models to learn more general features and consequently improve generalization. Data augmentation also helps prevent overfitting by enlarging the training data in a specific (though not fundamental) way. It is a very useful technique, especially when training data is scarce.
There are lots of data augmentation tricks, but you shouldn't apply all of them. The basic idea is to pick the augmentations that actually help your application. E.g. when working on the CIFAR-10 dataset, there are no upside-down images in the test dataset, so random vertical flips are unnecessary, even harmful: the model is never expected to see upside-down images. If you are working on robustness, you might instead apply some natural corruptions to the training images.
Below is the data augmentation used in the original paper of ResNet. There is no vertical flip!
cifar10_mean = (0.4914, 0.4822, 0.4465)
cifar10_std = (0.2470, 0.2435, 0.2616)

# This is the setting in the original paper of ResNet
aug_transform = transforms.Compose(
    [
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(cifar10_mean, cifar10_std),
    ]
)
Early Stop
Early stopping is a classic technique and is generally considered a form of regularization for deep learning training, though it is somewhat different from traditional regularization techniques: it regularizes implicitly by limiting the number of optimization iterations. It can also save time and money.
By evaluating the loss on the validation dataset, we can stop the training process when there has been no improvement for a fixed number of epochs or iterations. Besides the validation loss, the gap between the training and validation losses can also be useful. Plateau-based learning rate decay is often employed together with early stopping. A minimal patience-based sketch follows.
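A minimal patience-based sketch, assuming `train_one_epoch` and `evaluate` helpers plus the usual model/loader objects already exist:

import torch

best_val_loss = float("inf")
patience, bad_epochs = 10, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)      # assumed training helper
    val_loss = evaluate(model, val_loader)    # assumed validation helper
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"early stop at epoch {epoch}")
            break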
Label Smoothing
Learning Rate Scheduling
Learning Rate Warm-up
While training models, especially large ones like ResNet-50 with more than 25M parameters, it is common to run into numerical instability in the early stage of training due to the randomly initialized weights. We can often overcome this by using a very small learning rate for the first few epochs and gradually increasing it until it reaches the target value. The most common strategy is linear learning rate warm-up.
Learning rate warm-up is not a regularization method: its main purpose is to stabilize the training process, not to prevent overfitting. It addresses issues like noisy gradients from random weight initialization, allowing the model to converge more reliably without exploding or vanishing updates. This ramp-up phase helps the optimizer "settle" before applying the full learning rate, but it doesn't directly constrain the loss function or model complexity. While not regularization per se, warm-up can contribute to better generalization in practice: by enabling higher peak learning rates without instability, it helps the model escape poor local minima, which might lead to flatter minima (associated with better generalization).
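A hedged sketch of linear warm-up over the first few epochs using `LambdaLR`; the warm-up length, base LR, and placeholder model are arbitrary:

import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

warmup_epochs = 5
def warmup(epoch):
    # linearly ramp the LR factor from 1/warmup_epochs up to 1.0, then hold it
    return min(1.0, (epoch + 1) / warmup_epochs)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

for epoch in range(90):
    # ... train one epoch ...
    scheduler.step()

In practice this warm-up would usually be chained with a decay schedule afterwards (e.g. via SequentialLR), which is left out here.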
Clip Gradient by Max Norm
While training a model, we can clip gradients based on a max norm. It is used as a safeguard against gradient explosion. Training a model is like walking down a hill: imagine we are very close to a flat minimum; the gradient there should not be very large because there are no steep slopes. A large gradient combined with a high learning rate is much more likely to destabilize training.
Another perspective is the average loss value: the higher the loss, the higher the probability of a large gradient. A high loss means you are still far from any flat minimum, and there might be very steep slopes ahead. So be careful when your step is relatively large (high learning rate). E.g., when training ResNet-18 on CIFAR-10, the loss drops below 2 after the first epoch, whereas when training ResNet-50 on ImageNet-1K, the loss might still be above 2 after the first 10 epochs.
The max norm here is just the Euclidean (L2) norm of the gradient; think of it as the length of the gradient vector in a high-dimensional space. Clipping by max norm rescales all individual gradient values and is the standard way to clip gradients. It is better than clipping individual gradient elements, because it treats the model as a whole and preserves the gradient direction. Call it after loss.backward() and before optimizer.step().
# set gradient max_norm to 5.0
# calculate total_norm of gradient, if total_norm > max_norm,
# scale all gradient values by max_norm/total_norm
nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=5.0)
Reproduce ResNet-50 on ImageNet-1K
The training recipe in the original paper for ResNet-50 (post-activation):
- batch size: 256
- 8 GPUs without SyncBatchNorm (so BN statistics are computed over 32 samples per GPU)
- weight decay: 1e-4
- optimizer: SGD with momentum 0.9
- learning rate 0.1 for first 30 epochs
- learning rate 0.01 for second 30 epochs
- learning rate 0.001 for last 30 epochs
- top-1 accuracy on validation dataset: 75.3% (single-crop)
The first time I tried to train ResNet-50 on ImageNet-1K, the loss became NaN at epoch 4 every time. After applying Kaiming initialization, the loss became NaN at epoch 1, which at least saved me a lot of time getting to the NaN. Then I set a gradient max norm of 2 to avoid the issue. The final accuracy on the validation dataset was 74.5%, just a bit below the reported ~75.3%, although my implementation is the pre-activation version.
Using stabilizers like learning rate warm-up or gradient clipping (max norm) to fix NaN losses does not mean it is no longer a reproduction. The goal should not be strictly bit-for-bit identical runs, but equivalent results. The original paper focused on the architecture and didn't report any details about debugging the training process. Furthermore, today's hardware is different and the randomness can never be identical anyway. We should focus on the results and the high-level process. Fixing NaNs is more like debugging your environment, not altering the model's learning capability.
Learning Rate Decay
Learning rate decay is a technique in training neural networks where you gradually reduce the learning rate over time during the optimization process.
At the start of training, a larger learning rate helps you make rapid progress toward a good solution. But as you get closer to the optimal parameters, that same large learning rate can cause the optimizer to overshoot or oscillate around the minimum rather than settling into it. By reducing the learning rate, you allow the model to make finer adjustments and converge more smoothly.
The choice of decay schedule can significantly impact final model performance: too aggressive a decay might cause premature convergence to a suboptimal solution, while too slow a decay wastes training time. With Adam or AdamW, which are adaptive learning rate optimizers, the exact decayed values are less sensitive than with SGD, but they still matter! The classic step-decay schedule from the ResNet recipe above is sketched below.
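A sketch of that step-decay schedule with `MultiStepLR`: LR 0.1 for the first 30 epochs, 0.01 for the next 30, 0.001 for the last 30 (the model here is a placeholder):

import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... train one epoch ...
    scheduler.step()  # lr: 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89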
Reference
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456). PMLR.
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.
- Lee, S., & Lee, C. (2020). Revisiting spatial dropout for regularizing convolutional neural networks. Multimedia Tools and Applications, 79(45), 34195-34207.
- Wu, Y., & He, K. (2018). Group normalization. In Proceedings of the European conference on computer vision (ECCV) (pp. 3-19).
- Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.