Convolutional neural networks (CNNs) have revolutionized image-related Artificial Intelligence (AI) tasks. This family of network architectures has achieved huge success. CNNs are also fascinating to work with: they offer many design elements and variants to experiment with. Academic research may pursue purity and emphasize the unique feature being proposed, but in industrial applications there is no need to keep a network pure; it is perfectly reasonable to combine all the necessary design elements. To do that well, however, we should have a deep understanding of each of these design elements.

A neural network is a complicated mathematical function. A deep learning model is a combination of all kinds of basic operations, trained by the back-propagation algorithm. Design network architectures based on your data and application!

CNN’s Inductive Bias

Inductive biases are the built-in assumptions of each specific neural network architecture that make learning possible. They are prior knowledge baked into a learning model that guides how it interprets the data it sees. There are infinitely many hypotheses that could fit the training data; inductive biases narrow down the search space and guide the learning process. Interestingly, CNNs and Vision Transformers (ViT) have completely different inductive biases, even though both can be applied to image tasks. Below are the main inductive biases built into CNN architectures:

  • Spatial Locality, which assumes that nearby pixels in an image are more related and important to each other than distant pixels. This is implemented by small receptive fields: typically 3x3 or 5x5 kernels (filters) that only touch local neighborhoods.
  • Hierarchical Feature Learning, which means a stack of convolutional layers learns features hierarchically, from small and simple ones to large and complex ones. This is realized by multiple convolutional layers stacked between non-linear layers and pooling layers in the network [15].
  • Translation Invariance, which means the convolution operation is applied in exactly the same way to every region of the image. This is achieved by the shared weights and bias of each filter.
  • Translation Equivariance, which means that when the input object shifts, its response in the feature map shifts by the same amount: $f(T(x))=T(f(x))$. This is a desired characteristic for object detection. (A quick numerical check follows this list.)
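
A quick numerical check of translation equivariance (a minimal sketch; circular padding is an added assumption here so that the equality holds exactly at the borders, and these imports are shared by the later snippets in this post):

import torch
from torch import nn
import torch.nn.functional as F   # torch, nn and F are assumed by all snippets below

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
x = torch.randn(1, 1, 8, 8)
shift = lambda t: torch.roll(t, shifts=2, dims=-1)   # circular shift along the width
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))   # True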

AlexNet

Modern CNN architecture started with AlexNet [1] in 2012. We can think of it as an enlarged version of the older LeNet.

  • convolutional layers with filter sizes of 11x11, 5x5 and 3x3
  • ReLU activation function
  • pooling layers (max pooling)
  • final fully connected layers

Classic 3-layer sequence: Conv -> ReLU -> Pooling. Pooling layers are used to reduce resolution, so this design puts a hard limit on network depth.

These are the very basic CNN design elements.

The effects of the pooling layer

  • reduce resolution
  • provide robustness to minor changes in the input image, such as a different viewpoint, size, illumination, rotation or noise. A pooling layer aggregates information over a small region, so small changes within that region often leave the pooled output unchanged (see the quick check below).
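
A minimal sketch of this robustness: perturbing a non-maximum value inside a pooling window leaves the max-pooled output unchanged (the numbers are arbitrary).

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.tensor([[1., 5., 2., 0.],
                  [3., 4., 1., 1.],
                  [0., 2., 6., 3.],
                  [1., 0., 2., 2.]]).reshape(1, 1, 4, 4)
y = x.clone()
y[0, 0, 0, 0] = 4.9          # small change that doesn't exceed the window maximum (5)
print(torch.equal(pool(x), pool(y)))   # True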

ZFNet

In the cornerstone paper [15], the authors demonstrated several important characteristics of CNN networks:

  • hierarchical feature learning
  • feature convergence takes time (high layer feature maps need more epochs to learn and converge)
  • capability of transfer learning
  • the network is actually looking at the object, not just the background (occlusion sensitivity)
  • as the network goes deeper, the number of channels (feature maps) should grow

This paper didn’t just visualize AlexNet (the 2012 ImageNet winner). The authors used their Deconvnet tool to “debug” it. By seeing exactly what AlexNet was struggling with, they made several targeted adjustments that resulted in ZFNet, which won the competition in 2013.

  • Change the first conv layer from 11x11 stride 4 to 7x7 stride 2, which captures much finer detail and smoother features and provides a better foundation for the deeper layers to build upon.
  • Make middle layers wider by increasing width of layer 3 from 384 to 512, increasing width of layer 4 from 384 to 1024, and increasing width of layer 5 from 256 to 512.

Why increase the number of channels as CNN networks go deeper?

  • There is a trade-off between the spatial dimension and the channel dimension (a.k.a. the width of the convolution).
  • By increasing the channels as the spatial size decreases, the increase in width compensates for the loss of spatial resolution and maintains the network’s capacity to represent complex patterns.
  • Low-level features are small and simple, so not many feature types (maps) are needed. High-level features are more complex, and at that depth we care more about “what” than “where”. Like a dictionary: we only need a few letters, but we have a great many words. One insight is that even though there are many feature maps in the final conv layer, for a specific input only a few of them are strongly activated (channel sparsity). (The shape sketch below illustrates the trade-off.)
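
A quick sketch of this trade-off, assuming ResNet-style stage shapes (an illustration, not any particular network): halving the spatial size while doubling the channel count halves the total number of activations at each step.

x = torch.randn(1, 64, 56, 56)
print(x.shape, x.numel())            # 64 channels, 200704 activations
for c in (128, 256, 512):
    x = nn.Conv2d(x.shape[1], c, kernel_size=3, stride=2, padding=1)(x)
    print(x.shape, x.numel())        # channels double, activations halve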

VGG

Starting with the Visual Geometry Group (VGG) network [2] from Oxford University in 2014, people began to talk about CNN blocks and network families, such as the VGG block and the VGG family. CNN design became both layer-based and block-based. A network family is a series of designs based on the same block design.

  • VGG Block: consecutive 3x3 with padding 1 filters (Conv + ReLU)
  • deep and narrow outperforms shallow and wide (deep means more non-linearity)
  • still has final fully connected layers with a large number of parameters
  • Family: VGG16, VGG19… The suffix indicates the number of learnable layers. (Conventionally, only convolutional and fully connected layers are counted as learnable layers; BN layers are not included.)

Classic CNN architectures such as LeNet and AlexNet employed a classic 3-layer sequence: a convolutional layer with padding to keep the resolution, followed by a non-linear layer, and then a pooling layer to reduce the resolution. Each such 3-layer sequence reduces the spatial resolution (R) by 50%. This design sets a hard limit of $\log_2{R}$ on the number of convolutional layers, and therefore on the depth of the CNN.

In order to solve this issue, the VGG architecture stacks multiple convolutional layers between each downsampling pooling layer. These stacked convolutional layers are all 3x3 with padding 1, which keeps the resolution unchanged. The stacked convolutional layers, combined with the following pooling layer, are called a block in VGG.

Typical VGG Block: Conv (3x3 pad 1) -> ReLU -> … -> Conv (3x3 pad 1) -> ReLU -> MaxPool (2x2). By this design, neural networks can become deeper and contain more non-linear transforms.

With this block design, the VGG network could become deeper than its predecessors and achieve better performance. Moreover, two successive 3x3 convolutional filters cover the same pixel area as one 5x5 filter but with fewer parameters and better performance. In other words, a deep and narrow neural network significantly outperforms its shallow and wide counterpart. Since VGG, the 3x3 convolutional filter with padding 1 has become the gold standard in CNN architecture design.

Two consecutive 3x3 filters cover the same area as one 5x5 filter. The former has 3x3xCx2=18C weights per output channel (C input channels), while the latter has 5x5xC=25C weights. The former also applies two non-linearities, while the latter applies only one.

def vgg_block(num_convs, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
    return nn.Sequential(*layers)
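
A quick shape check of the block above (channel numbers are arbitrary):

blk = vgg_block(num_convs=2, out_channels=128)
print(blk(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 128, 16, 16])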

NiN

In the Network in Network (NiN) [3] design, in order to remove the fully connected layers at the end of the architecture while keeping roughly the same amount of non-linearity, a NiN block was proposed, which consists of a 3x3, 5x5 or 11x11 convolutional layer (as in AlexNet) followed by multiple 1x1 convolutional layers. This design significantly decreases the number of parameters of a CNN. NiN also introduced the Global Average Pooling (GAP) layer at the end of the network to replace the fully connected layers.

  • 1x1 convolutional layer (pointwise conv) to add non-linearity without destroying the spatial structure
  • 1x1 convolutional layer could be interpreted as a fully connected layer for each pixel (small network in big network)
  • employ a global average pooling layer to remove the fully connected layers (only effective with the added non-linearity)

Typical NiN Block: Conv –> 1x1 Conv –> 1x1 Conv. Pooling layers sit between NiN blocks.

def nin_block(out_channels, kernel_size, stride, padding):
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size, stride, padding),
        nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1),
        nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1),
        nn.ReLU())
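
To illustrate how GAP replaces the fully connected layers, here is a minimal sketch of a NiN-style classifier head built from the block above (the 10-class output and input shape are arbitrary choices for illustration):

head = nn.Sequential(
    nin_block(out_channels=10, kernel_size=3, stride=1, padding=1),
    nn.AdaptiveAvgPool2d(1),   # one score map per class reduced to a single value
    nn.Flatten()
)
print(head(torch.randn(1, 256, 5, 5)).shape)   # torch.Size([1, 10])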

GoogLeNet (Inception)

GoogLeNet [4] (the capital middle L honors LeNet) introduced the ideas of network branching and feature concatenation. The block in GoogLeNet is called the Inception block. Instead of having humans decide which convolutional kernel size is best, we let the training process discover it automatically. Between the input and the output of an Inception block there are multiple parallel paths, and each path consists of one or two convolutional layers with different kernel sizes. Data flows through all paths, so a concatenation operation is required at the output. The concatenation does more than make the computation feasible: since all paths share the same input, it also gathers the various features generated by the different paths.

  • network branching
  • feature concatenation
  • bottleneck design

A 1x1 convolutional layer exists in every branch of the Inception block.

class InceptionBlock(nn.Module):
    
    def __init__(self, c1, c2, c3, c4):
        """ c1--c4 are the number of output channels for each branch,
            each branch has different output channel number,
            block output channel: c1 + c2[1] + c3[1] + c4
        """
        super().__init__()
        
        # branch 1, 1x1 conv
        self.branch1 = nn.Sequential(
            nn.LazyConv2d(c1, kernel_size=1),
            nn.ReLU()
        )

        # branch 2, 1x1 conv --> 3x3 conv
        self.branch2 = nn.Sequential(
            nn.LazyConv2d(c2[0], kernel_size=1),
            nn.ReLU(),
            nn.LazyConv2d(c2[1], kernel_size=3, padding=1),
            nn.ReLU()
        )

        # branch 3, 1x1 conv --> 5x5 conv
        self.branch3 = nn.Sequential(
            nn.LazyConv2d(c3[0], kernel_size=1),
            nn.ReLU(),
            nn.LazyConv2d(c3[1], kernel_size=5, padding=2),
            nn.ReLU()
        )

        # branch 4, 3x3 maxpooling --> 1x1 conv
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.LazyConv2d(c4, kernel_size=1),
            nn.ReLU()
        )

    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        return torch.cat((b1,b2,b3,b4), dim=1)
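
A quick check of the concatenation with example channel numbers (the output width is the sum of the four branch widths):

blk = InceptionBlock(c1=64, c2=(96, 128), c3=(16, 32), c4=32)
print(blk(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28]): 64+128+32+32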

The functionalities of 1x1 convolution:

  • Channel dimension modification. It can enlarge or reduce (compress) the number of channels.
  • Add non-linearity if an activation function follows.
  • Reduce computational complexity. This is done by the bottleneck design, such as the 1x1 conv before the 3x3 and 5x5 convs in branches 2 and 3. (The Inception block introduced the bottleneck design; ResNet later formalized the bottleneck block. Generally speaking, small conv filters are more computationally efficient than big ones, e.g., two 3x3 convs are cheaper than one 5x5.)

Bottleneck example:

Task: (256,28,28) --> (128,28,28) by 5x5 conv with padding 2

WITHOUT 1×1 bottleneck:
Cost = 28×28×256×128×5×5 = 642M operations

WITH 1×1 bottleneck (Channel: 256 --> 32 --> 128):
Cost = 28×28×256×32×1×1 + 28×28×32×128×5×5
     = 6.4M + 80.3M = 86.7M operations

Savings: 86.5% reduction! 🎉
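
The same arithmetic in a few lines of Python (multiply-accumulate counts only; bias and activation are ignored):

H = W = 28
no_bottleneck   = H * W * 256 * 128 * 5 * 5
with_bottleneck = H * W * 256 * 32 * 1 * 1 + H * W * 32 * 128 * 5 * 5
print(f"{no_bottleneck/1e6:.1f}M vs {with_bottleneck/1e6:.1f}M, "
      f"{1 - with_bottleneck/no_bottleneck:.1%} saved")
# 642.3M vs 86.7M, 86.5% saved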

It can be shown that the solution space of the Inception block is a strict subspace of the solution space of a single large layer (e.g. a 5×5 conv) operating on a high-dimensional embedding. The Inception block is designed as an efficient, modular and sparse replacement for a single, large convolutional layer: practically superior, yet theoretically less expressive than one giant layer. The split-transform-merge behavior of Inception blocks is expected to approach the representational power of large, dense layers, but at a considerably lower computational complexity.

BN-Inception (V2)

BatchNorm (BN) was first introduced in [16], together with the Inception V2 module. BN-Inception brings two main improvements:

  • add BatchNorm after each conv layer
  • replace 5x5 conv with two consecutive 3x3 conv

Inception (V3)

GoogLeNet reached roughly the same performance as VGG in the ImageNet competition but with far fewer parameters, and it is still more parameter-efficient than some of its successors. In [13]: “Although VGGNet has the compelling feature of architectural simplicity, this comes at a high cost: evaluating the network requires a lot of computation. On the other hand, the Inception architecture of GoogLeNet was also designed to perform well even under strict constraints on memory and computational budget. For example, GoogleNet employed around 7 million parameters, which represented a 9× reduction with respect to its predecessor AlexNet, which used 60 million parameters. Furthermore, VGGNet employed about 3x more parameters than AlexNet.”

ResNet (V1)

As CNN architectures go deeper and deeper, how can we make sure that the function represented by the network becomes strictly more expressive and not just different? The residual block in ResNet [5] solves this problem in an ingenious way by adding a residual connection. In addition, to speed up the training of ever deeper networks, Batch Normalization is added to the residual block after each convolutional layer. Sometimes the skip connection needs a 1x1 convolutional layer (stride 2) to align the channel width and the resolution and make the computation feasible. A residual block looks like a special case of an Inception block, but a significant difference is that the residual block uses addition instead of concatenation, which is exactly why the 1x1 convolutional layer is sometimes necessary. The skip connection also makes gradient flow much easier.

  • skip connection (shortcut, addition not concatenation)
  • BatchNorm after Conv
  • bottleneck block for even deeper architecture
  • strided convolutions replace pooling layers
  • post-activation (activation after addition)
  • stem + multi-stage design: since the network is deeper, blocks are organized into stages; each stage halves the resolution at its beginning and contains multiple blocks

class ResBlock(nn.Module):

    def __init__(self, n_channel, stride=1, use_1x1conv=False):
        """ When the input channel is not the equal output channel,
            use_1x1conv has to be True.
        """
        assert stride in (1, 2)
        super().__init__()

        if use_1x1conv:
            # projection shortcut: increase channels and/or halve resolution
            self.shortcut = nn.Sequential(
                nn.LazyConv2d(n_channel, kernel_size=1,
                                         stride=stride,
                                         bias=False),
                nn.LazyBatchNorm2d()
            )
        else:
            # an empty nn.Sequential acts as an identity function
            self.shortcut = nn.Sequential()

        # 3x3 --> 3x3
        self.seq = nn.Sequential(
            nn.LazyConv2d(n_channel, kernel_size=3,
                                     stride=stride,
                                     padding=1,
                                     bias=False),
            nn.LazyBatchNorm2d(),
            nn.ReLU(),
            nn.LazyConv2d(n_channel, kernel_size=3, padding=1, bias=False),
            nn.LazyBatchNorm2d()
        )

    def forward(self, x):
        # nn.ReLU is part of network, F.relu is just a function.
        return F.relu(self.seq(x)+self.shortcut(x))
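
A quick shape check of a downsampling block at the start of a stage, where the projection shortcut is required (channel numbers are arbitrary):

blk = ResBlock(n_channel=128, stride=2, use_1x1conv=True)
print(blk(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 28, 28])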

In order to improve computational efficiency and reduce the number of parameters, ResNet also introduces a bottleneck block design. This allowed ResNet-50, ResNet-101 and ResNet-152 to be trained effectively.

class BotResBlock(nn.Module):

    def __init__(self, n_channel, stride=1, factor=4, use_1x1conv=False):
        assert n_channel % factor == 0
        assert stride in (1, 2)
        super().__init__()

        if use_1x1conv:
            self.shortcut = nn.Sequential(
                nn.LazyConv2d(n_channel, kernel_size=1,
                                         stride=stride,
                                         bias=False),
                nn.LazyBatchNorm2d()
            )
        else:
            self.shortcut = nn.Sequential()

        # bottleneck: 1x1 --> 3x3 --> 1x1
        # reduce computation and keep channel number
        self.seq = nn.Sequential(
            nn.LazyConv2d(n_channel//factor, kernel_size=1, bias=False),
            nn.LazyBatchNorm2d(),
            nn.ReLU(),
            nn.LazyConv2d(n_channel//factor, kernel_size=3,
                                             stride=stride,
                                             padding=1,
                                             bias=False),
            nn.LazyBatchNorm2d(),
            nn.ReLU(),
            nn.LazyConv2d(n_channel, kernel_size=1, bias=False),
            nn.LazyBatchNorm2d()
        )

    def forward(self, x):
        return F.relu(self.seq(x)+self.shortcut(x))

The two residual block designs above are also called post-activation ResNet (activation after the addition).

Strided Convolution

When the stride is bigger than 1, the convolution is called a strided convolution. In the ResNet design, pooling layers are replaced by strided convolution layers (except in the stem). Typically, stride=2 with padding=1 is used in ResNet to downsample the resolution by 50%.

>>> a = torch.randn(1,3,32,32)
>>> conv = nn.Conv2d(3,3,kernel_size=3,stride=2,padding=1)
>>> conv(a).shape
torch.Size([1, 3, 16, 16])

Pooling layers (like max pooling or average pooling) discard spatial information by taking the maximum or average value over a region. While this works well for reducing dimensions, it throws away potential features. In contrast, a strided convolution reduces the spatial size while simultaneously performing a learned transformation on the data: it is a learned downsampling. It improves the information density as well.

When downsampling with a strided 1x1 convolution in the shortcut connection, 75% of the activations are discarded (only 1 pixel out of every 4 is kept); this is a trade-off in the ResNet design. The main purpose of the shortcut connection is to match the shape and preserve the identity values, not to extract features.

ResNet (V2, Pre-activation)

The core problem is gradient flow in deep networks. In very deep networks, gradients can vanish (become too small) or explode (become too large) as they backpropagate through many layers. ResNet’s skip connections help, but the design details matter a lot. Pre-activation [6] means that the activation (BN → ReLU) comes before the weight layers instead of after the addition.

One variant in [6] is called full pre-activation, which is favored in many modern implementations. The “full” distinguishes it from partial pre-activation designs. It ensures that BatchNorm and ReLU are applied to the block’s input before it branches into:

  • The main residual path (the convolutions).
  • The shortcut path (1×1 projection for downsampling/channel matching).

This way, every convolution (weight layer) in the entire block, on both paths, is preceded by BN → ReLU. There is no ReLU after the residual addition (to preserve clean, identity-like signal propagation). BN and ReLU touch the shortcut path only when downsampling or channel matching is needed; otherwise the shortcut is a pure identity path. In non-full (partial) pre-activation variants, BN → ReLU is applied only to the residual path, and the shortcut (especially the 1x1 projection shortcut) operates on the raw, unnormalized and unactivated input.

A full pre-activation ResNet-50 implementation is given in the Appendix.

ResNet design details:

  • The initial layers are called the stem, which reduces the resolution.
  • ResNet blocks are organized into stages. Resolution reduction only happens at the beginning of each stage.
  • The first block in each stage reduces the resolution by strided convolution, except the first block of the first stage.
  • The bottleneck block is only used in ResNet-50 and above. There are 3 conv layers in a bottleneck block and 2 conv layers in a non-bottleneck (basic) block.
  • The 1x1 convs employed in shortcut paths are not counted in the 50 of ResNet-50 (the count is worked out below).
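
For reference, the 50 counts the stem convolution, the convolutions inside the bottleneck blocks of the four stages, and the final fully connected layer: $1 + (3+4+6+3)\times 3 + 1 = 50$.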

DenseNet

The basic idea of DenseNet [7] is feature reuse. Within each block, every layer receives as input the feature maps of all previous layers. DenseNet uses concatenation, not addition. Each layer contributes a few more feature maps (the growth rate) on top of all the feature maps learned by the previous layers in the block.

  • feature reuse by channel concatenation
  • pre-activation block design
  • transition block to reduce the resolution by an average pooling layer and the channel count by a 1x1 conv
  • the number of feature maps doesn’t have to be doubled every time (progressive increase), which makes DenseNet highly parameter-efficient (a channel-growth check follows the code below)

class DenseBlock(nn.Module):

    def __init__(self, n_layer, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(n_layer):
            self.layers.append(nn.Sequential(
                nn.LazyBatchNorm2d(),
                nn.ReLU(),
                nn.LazyConv2d(growth_rate, kernel_size=3, padding=1, bias=False)
            ))

    def forward(self, x):
        for layer in self.layers:
            y = layer(x)
            x = torch.cat((x,y), dim=1)
        return x


class DenseBotBlock(nn.Module):

    def __init__(self, n_layer, growth_rate, factor=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(n_layer):
            self.layers.append(nn.Sequential(
                nn.LazyBatchNorm2d(),
                nn.ReLU(),
                nn.LazyConv2d(growth_rate*factor, kernel_size=1, bias=False),
                nn.LazyBatchNorm2d(),
                nn.ReLU(),
                nn.LazyConv2d(growth_rate, kernel_size=3, padding=1, bias=False)
            ))

    def forward(self, x):
        for layer in self.layers:
            y = layer(x)
            x = torch.cat((x,y), dim=1)
        return x


class DenseTransition(nn.Module):

    def __init__(self, n_channel=None):
        """ when n_channel is None, no channel compression """
        super().__init__()
        
        self.transition = nn.Sequential(
            nn.LazyBatchNorm2d(),
            nn.ReLU(),
            nn.LazyConv2d(n_channel,
                          kernel_size=1,
                          bias=False) if n_channel else nn.Sequential(),
            nn.AvgPool2d(kernel_size=2, stride=2)
        )

    def forward(self, x):
        return self.transition(x)
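
A quick check of the channel growth mentioned above (numbers are arbitrary): starting from 24 channels, 4 layers with growth rate 12 end at 24 + 4×12 = 72 channels.

blk = DenseBlock(n_layer=4, growth_rate=12)
print(blk(torch.randn(1, 24, 8, 8)).shape)   # torch.Size([1, 72, 8, 8])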

ResNeXt

ResNeXt [8] introduces grouped convolution into ResNet. Actually, the core idea of ResNeXt is not grouped convolution; it is called Aggregated Residual Transformations, where each transformation path is required to be independent. This just happens to be mathematically the same as grouped convolution.

Grouped convolution is a variation of the standard convolution operation that is crucial for architectures like ResNeXt and MobileNet. It was originally introduced in AlexNet to distribute the model across multiple GPUs; here it is mainly used to improve computational efficiency. Grouped convolution splits both the input feature maps and the convolutional filters into a predefined number of groups (G). The convolution then happens independently within each group, and finally the outputs of all groups are concatenated. Grouped convolution is also a branching design, but it can be realized with a single argument (groups) in PyTorch.

The computational complexity of grouped convolution (a quick parameter-count check follows the list):

  • complexity of convolution: $O(C_{in}\cdot C_{out})$
  • complexity of grouped convolution: $O(g\cdot C_{in}/g \cdot C_{out}/g)=O((C_{in}\cdot C_{out})/g)$
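
A quick parameter-count check of this ratio (channel numbers and group count are arbitrary):

dense   = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=32, bias=False)
print(dense.weight.numel(), grouped.weight.numel())   # 589824 vs 18432, a factor of g=32
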
class ResNeXtBlock(nn.Module):
    """ post-activation ResNeXt bottleneck block """

    def __init__(self, n_group, n_channel, factor, stride, use_1x1conv):
        # the input channel count is unknown here (lazy), so it cannot be checked against n_group
        assert n_channel%factor == 0
        assert (n_channel//factor) % n_group == 0
        assert stride in (1, 2)
        assert factor in (1, 2, 4)
        super().__init__()

        if use_1x1conv:
            self.shortcut = nn.Sequential(
                nn.LazyConv2d(n_channel, kernel_size=1,
                                         stride=stride,
                                         bias=False),
                nn.LazyBatchNorm2d()
            )
        else:
            self.shortcut = nn.Sequential()

        # 1x1 --> 3x3 (grouped conv) --> 1x1
        self.seq = nn.Sequential(
            nn.LazyConv2d(n_channel//factor, kernel_size=1, bias=False),
            nn.LazyBatchNorm2d(),
            nn.ReLU(),
            nn.LazyConv2d(n_channel//factor, kernel_size=3,
                                             padding=1,
                                             stride=stride,
                                             groups=n_group,
                                             bias=False),
            nn.LazyBatchNorm2d(),
            nn.ReLU(),
            nn.LazyConv2d(n_channel, kernel_size=1, bias=False),
            nn.LazyBatchNorm2d()
        )

    def forward(self, x):
        return F.relu(self.seq(x)+self.shortcut(x))

ResNeXt only uses the bottleneck design. The 1x1 convs handle channel compression and expansion, so they cannot be counted as part of the independent transformation paths. A basic ResNet block has two consecutive 3x3 convs; applying grouped convolution to either of them would break the independence of the transformation paths, and the paper shows that such an implementation is essentially a trivially wider ResNet block and loses cardinality. Like ResNet, ResNeXt also has post-activation (V1) and pre-activation (V2) versions.

MobileNet (V1)

Aiming to reduce computational complexity and latency, MobileNet [9] employs Depthwise Separable Convolution. There are two separate steps:

  • depthwise convolution (a grouped conv whose group number equals the number of input channels): each channel is convolved by exactly one filter (like applying a spatial mapping to each channel independently)
  • pointwise convolution (1x1 conv): combines the outputs from all channels

The full convolution operation is factorized into the two separable steps above. Strided convolution is employed in MobileNet to reduce resolution. There is no skip connection in MobileNet V1. (A parameter-count comparison follows the code below.)

class DSConv(nn.Module):
    """ Depthwise Separable Conv of MobileNet V1 """
    
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        
        # depthwise convolution, keep the channel number
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, 
                      stride=stride, padding=1, groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True)
        )
        
        # pointwise convolution, change the channel number
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, 
                      stride=1, padding=0, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x
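
A quick parameter-count comparison against a standard 3x3 convolution with the same input and output widths (128 -> 256, arbitrary numbers; the DSConv total includes its BatchNorm parameters):

standard = nn.Conv2d(128, 256, kernel_size=3, padding=1, bias=False)
ds = DSConv(128, 256)
print(standard.weight.numel(),                     # 294912
      sum(p.numel() for p in ds.parameters()))     # 34688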

MobileNet (V2)

MobileNet V2 [10] introduced an inverted residual and linear bottleneck block.

The inverted residual structure expands the channel width first and then applies a depthwise convolution. Finally, a pointwise convolution shrinks the width back (thin –> wide –> thin). In ResNet’s bottleneck design, the channel width is shrunk first (wide –> thin –> wide); that is why the MobileNet V2 block is called inverted.

class InvertedResidualBlock(nn.Module):
    """ Inverted Residual Block - the core building block of MobileNet V2
    
    Structure:
    1. Expansion:  1x1 conv to expand channels (thin -> wide)
    2. Depthwise:  3x3 depthwise conv for spatial filtering
    3. Projection: 1x1 conv to project back (wide -> thin), no activation
    4. Skip connection is applied only if dimensions match and no resolution reduction
    """
    def __init__(self, in_channels, out_channels, stride, expand_ratio):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden_dim = int(in_channels * expand_ratio)
        layers = []
        
        # expansion
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden_dim, kernel_size=1, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True)
            ])
        
        # depthwise convolution
        layers.extend([
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride,
                      padding=1, groups=hidden_dim, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU(inplace=True)
        ])

        # projection (linear bottleneck - no activation!)
        layers.extend([
            nn.Conv2d(hidden_dim, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels)
            # Note: NO ReLU here! This is the linear bottleneck
        ])
        
        self.blk = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.blk(x)
        else:
            return self.blk(x)

Linear in the MobileNet V2 design means there is no activation at the end of the block. It is important to remove the non-linearity in the narrow layers in order to maintain representational power [10].

The skip connection is only applied when the dimensions match and there is no resolution reduction. The layer structure looks like a post-activation design; however, since there is no activation at the end, it also resembles pre-activation. So people don’t use these two terms when discussing MobileNet.

SENet

The Squeeze-and-Excitation Network (SENet) [11] introduces an ingenious channel weighting mechanism that can be added to existing CNN architectures. People call the SE structure a lightweight attention mechanism.

class SEBlock(nn.Module):
    """  Squeeze-and-Excitation Block  """
    
    def __init__(self, in_channels, reduction_ratio=16):
        super().__init__()
        n_dim = in_channels // reduction_ratio
        
        self.seblk = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, n_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_dim, in_channels, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return x * self.seblk(x)

The SE block is inserted just before the shortcut connection in ResNet, after the activation layer (post-act) or the conv layer (pre-act). The general rule for placing the SE block is: put it where you want the “attention” to happen! If Group Norm (GN) is employed, try not to make the bottleneck dimension smaller than the number of channel groups.

If you change the ReLU in the SE block to GELU (the standard activation in transformer architectures), you might get a small performance boost without introducing any learnable parameters. The MBConv block uses the lightweight SiLU, which needs less computation than GELU. Both SiLU and GELU are smooth functions.

EfficientNet (MBConv)

The core idea of EfficientNet [12] is called Compound Scaling, which is a principled and balanced way to scale convolutional neural networks across three dimensions, depth (number of layers), width (number of channels), and resolution (input image size), by using a single compound coefficient. Previous scaling methods typically focused on one or two dimensions arbitrarily (e.g., deeper networks like ResNet, wider like Wide ResNet, or higher resolution inputs). These approaches often led to diminishing returns because the dimensions interact: increasing depth alone helps with feature complexity but wastes capacity if width or resolution are too small, and vice versa.

The building block of EfficientNet is called the MBConv block, which is essentially the MobileNet V2 inverted residual and linear bottleneck block with two enhancements (a sketch follows the list):

  • employ an SE block after the depthwise conv layer (shown to be the best placement in [14])
  • employ the SiLU (Sigmoid Linear Unit, Swish) activation exclusively throughout the network
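
A minimal sketch of such an MBConv block, assuming the InvertedResidualBlock structure above with an SE block inserted after the depthwise convolution and SiLU activations (the expansion and SE reduction ratios are illustrative defaults, not the exact EfficientNet configuration):

class MBConvBlock(nn.Module):
    """ Sketch: MobileNetV2-style inverted residual + SE after depthwise conv + SiLU """

    def __init__(self, in_channels, out_channels, stride=1, expand_ratio=4, se_ratio=4):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden_dim = in_channels * expand_ratio

        # expansion: thin -> wide
        self.expand = nn.Sequential(
            nn.Conv2d(in_channels, hidden_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.SiLU(inplace=True)
        )
        # depthwise conv for spatial filtering
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride,
                      padding=1, groups=hidden_dim, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.SiLU(inplace=True)
        )
        # SE block right after the depthwise conv
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden_dim, hidden_dim // se_ratio, kernel_size=1),
            nn.SiLU(inplace=True),
            nn.Conv2d(hidden_dim // se_ratio, hidden_dim, kernel_size=1),
            nn.Sigmoid()
        )
        # linear bottleneck: no activation after projection
        self.project = nn.Sequential(
            nn.Conv2d(hidden_dim, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels)
        )

    def forward(self, x):
        y = self.depthwise(self.expand(x))
        y = y * self.se(y)
        y = self.project(y)
        return x + y if self.use_residual else y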

EfficientNet (V2)

ShuffleNet

RegNet

ConvMixer

NLNet & GCNet

The Non-Local Neural Network (NLNet) [17] is a pioneer of attention mechanism design. It tries to solve the problem that conv layers only see a small receptive field. We can stack lots of conv layers to enlarge the receptive field, but that is not cheap. Like the SE block, the NLNet block is a plugin design and can be used in any CNN architecture. The basic idea is that each pixel computes an attention map against all other pixels, and the result is finally added back to the input.

However, researchers found that each pixel’s attention map is almost identical in NLNet. This is reasonable: for a single image, the important areas should be the same for every pixel. Based on this observation, the Global Context Network (GCNet) [18] was designed to replace NLNet as a lightweight global attention mechanism for CNN architectures.

class LazyGC(nn.Module):
    def __init__(self, ratio=16):
        super().__init__()
        self.ratio = ratio
        self.conv_mask = nn.LazyConv2d(1, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)
        self.transform = None

    def forward(self, x):
        batch, channels, _, _ = x.size()

        # --- 1. Context Modeling (Global Attention) ---
        # [B, C, H, W] -> [B, C, H*W]
        input_x = x.view(batch, channels, -1)       # [B, C, N]
        input_x = input_x.unsqueeze(1)              # [B, 1, C, N]
        # shared attention mask
        mask = self.conv_mask(x).view(batch, 1, -1) # [B, 1, N]
        mask = self.softmax(mask).unsqueeze(-1)     # [B, 1, N, 1]
        # weighted global average:   [B, 1, C, 1] --> [B, C, 1, 1]
        context = torch.matmul(input_x, mask).view(batch, channels, 1, 1)

        # --- 2. Transform (Bottleneck with LayerNorm) ---
        if self.transform is None:
            n_dim = max(channels // self.ratio, 1)
            self.transform = nn.Sequential(
                nn.Conv2d(channels, n_dim, kernel_size=1, bias=False),
                nn.GroupNorm(1, n_dim), #nn.LayerNorm((n_dim,1,1)),
                nn.SiLU(inplace=True),
                nn.Conv2d(n_dim, channels, kernel_size=1)
            )
            self.transform.to(x.device)

        # --- 3. Fusion (Addition) ---
        # Each channel adds a single value by broadcast, which is the
        # global context for that channel. It's calculated by weighted
        # sum of each channel and the Shared Attention Mask.
        return x + self.transform(context)

Researchers found that the GC block consistently outperforms the SE block because the GC block does two things at once: it scales the channels and adds global context information to each channel.

ConvNeXt

The researchers conducted a series of experiments, starting from ResNet and borrowing design elements from transformer architectures, especially the Swin Transformer, and ended up with a pure ConvNet that outperforms transformers on some famous benchmark datasets. This is the “ConvNet for the 2020s”.

  • Change the stage compute ratio and make the 3rd stage even more computationally expensive, e.g., from (3,4,6,3) in ResNet-50 to (3,3,9,3) (with smaller channel widths).
  • Change the stem to “patchify”: a 4x4 conv with stride 4, non-overlapping.
  • Depthwise and 1x1 convs.
  • Inverted bottleneck, with the depthwise conv layer moved up.
  • Large kernel size: 7x7 conv with padding 3.
  • GELU and fewer activation layers: only one GELU in each block, like a transformer block.
  • Position-wise layer normalization.
  • Separate downsampling layers: a 2x2 conv with stride 2 and increased channel width (Norm –> Conv).

The implementation of position-wise LN below is from my blog post on normalization, regularization and learning rate scheduling.

class FastPositionwiseLN(nn.Module):
    
    def __init__(self, n_channel):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, n_channel, 1, 1))
        self.beta  = nn.Parameter(torch.zeros(1, n_channel, 1, 1))

    def forward(self, x):
        # calculate mean and variance across the channel dimension (dim=1)
        mean = x.mean(1, keepdim=True)
        var  = x.var(1,  keepdim=True, unbiased=False)  # biased in original paper
        # standardization
        x = (x - mean) / torch.sqrt(var + 1e-8)
        return self.gamma*x + self.beta


class ConvNeXtBlock(nn.Module):
    """ ConvNeXt block without Layer Scale and Drop Path """

    def __init__(self, n_channel):
        super().__init__()
        self.net = nn.Sequential(
            # a non-lazy conv is safer here: groups must equal the (known) channel width
            nn.Conv2d(n_channel, n_channel, kernel_size=7, padding=3, groups=n_channel, bias=False),
            FastPositionwiseLN(n_channel),
            nn.LazyConv2d(n_channel*4, kernel_size=1),
            nn.GELU(),
            nn.LazyConv2d(n_channel, kernel_size=1)
        )
    
    def forward(self, x):
        return x + self.net(x)

Reference

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
  2. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  3. Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
  4. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
  5. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
  6. He, K., Zhang, X., Ren, S., & Sun, J. (2016, September). Identity mappings in deep residual networks. In European conference on computer vision (pp. 630-645). Cham: Springer International Publishing.
  7. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708).
  8. Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1492-1500).
  9. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  10. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510-4520).
  11. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132-7141).
  12. Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR.
  13. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).
  14. Hoang, V. T., & Jo, K. H. (2021, July). Practical analysis on architecture of EfficientNet. In 2021 14th International Conference on Human System Interaction (HSI) (pp. 1-4). IEEE.
  15. Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Cham: Springer International Publishing.
  16. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  17. Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7794-7803).
  18. Cao, Y., Xu, J., Lin, S., Wei, F., & Hu, H. (2019). Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF international conference on computer vision workshops (pp. 0-0).

Appendix

ReLU6

$\mathrm{ReLU6}(x)=\min(\max(0,x),6)$
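
A quick check that ReLU6 is simply a clamp to [0, 6]:

x = torch.linspace(-3, 9, steps=13)
print(torch.equal(nn.ReLU6()(x), torch.clamp(x, min=0, max=6)))   # True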

MobileNet (introduced in 2017 by Google) is optimized for mobile and embedded vision applications, emphasizing low latency, small model size, and efficient inference on resource-constrained hardware. ReLU6 was chosen over CNN standard ReLU for several key reasons tied to these goals:

  • Quantization Compatibility: MobileNets are designed for deployment with low-precision quantization (e.g., 8-bit integers) to speed up inference and reduce the memory footprint. ReLU6’s output range (0–6) maps efficiently to unsigned 8-bit values (0–255), since the 8 bits only have to cover floating-point values from 0 to 6, minimizing quantization error and preserving accuracy during fixed-point arithmetic, which is common on mobile CPUs/GPUs. The “6” is an empirical choice for this bit-width compression. Standard ReLU’s unbounded positive range can lead to overflow or higher error in such setups.
  • Numerical Stability: By capping activations at 6, ReLU6 prevents extreme values from propagating through the network, reducing the risk of gradient explosions or instability during training, especially in deeper or wider models like MobileNet’s depthwise separable convolutions. This “keeps values small and within a manageable range.”
  • Mobile Hardware Efficiency: On edge devices (e.g., smartphones), where floating-point operations are expensive, ReLU6’s bounded range simplifies optimizations in frameworks like TensorFlow Lite. It was a deliberate choice in the original MobileNet architecture to align with quantized training pipelines.

Full Pre-Activation ResNet-50 Implementation

This implementation reached 74.6% top-1 accuracy when trained on 4 GPUs with the standard 90-epoch recipe and other standard configurations.

class LazyBatchNormAct2d(nn.Module):

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.LazyBatchNorm2d(),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.net(x)


class PreBotResBlock(nn.Module):

    def __init__(self, n_channel, stride=1, factor=4, use_1x1conv=False):
        assert n_channel % factor == 0
        assert stride in (1, 2)
        super().__init__()
        self.use_1x1conv = use_1x1conv
        self.bna = LazyBatchNormAct2d()

        if use_1x1conv:
            self.shortcut = nn.LazyConv2d(n_channel, kernel_size=1,
                                                     stride=stride,
                                                     bias=False)
        else:
            self.shortcut = nn.Identity()

        self.seq = nn.Sequential(
            nn.LazyConv2d(n_channel//factor, kernel_size=1, bias=False),
            LazyBatchNormAct2d(),
            nn.LazyConv2d(n_channel//factor, kernel_size=3,
                                             stride=stride,
                                             padding=1,
                                             bias=False),
            LazyBatchNormAct2d(),
            nn.LazyConv2d(n_channel, kernel_size=1, bias=False)
        )

    def forward(self, x):
        y = self.bna(x)
        z = self.shortcut(y) if self.use_1x1conv else self.shortcut(x)
        return self.seq(y) + z


class PreResNet50(nn.Module):
    """ Full Pre-activation bottleneck ResNet-50 for ImageNet-1k
        
        Implemented by Xinlin Zhang (https://xinlin-z.github.io)
    """

    def __init__(self):
        super().__init__()

        stages = nn.Sequential(
            # stage 1, 3 blocks
            PreBotResBlock(256, use_1x1conv=True),
            PreBotResBlock(256),
            PreBotResBlock(256),
            # stage 2, 4 blocks
            PreBotResBlock(512, stride=2, use_1x1conv=True),
            PreBotResBlock(512),
            PreBotResBlock(512),
            PreBotResBlock(512),
            # stage 3, 6 blocks
            PreBotResBlock(1024, stride=2, use_1x1conv=True),
            PreBotResBlock(1024),
            PreBotResBlock(1024),
            PreBotResBlock(1024),
            PreBotResBlock(1024),
            PreBotResBlock(1024),
            # stage 4, 3 blocks
            PreBotResBlock(2048, stride=2, use_1x1conv=True),
            PreBotResBlock(2048),
            PreBotResBlock(2048),
        )

        self.net = nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # (64,56,56)
            stages,
            nn.AdaptiveAvgPool2d(1),
            nn.LazyConv2d(1000, kernel_size=1),
            nn.Flatten(start_dim=1)
        )

    def forward(self, x):
        return self.net(x)

Deconv vs. Transposed Conv vs. Upsampling+Conv

Deconvolution is the reversed operation of standard convolution.

Transposed convolution is not deconvolution, although in some contexts the two terms are used interchangeably. They are similar only in the sense that they produce the same output spatial dimensions. Transposed convolution doesn’t reverse the standard convolution by values, only by dimensions. It does exactly what a standard convolutional layer does, but on a modified input feature map.

How is the input feature map modified?

  • $s$ is the stride, $p$ the padding and $k$ the kernel size of the forward convolution; imagine they were applied to the output to generate the input.
  • Between each row and column of the input, insert $z=s-1$ zeros.
  • Pad the zero-inserted feature map with $p'=k-p-1$ zeros on each side.
  • Carry out a standard convolution on the padded feature map with stride 1 (a shape check follows below).
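
A quick shape check consistent with the recipe above, using k=3, s=2, p=1 on a 16x16 input: inserting one zero between the 16 columns gives 31, padding by $k-p-1=1$ on each side gives 33, and a stride-1 3x3 convolution yields 31.

tconv = nn.ConvTranspose2d(3, 3, kernel_size=3, stride=2, padding=1)
print(tconv(torch.randn(1, 3, 16, 16)).shape)   # torch.Size([1, 3, 31, 31])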

Upsampling+Conv is like transposed conv, but it employs an upsampling algorithm (bilinear, bicubic or nearest neighbor) to resize the input feature maps and then follows with a standard 3x3 conv layer. Modern CNN architectures prefer this approach because (a minimal sketch follows the list):

  • the output is smoother with more natural boundaries (it might blur sharp edges) and has no checkerboard artifacts (which transposed conv often produces)
  • more efficient and stable
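
A minimal sketch of the Upsampling+Conv alternative (the bilinear mode and 2x scale are illustrative choices):

class UpsampleConv(nn.Module):
    """ Sketch: interpolation upsampling followed by a standard 3x3 conv """

    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        )

    def forward(self, x):
        return self.net(x)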

Weights Init for CNNs

Kaiming initialization is recommended for more stable training and faster convergence. It goes by a few other names:

  • Kaiming initialization
  • MSRA initialization
  • He initialization
  • Kaiming normalization

They are all the same thing!

Code pattern:

class XXNet(nn.Module):

    def __init__(self, *args, **kwargs):
        super().__init__()
        ...  # code for building layers (non-lazy)
        self.apply(self.init_conv_kaiming_normal_relu)
    
    def init_conv_kaiming_normal_relu(self, m):
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    
    def forward(self, x):
        ...