GoogLeNet Study Notes

Paper Information

Authors:

Christian Szegedy (Google Inc.)
Wei Liu (University of North Carolina, Chapel Hill)
Yangqing Jia (Google Inc.)
Pierre Sermanet (Google Inc.)
Scott Reed (University of Michigan)
Dragomir Anguelov (Google Inc.)
Dumitru Erhan (Google Inc.)
Vincent Vanhoucke (Google Inc.)
Andrew Rabinovich (Google Inc.)

Abstract:

We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

Paper Walkthrough

1. Introduction

Skipped.

3. Motivation and High Level Considerations

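Clearly, the most direct way to improve a deep network is to make it deeper (depth) and wider (width), but doing so naively runs into two problems: overfitting and a blow-up in computational cost.

To address both problems, the authors propose using sparsely connected deep networks. This has theoretical support: Arora et al. show that if the data distribution can be represented by a very sparse deep network, the optimal topology can be built layer by layer by analyzing the correlations between neuron activations. The rigorous proof needs strong conditions, but the result is consistent with the Hebbian principle: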

Neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.

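So we just go for it... or do we?

The theory may hold, but sparse structures are inefficient on real hardware. On current CPUs and GPUs, the cache misses and indexing overhead caused by sparse structures far outweigh the arithmetic savings, whereas highly optimized dense GEMM is overwhelmingly faster on existing hardware.

So the authors ask whether they can have it both ways: stay structurally close to a sparse network while still computing with efficient dense operations.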

The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication.

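Inspired by this, the paper's lead author designed and studied the Inception architecture. It began as a compromise that approximates the theoretical sparse structure with dense modules, but in practice, after only a few iterations on the topology, it already outperformed the reference network, and with further tuning it did especially well on detection and localization. The architecture held up under scrutiny: after repeated experiments, it turned out to be at least locally optimal.

The authors remain cautious, though: Inception's success does not necessarily come from the theory above, and the approach needs more thorough analysis and verification. Still, that success at least shows that the "sparse structure + dense computation" direction is worth exploring.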

4. Architectural Details

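Straight to the figure:

Convolution kernels of different sizes extract multi-scale features; 1×1 convolutions reduce dimensionality to cut computation (and, since each is followed by ReLU, they also add non-linearity); the outputs of all branches are concatenated along the channel dimension.

That is the core Inception module.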
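
To make the "reduce dimensionality to cut computation" point concrete, here is a rough multiply-accumulate count for a 5×5 branch with and without a 1×1 reduction in front. The channel numbers (192 in, 16 reduced, 32 out) are the inception (3a) values used in the PyTorch code later in these notes; the 28×28 feature-map size and the helper name `conv_macs` are assumptions made for illustration, and biases and activations are ignored.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates of a k x k convolution producing an h x w output map."""
    return h * w * k * k * c_in * c_out

# 5x5 branch of inception (3a): 192 input channels, 32 output channels,
# optionally squeezed to 16 channels by a 1x1 convolution first.
H = W = 28  # assumed spatial size at this stage

direct = conv_macs(H, W, 5, 192, 32)
reduced = conv_macs(H, W, 1, 192, 16) + conv_macs(H, W, 5, 16, 32)

print(direct)                      # 120422400
print(reduced)                     # 12443648
print(round(direct / reduced, 2))  # 9.68
```

Under these assumptions the reduced branch needs roughly a tenth of the arithmetic, which is what lets the module afford a 5×5 path at all.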

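For the network as a whole:

Inception modules are stacked on top of each other.
Max-pooling layers with stride 2 reduce the spatial size.
The lower layers use traditional convolutions; Inception modules appear only in the middle and higher layers (not strictly necessary).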
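
The effect of the stride-2 layers can be traced with plain arithmetic. The sketch below assumes the layer settings of the PyTorch reproduction later in these notes (a 7×7 stride-2 stem convolution with padding 3, then four 3×3 stride-2 max-pools with ceil rounding); `out_size` is a hypothetical helper that uses the usual simplified output-size formula and ignores PyTorch's extra edge-case rule for `ceil_mode`.

```python
import math

def out_size(n, kernel, stride, padding=0, ceil_mode=False):
    """Output length of a conv/pool layer along one spatial axis."""
    span = (n + 2 * padding - kernel) / stride
    steps = math.ceil(span) if ceil_mode else math.floor(span)
    return steps + 1

size = 224
trace = [size]

# Stem convolution: 7x7, stride 2, padding 3.
size = out_size(size, 7, 2, padding=3)
trace.append(size)

# Four 3x3 stride-2 max-pools; the stride-1 layers between them
# (1x1 / padded 3x3 convs, Inception modules) keep the spatial size.
for _ in range(4):
    size = out_size(size, 3, 2, ceil_mode=True)
    trace.append(size)

print(trace)  # [224, 112, 56, 28, 14, 7]
```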

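The advantages of this architecture:

Depth and width grow substantially while computational complexity stays under control.
It matches the intuition that visual information should be extracted by fusing features at different scales.
Inception modules are highly tunable, so performance can be traded against compute as needed.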

5. GoogLeNet

Parameter table:

Architecture diagram:

Or go to Netscope for a more detailed view of the network structure.

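Key points:

Input: 224×224 RGB with mean subtraction.
All convolutions use ReLU.
The network is 22 layers deep counting only layers with parameters, or 27 if pooling layers are counted too.
The classifier's fc layer is preceded by avg-pool + dropout.

To mitigate vanishing gradients in a deep network and make the intermediate layers more discriminative, two auxiliary classifiers are added (attached after Inception (4a) and (4d)):

Training: the auxiliary classifiers' losses are added to the total loss with weight 0.3, which strengthens the gradient signal and acts as regularization.
Inference: these auxiliary branches are discarded (only the main branch is used).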
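
A minimal sketch of how the three losses might be combined. The 0.3 weight is the value given in the paper; the function name `googlenet_loss` and its `training` flag are hypothetical, and plain floats stand in for loss tensors (the same arithmetic applies to them).

```python
def googlenet_loss(main_loss, aux1_loss, aux2_loss, aux_weight=0.3, training=True):
    """Main loss plus down-weighted auxiliary losses; auxiliaries count only in training."""
    if not training:
        # At inference/evaluation time the auxiliary branches are dropped.
        return main_loss
    return main_loss + aux_weight * (aux1_loss + aux2_loss)

train_total = googlenet_loss(2.0, 3.0, 4.0)                 # 2.0 + 0.3 * (3.0 + 4.0)
eval_total = googlenet_loss(2.0, 3.0, 4.0, training=False)  # main loss only
print(round(train_total, 4), eval_total)
```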

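Structure of the auxiliary classifier:

Average pooling (5×5 filter, stride = 3), giving a 4×4×512 output for (4a) and 4×4×528 for (4d).
A 1×1 conv with 128 filters for dimension reduction, followed by ReLU.
FC with 1024 units, ReLU, dropout (70%).
A linear layer with softmax loss as the classifier (removed at inference time).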
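
The shapes in the auxiliary head follow from a short calculation. This sketch assumes a 14×14 input feature map (as noted in the comments of the PyTorch code later in these notes) and the paper's 5×5, stride-3 average pooling; `pool_out` is a hypothetical helper.

```python
def pool_out(n, kernel, stride):
    """Output length of an unpadded pooling layer along one axis (floor rounding)."""
    return (n - kernel) // stride + 1

side = pool_out(14, kernel=5, stride=3)  # spatial side after average pooling
flat = 128 * side * side                 # features fed to the FC layer after the 1x1 conv

print(side, flat)  # 4 2048
```

This is why the first fully connected layer of the auxiliary classifier takes 128 * 4 * 4 inputs.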

6. Training Methodology

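GoogLeNet was trained with the DistBelief distributed system, using asynchronous SGD with 0.9 momentum and a fixed learning-rate schedule (the learning rate decreases by 4% every 8 epochs). The final model used at inference time was obtained by Polyak averaging.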
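
The schedule and the averaging step can be sketched in a few lines. This is only an illustration: the paper's schedule is a 4% decrease every 8 epochs, but the base learning rate of 0.01 below is a made-up value, and Polyak averaging is assumed here to be a plain running mean of the parameters (the names `step_decay_lr` and `PolyakAverage` are hypothetical).

```python
def step_decay_lr(base_lr, epoch, drop=0.04, every=8):
    """Learning rate after shrinking by `drop` once per `every` completed epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)

class PolyakAverage:
    """Running arithmetic mean of a flat parameter list across training steps."""
    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, params):
        self.count += 1
        if self.mean is None:
            self.mean = list(params)
        else:
            # Incremental mean: m += (p - m) / t
            self.mean = [m + (p - m) / self.count for m, p in zip(self.mean, params)]
        return self.mean

lr_epoch_7 = step_decay_lr(0.01, 7)    # still in the first 8-epoch window
lr_epoch_16 = step_decay_lr(0.01, 16)  # two decays applied

avg = PolyakAverage()
for params in ([1.0, 0.0], [3.0, 2.0], [5.0, 4.0]):
    avg.update(params)
print(avg.mean)  # [3.0, 2.0]
```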

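Training relied heavily on data augmentation, including random crops (covering 8% to 100% of the image area, with aspect ratio between 3/4 and 4/3), photometric/color distortions, and several randomly chosen interpolation methods. Sampling methods, hyperparameters and training strategies were adjusted many times along the way, much like alchemy, so no single best recipe can be distilled. What is known is that multi-scale cropping and photometric distortions help noticeably against overfitting.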
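
A rough sketch of the crop sampling, assuming the paper's ranges (8% to 100% of the image area, aspect ratio between 3/4 and 4/3). The paper does not spell out the sampling details, so the uniform draws, the retry loop and the whole-image fallback below are assumptions, and `sample_crop` is a hypothetical name.

```python
import math
import random

def sample_crop(width, height, rng, area_range=(0.08, 1.0),
                ratio_range=(3 / 4, 4 / 3), max_tries=10):
    """Sample (left, top, crop_w, crop_h) with a random area fraction and aspect ratio."""
    for _ in range(max_tries):
        area = rng.uniform(*area_range) * width * height
        ratio = rng.uniform(*ratio_range)  # crop_w / crop_h
        crop_w = int(round(math.sqrt(area * ratio)))
        crop_h = int(round(math.sqrt(area / ratio)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)   # randint is inclusive on both ends
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Every sampled box overflowed the image: fall back to the full image.
    return 0, 0, width, height

rng = random.Random(0)
print(sample_crop(500, 375, rng))
```

The returned box would then be cropped and resized to the network's input size.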

7. ILSVRC 2014 Classification Challenge Setup and Results

The task is to classify images into the 1000 ImageNet classes, with top-5 error as the ranking metric.
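
For reference, top-5 error counts a prediction as correct when the true label is among the five highest-scoring classes. A minimal sketch with toy scores over 6 classes (the function name and the data are made up for illustration):

```python
def top5_error(scores, labels):
    """Fraction of samples whose true label is not among the 5 highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        misses += label not in ranked[:5]
    return misses / len(labels)

scores = [
    [0.10, 0.20, 0.30, 0.05, 0.15, 0.20],  # true class 3 has the lowest score: miss
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],  # true class 0 has the highest score: hit
]
labels = [3, 0]
print(top5_error(scores, labels))  # 0.5
```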

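GoogLeNet used no external data and achieved the best result with the following strategy:

Train 7 GoogLeNet models (including one wider version) and ensemble them;
At test time, use multi-scale, multi-position cropping (4 scales × multiple crops × mirroring, 144 crops per image);
Combine the predictions across all models and all crops.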
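
The 144 crops per image come from a simple product. The breakdown below (4 scales, 3 squares per scale, 6 crops per square, 2 mirrorings) and the scale values follow the paper's description; the string labels are informal names chosen here, and averaging the probabilities is assumed as the way of combining predictions.

```python
from itertools import product

scales = (256, 288, 320, 352)  # shorter image side after resizing
squares = ("left/top", "center", "right/bottom")
crops = ("corner-tl", "corner-tr", "corner-bl", "corner-br", "center", "full-resized")
mirrored = (False, True)

views = list(product(scales, squares, crops, mirrored))
print(len(views))  # 144

def average_predictions(prob_rows):
    """Element-wise mean of several probability vectors (one per crop or model)."""
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]

print(average_predictions([[0.25, 0.75], [0.75, 0.25]]))  # [0.5, 0.5]
```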

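The final top-5 error was 6.67% on both the validation and test sets, a new state of the art, confirming how strong the Inception architecture is under a tight computational budget.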

8. ILSVRC 2014 Detection Challenge Setup and Results

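The detection task is to predict object classes and bounding boxes among 200 classes; the evaluation metric is mAP (mean average precision).

GoogLeNet's detection approach builds on the R-CNN framework, but:

uses the Inception network as the region classifier;
combines Selective Search with MultiBox to raise proposal recall;
cuts the number of proposals while improving coverage, in order to reduce false positives;
classifies each proposal region with an ensemble of 6 models.

Without contextual information or bounding box regression, the single-model mAP is already competitive, and ensembling lifts performance from about 40% to 43.9%, which shows what the Inception architecture is capable of.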

9. Conclusions

Our results seem to yield a solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.

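Sparse, modular network structures are worth researching.

Paper Reproduction

GoogLeNet in PyTorch

Based on this article, plus modernization suggestions from an AI:

Add Batch Normalization;
Use nn.AdaptiveAvgPool2d to support arbitrary input sizes.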

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicConv2d(nn.Module):
    """Basic convolution block: Conv + BN + ReLU."""
    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias = False, **kwargs) # BN follows, so the conv bias is redundant
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace = True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

class Inception(nn.Module):
    """Inception module."""
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()
        # 1x1
        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size = 1)
        
        # 1x1 reduce -> 3x3
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size = 1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size = 3, padding = 1)
        )

        # 1x1 reduce -> 5x5
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size = 1),
            BasicConv2d(ch5x5red, ch5x5, kernel_size = 5, padding = 2)
        )

        # 3x3 max-pool -> 1x1
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size = 3, stride = 1, padding = 1),
            BasicConv2d(in_channels, pool_proj, kernel_size = 1)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        # Concatenate the branch outputs along the channel dimension (dim 1)
        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, 1)

class InceptionAux(nn.Module):
    """Inception auxiliary classifier."""
    def __init__(self, in_channels, num_classes):
        super(InceptionAux, self).__init__()
        self.avgpool = nn.AdaptiveAvgPool2d((4, 4))
        self.conv = BasicConv2d(in_channels, 128, kernel_size = 1) # output[batch, 128, 4, 4]
        
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
        self.dropout = nn.Dropout(0.7)
    
    def forward(self, x):
        # aux1: N x 512 x 14 x 14, aux2: N x 528 x 14 x 14
        x = self.avgpool(x)
        # aux1: N x 512 x 4 x 4, aux2: N x 528 x 4 x 4
        x = self.conv(x)
        # N x 128 x 4 x 4
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x), inplace = True)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

class GoogLeNet(nn.Module):
    def __init__(self, num_classes = 1000, aux_logits = True, init_weights = True):
        super(GoogLeNet, self).__init__()
        self.aux_logits = aux_logits

        # 1. Initial conv layers: downsample the 224x224 ImageNet input
        self.conv1 = BasicConv2d(3, 64, kernel_size = 7, stride = 2, padding = 3)
        self.maxpool1 = nn.MaxPool2d(3, stride = 2, ceil_mode = True)
        
        self.conv2 = BasicConv2d(64, 64, kernel_size = 1)
        self.conv3 = BasicConv2d(64, 192, kernel_size = 3, padding = 1)
        self.maxpool2 = nn.MaxPool2d(3, stride = 2, ceil_mode = True)

        # 2. Inception blocks: just follow the parameter table.
        # Format: in_channels, 1x1, 3x3red, 3x3, 5x5red, 5x5, pool_proj
        self.inception3a = Inception(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride = 2, ceil_mode = True)

        self.inception4a = Inception(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(3, stride = 2, ceil_mode = True)

        self.inception5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = Inception(832, 384, 192, 384, 48, 128, 128)

        # 3. Auxiliary classifiers: two of them, attached to the outputs of 4a and 4d
        self.aux1 = InceptionAux(512, num_classes) if aux_logits else None
        self.aux2 = InceptionAux(528, num_classes) if aux_logits else None
        
        # 4. Final classifier: global average pooling + fully connected layer
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(1024, num_classes)

        if init_weights:
            self._initialize_weights()
        
    def forward(self, x):
        # N x 3 x 224 x 224
        x = self.conv1(x)
        x = self.maxpool1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.maxpool2(x)

        # Inception 3
        x = self.inception3a(x)
        x = self.inception3b(x)
        x = self.maxpool3(x)

        # Inception 4
        x = self.inception4a(x)

        # Auxiliary output 1
        if self.training and self.aux_logits:
            aux1 = self.aux1(x)

        x = self.inception4b(x)
        x = self.inception4c(x)
        x = self.inception4d(x)

        # Auxiliary output 2
        if self.training and self.aux_logits:
            aux2 = self.aux2(x)

        x = self.inception4e(x)
        x = self.maxpool4(x)

        # Inception 5
        x = self.inception5a(x)
        x = self.inception5b(x)

        # Classification output
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.dropout(x)
        x = self.fc(x)

        # When training with auxiliary classifiers enabled, return three outputs
        if self.training and self.aux_logits:
            return x, aux2, aux1
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)
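
A quick sanity check on the numbers above, without needing PyTorch: each Inception module outputs ch1x1 + ch3x3 + ch5x5 + pool_proj channels, and that sum has to equal the next module's in_channels (max-pooling does not change the channel count). The tuples below are copied from the constructor calls in `GoogLeNet.__init__`; `CONFIGS` and `out_channels` are names introduced here for illustration.

```python
# (name, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj)
CONFIGS = [
    ("3a", 192, 64, 96, 128, 16, 32, 32),
    ("3b", 256, 128, 128, 192, 32, 96, 64),
    ("4a", 480, 192, 96, 208, 16, 48, 64),
    ("4b", 512, 160, 112, 224, 24, 64, 64),
    ("4c", 512, 128, 128, 256, 24, 64, 64),
    ("4d", 512, 112, 144, 288, 32, 64, 64),
    ("4e", 528, 256, 160, 320, 32, 128, 128),
    ("5a", 832, 256, 160, 320, 32, 128, 128),
    ("5b", 832, 384, 192, 384, 48, 128, 128),
]

def out_channels(cfg):
    """Channels after concatenating the four branches of one Inception module."""
    _, _, ch1x1, _, ch3x3, _, ch5x5, pool_proj = cfg
    return ch1x1 + ch3x3 + ch5x5 + pool_proj

# Every module's output width must match the next module's input width.
for prev, nxt in zip(CONFIGS, CONFIGS[1:]):
    assert out_channels(prev) == nxt[1], (prev[0], nxt[0])

print(out_channels(CONFIGS[-1]))  # 1024, the input size of the final fc layer
```

The same check explains the auxiliary heads' input widths: (4a) outputs 512 channels and (4d) outputs 528.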

Training & Evaluation

To be written (procrastinating for now).