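Deep Learning and Image Recognition: Principles and Practice (Hands-On Edition)

This article surveys the core techniques and practical methods of deep learning for image recognition. It starts with the evolution of CNN architectures, then covers training techniques, transfer learning, object detection, and semantic segmentation, and closes with a complete PyTorch pipeline for training an image classifier and running inference.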

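I. The Evolution of CNN Architectures

1.1 AlexNet (2012): the starting point of the deep learning revolution

AlexNet was proposed in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It won the ImageNet competition (ILSVRC 2012) with a top-5 error rate of 15.3%, against 26.2% for the runner-up, a margin that stunned the computer vision community.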

AlexNet architecture (the paper states a 224x224x3 input; the sizes below correspond to an effective 227x227x3 input, or 224x224 with padding=2 on Conv1):

Conv1: 96 kernels of 11x11, stride=4 → 55x55x96
MaxPool: 3x3, stride=2 → 27x27x96
Conv2: 256 kernels of 5x5, stride=1, padding=2 → 27x27x256
MaxPool: 3x3, stride=2 → 13x13x256
Conv3: 384 kernels of 3x3, stride=1, padding=1 → 13x13x384
Conv4: 384 kernels of 3x3, stride=1, padding=1 → 13x13x384
Conv5: 256 kernels of 3x3, stride=1, padding=1 → 13x13x256
MaxPool: 3x3, stride=2 → 6x6x256
FC6: 4096 + Dropout(0.5)
FC7: 4096 + Dropout(0.5)
FC8: 1000 + Softmax
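
The spatial sizes in the layer list follow from the standard formula out = floor((in + 2*padding - kernel) / stride) + 1. The sketch below traces them; the helper names are illustrative and not part of any library. Under these assumptions, 55x55 after Conv1 results from a 227x227 input without padding, while an unpadded 224x224 input would give 54x54.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def alexnet_spatial_sizes(input_size):
    """Side length after Conv1 and after each of the three max-pool layers."""
    s = out_size(input_size, kernel=11, stride=4)    # Conv1
    sizes = [s]
    s = out_size(s, kernel=3, stride=2)              # MaxPool 1
    sizes.append(s)
    s = out_size(s, kernel=5, stride=1, padding=2)   # Conv2 keeps the size
    s = out_size(s, kernel=3, stride=2)              # MaxPool 2
    sizes.append(s)
    s = out_size(s, kernel=3, stride=1, padding=1)   # Conv3-5 keep the size
    s = out_size(s, kernel=3, stride=2)              # MaxPool 3
    sizes.append(s)
    return sizes


print(alexnet_spatial_sizes(227))  # [55, 27, 13, 6]
print(alexnet_spatial_sizes(224))  # [54, 26, 12, 5]
```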

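Key technical contributions:
- ReLU activation (replacing tanh/sigmoid; it does not saturate for positive inputs, which mitigates vanishing gradients and speeds up training)
- Dropout regularization (reduces overfitting)
- Overlapping pooling (3x3 windows with stride 2)
- Data augmentation (random crops + horizontal flips + PCA color augmentation)
- Training split across two GPUs (a GTX 580 had only 3 GB of memory at the time)
- Local response normalization (LRN; later superseded by BatchNorm)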

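1.2 VGG (2014): simple, deep, and elegant

VGG was proposed by the Visual Geometry Group at the University of Oxford. Its core idea is to replace large convolution kernels with stacks of small 3x3 kernels, which preserves the receptive field while adding depth and nonlinearity.

VGG16 architecture (input 224x224x3):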

Block 1: Conv3-64 → Conv3-64 → MaxPool  → 112x112x64
Block 2: Conv3-128 → Conv3-128 → MaxPool → 56x56x128
Block 3: Conv3-256 → Conv3-256 → Conv3-256 → MaxPool → 28x28x256
Block 4: Conv3-512 → Conv3-512 → Conv3-512 → MaxPool → 14x14x512
Block 5: Conv3-512 → Conv3-512 → Conv3-512 → MaxPool → 7x7x512
FC: 4096 → 4096 → 1000

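Receptive field analysis:
- Two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution
- Three stacked 3x3 convolutions have the same receptive field as one 7x7 convolution
- But with fewer parameters: three 3x3 layers need 27C^2 weights, one 7x7 layer needs 49C^2 (C input and C output channels, biases ignored)
- And more nonlinearities (3 ReLUs vs. 1 ReLU)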
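
The receptive-field and parameter claims above can be checked with a few lines of arithmetic. This is a minimal sketch for stride-1 convolutions with C input and C output channels; the function names and the choice C = 64 are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights of one kernel x kernel conv with C input and C output channels (no bias)."""
    return kernel * kernel * channels * channels


C = 64
print(stacked_receptive_field(2), stacked_receptive_field(3))  # 5 7
print(3 * conv_weights(3, C), conv_weights(7, C))               # 110592 200704
```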

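Downsides of VGG:
- Very large parameter count (VGG16 ≈ 138M parameters)
- Heavy memory footprint (the three fully connected layers hold about 89% of all parameters; FC6 and FC7 alone hold about 86%)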
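
To see where the parameters sit, the count can be reproduced from the VGG16 layer list above. This is a sketch that assumes every convolution and fully connected layer has a bias term; the constant and function names are illustrative.

```python
# (in_channels, out_channels) of the 13 conv layers, in order
VGG16_CONVS = [(3, 64), (64, 64), (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]
# (in_features, out_features) of FC6, FC7, FC8
VGG16_FCS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out  # weights + biases


def fc_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + biases


conv_total = sum(conv_params(i, o) for i, o in VGG16_CONVS)
fc_total = sum(fc_params(i, o) for i, o in VGG16_FCS)
total = conv_total + fc_total

print(total)                       # 138357544
print(round(fc_total / total, 3))  # 0.894
```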

A VGG-style block in PyTorch:

import torch.nn as nn

class VGGBlock(nn.Module):
    def __init__(self, in_channels, out_channels, num_convs):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers.append(nn.Conv2d(
                in_channels if i == 0 else out_channels,
                out_channels, kernel_size=3, padding=1
            ))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

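1.3 Inception/GoogLeNet (v1-v4): multi-scale feature fusion

Inception v1 (GoogLeNet) won ILSVRC 2014. Its core innovation is the Inception module, which applies convolutions at several kernel sizes in parallel within a single layer.

Inception v1 module:
- 1x1 convolution: dimensionality reduction + cross-channel mixing
- 3x3 convolution: medium receptive field
- 5x5 convolution: large receptive field
- 3x3 max pooling: a parallel pooling branch that adds extra spatial information
- The outputs of all branches are concatenated

Auxiliary classifiers:
- Classification heads attached to intermediate layers
- Provide extra gradient signal during training (mitigating vanishing gradients in deep networks)
- Removed at inference time

Improvements from Inception v2 to v3:
- Replace each 5x5 convolution with two 3x3 convolutions (the same idea as VGG)
- Factorize nxn convolutions into 1xn followed by nx1 (fewer parameters)
- Use BatchNorm
- Label smoothing regularization

Inception v4 + Inception-ResNet:
- Add residual connections to the Inception structure
- Deeper networks + faster convergence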

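1.4 ResNet (2015): the residual learning breakthrough

ResNet, proposed by Kaiming He et al., uses residual (skip) connections to address the degradation problem, in which accuracy gets worse as plain networks get deeper.

Core formula of a residual block:
y = F(x, {Wi}) + x

Here F is the residual mapping to be learned, and x is carried over unchanged by the identity mapping.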

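Why does ResNet work?
1. A gradient highway: during backpropagation the gradient flows directly through the identity path,
   so it reaches shallow layers even when the gradient through F is small
2. Identity as a safe default: even if the added layers learn nothing (F → 0),
   the block reduces to the identity, so the deeper model should be no worse than the shallower one
3. Residuals are easier to optimize: F(x) = H(x) - x is easier to learn than H(x),
   because the residual is usually closer to zero than the original mapping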
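
The first point can be illustrated with a toy scalar model. In this sketch each layer is assumed to have the same small local derivative dF/dx, so a plain chain multiplies the gradient by that value at every layer, while a residual chain multiplies it by (1 + dF/dx). The function name and the numbers are illustrative only; real networks are not this simple.

```python
def chain_gradient(depth, local_grad, residual):
    """d(output)/d(input) for a chain of scalar layers with dF/dx = local_grad."""
    grad = 1.0
    for _ in range(depth):
        # Residual layer: y = x + F(x), so dy/dx = 1 + dF/dx
        # Plain layer:    y = F(x),     so dy/dx = dF/dx
        grad *= (1.0 + local_grad) if residual else local_grad
    return grad


plain = chain_gradient(20, 0.01, residual=False)
res = chain_gradient(20, 0.01, residual=True)
print(f"plain: {plain:.3e}, residual: {res:.3f}")
```

With 20 layers and a local derivative of 0.01, the plain chain's gradient is vanishingly small, while the residual chain's gradient stays close to 1.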

Bottleneck block (ResNet-50/101/152):

class Bottleneck(nn.Module):
    expansion = 4  # output channels = planes * expansion

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super().__init__()
        # 1x1 conv: reduce channels
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        # 3x3 conv: spatial convolution (carries the stride)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        # 1x1 conv: expand channels
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        # Optional projection so the shortcut matches the output shape
        self.downsample = downsample

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # residual connection
        out = self.relu(out)
        return out
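
To make the purpose of the 1x1 reduce/expand pair concrete, the sketch below compares the convolution weights of one bottleneck block with those of two plain 3x3 convolutions at the same outer width. The helper names are illustrative; BatchNorm parameters are ignored, and the convolutions have no bias, as in the block above.

```python
def conv_weights(c_in, c_out, k):
    return c_in * c_out * k * k  # bias=False


def bottleneck_weights(inplanes, planes, expansion=4):
    return (conv_weights(inplanes, planes, 1)              # 1x1 reduce
            + conv_weights(planes, planes, 3)              # 3x3 spatial
            + conv_weights(planes, planes * expansion, 1))  # 1x1 expand


def two_3x3_weights(channels):
    return 2 * conv_weights(channels, channels, 3)


print(bottleneck_weights(256, 64))  # 69632
print(two_3x3_weights(256))         # 1179648
```

At 256 channels the bottleneck needs roughly 17 times fewer convolution weights than two full-width 3x3 layers.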

ResNet family: parameter counts and ImageNet accuracy:

Model	Layers	Parameters	Top-1 accuracy	Top-5 accuracy
ResNet-18	18	11.7M	69.76%	89.08%
ResNet-34	34	21.8M	73.31%	91.42%
ResNet-50	50	25.6M	76.13%	92.86%
ResNet-101	101	44.5M	77.37%	93.55%
ResNet-152	152	60.2M	78.31%	94.05%

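1.5 DenseNet (2017): dense connectivity

DenseNet takes skip connections to the extreme: within a dense block, the input to each layer is the concatenation of the outputs of all preceding layers.

Core DenseNet formula:
x_l = H_l([x_0, x_1, ..., x_{l-1}])

where H_l is a composite function: BN → ReLU → 3×3 Conv

Key parameters:
- Growth rate (k): the number of feature-map channels each layer adds (usually small, e.g. k=32)
- Input channels of layer l: k0 + k × (l-1), where k0 is the number of channels entering the block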
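
The channel bookkeeping follows directly from the formula. A minimal sketch, with k0 = 64, k = 32 and a 6-layer block chosen as assumed example values (the function names are illustrative):

```python
def dense_layer_in_channels(k0, k, l):
    """Input channels of layer l (1-indexed) inside a dense block."""
    return k0 + k * (l - 1)


def dense_block_out_channels(k0, k, num_layers):
    """Channels after concatenating the block input with every layer's output."""
    return k0 + k * num_layers


print([dense_layer_in_channels(64, 32, l) for l in range(1, 7)])
# [64, 96, 128, 160, 192, 224]
print(dense_block_out_channels(64, 32, 6))  # 256
```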

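Advantages of DenseNet:
1. Mitigates vanishing gradients: every layer has a short path to both the loss and the original input
2. Stronger feature propagation: shallow features are passed directly to deep layers
3. Encourages feature reuse: many layers can share the same low-level features
4. Far fewer parameters: the growth rate is small and layers need not relearn redundant features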

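1.6 EfficientNet (2019): compound scaling

EfficientNet systematically studies how to scale a network's depth, width, and input resolution in a balanced way:

Compound scaling formula:
depth:      d = α^φ
width:      w = β^φ
resolution: r = γ^φ

Constraint: α · β^2 · γ^2 ≈ 2
            (FLOPs grow linearly with depth and quadratically with width and resolution,
             so each increment of φ roughly doubles the FLOPs)

The baseline network, EfficientNet-B0, is found by neural architecture search (NAS);
compound scaling then yields B1-B7.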
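
A quick numerical check of the constraint. The coefficients α = 1.2, β = 1.1, γ = 1.15 are the values reported in the EfficientNet paper; they are not given in the text above and are used here only as an assumed example, and the function names are illustrative.

```python
# Coefficients reported in the EfficientNet paper (assumed example values)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """FLOPs scale with depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


print(round(flops_multiplier(1), 2))  # 1.92
print(round(flops_multiplier(2), 2))  # 3.69
```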

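The key building block of EfficientNet-B0 is MBConv (mobile inverted bottleneck):
1. 1x1 expansion convolution (Expand)
2. Depthwise 3x3 or 5x5 convolution
3. SE (Squeeze-and-Excitation) attention
4. 1x1 projection convolution (Project)
5. Residual connection (only if stride == 1 and the shapes match)

SE module (channel attention):
1. Global average pooling → (C, 1, 1)
2. FC → ReLU → FC → Sigmoid → (C, 1, 1)
3. Multiply with the original feature map → reweights each channel by its importance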

1.7 ConvNeXt (2022): modernizing ResNet

ConvNeXt brings design ideas from Vision Transformers into CNNs, allowing a conventional convolutional network to reach Transformer-level performance:

Five modernizations that ConvNeXt applies to ResNet:

1. Stage compute ratio: ResNet's (3,4,6,3) → (3,3,9,3) (following Swin Transformer)
2. Depthwise convolution + 1x1 convolution (analogous to a Transformer's spatial mixing + channel mixing)
3. LayerNorm instead of BatchNorm (as in Transformers)
4. GELU instead of ReLU
5. Larger kernel size (7x7 vs. 3x3)

ConvNeXt Block:
  d7x7 → LayerNorm → 1x1 → GELU → 1x1 → Layer Scale → Add Residual

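Layer Scale is a learnable per-channel scaling factor on the residual branch (a technique introduced in CaiT).
It is initialized close to zero (1e-6 in ConvNeXt), which helps stabilize the training of deep networks.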

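II. Training Techniques

2.1 Label Smoothing

import torch
import torch.nn as nn
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, smoothing=0.1):
    """
    Label smoothing: turn the one-hot target into a soft target.
    y_smooth = (1 - smoothing) * y_onehot + smoothing / num_classes
    """
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Build the soft targets; they need no gradient
    with torch.no_grad():
        # Every class receives smoothing / num_classes ...
        true_dist = torch.full_like(log_probs, smoothing / num_classes)
        # ... and the true class receives an additional (1 - smoothing)
        true_dist.scatter_(1, targets.unsqueeze(1),
                           1.0 - smoothing + smoothing / num_classes)

    loss = torch.sum(-true_dist * log_probs, dim=-1).mean()
    return loss

# Or use the PyTorch built-in, which follows the same formula
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)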
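
As a worked example of the formula y_smooth = (1 - α) * y_onehot + α / K, the sketch below builds the soft target for a single sample in plain Python. The function name and the example values (5 classes, α = 0.1) are illustrative.

```python
def smooth_targets(true_class, num_classes, smoothing=0.1):
    """Soft target: smoothing / K on every class, plus (1 - smoothing) on the true class."""
    dist = [smoothing / num_classes] * num_classes
    dist[true_class] += 1.0 - smoothing
    return dist


dist = smooth_targets(true_class=2, num_classes=5, smoothing=0.1)
print([round(p, 2) for p in dist])  # [0.02, 0.02, 0.92, 0.02, 0.02]
```

The probabilities still sum to 1, and the true class keeps 1 - α + α / K rather than 1 - α.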

2.2 Mixup and CutMix Data Augmentation

import numpy as np
import torch

def mixup_data(x, y, alpha=1.0):
    """Mixup: linearly blend pairs of samples within a batch."""
    # lam ~ Beta(alpha, alpha); alpha <= 0 disables mixing
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1.0

    batch_size = x.size(0)
    # Random permutation that pairs each sample with a partner
    index = torch.randperm(batch_size, device=x.device)

    mixed_x = lam * x + (1 - lam) * x[index]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    """Loss for Mixup: blend the losses against both label sets."""
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# CutMix: replace a rectangular region with the same region of another image
def cutmix_data(x, y, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    rand_index = torch.randperm(x.size(0), device=x.device)

    # Rectangle size: its area is about (1 - lam) of the image
    _, _, h, w = x.shape
    cut_rat = np.sqrt(1.0 - lam)
    cut_w = int(w * cut_rat)
    cut_h = int(h * cut_rat)

    # Random center of the rectangle
    cx = np.random.randint(w)
    cy = np.random.randint(h)

    # Clip the rectangle to the image borders
    x1 = max(cx - cut_w // 2, 0)
    y1 = max(cy - cut_h // 2, 0)
    x2 = min(cx + cut_w // 2, w)
    y2 = min(cy + cut_h // 2, h)

    mixed_x = x.clone()
    mixed_x[:, :, y1:y2, x1:x2] = x[rand_index, :, y1:y2, x1:x2]

    # Recompute lambda from the actual (possibly clipped) area
    lam = 1 - ((x2 - x1) * (y2 - y1)) / (w * h)

    return mixed_x, y, y[rand_index], lam
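
The final lambda adjustment in CutMix matters because the rectangle is clipped at the image border. The sketch below repeats the box arithmetic in plain Python with a fixed center instead of a random one; the function name and the example values are illustrative.

```python
import math


def cutmix_box(w, h, lam, cx, cy):
    """Box corners and the lambda recomputed from the clipped box area."""
    cut_rat = math.sqrt(1.0 - lam)
    cut_w, cut_h = int(w * cut_rat), int(h * cut_rat)
    x1, y1 = max(cx - cut_w // 2, 0), max(cy - cut_h // 2, 0)
    x2, y2 = min(cx + cut_w // 2, w), min(cy + cut_h // 2, h)
    lam_adjusted = 1 - ((x2 - x1) * (y2 - y1)) / (w * h)
    return (x1, y1, x2, y2), lam_adjusted


# Box fully inside the image: the sampled lambda is preserved
print(cutmix_box(224, 224, 0.75, cx=112, cy=112))  # ((56, 56, 168, 168), 0.75)
# Box centered on the left border: half of it is clipped away
print(cutmix_box(224, 224, 0.75, cx=0, cy=112))    # ((0, 56, 56, 168), 0.875)
```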

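2.3 Exponential Moving Average (EMA)

from copy import deepcopy
import torch

class ModelEma:
    """Exponential moving average of the model weights, used at inference for more stable predictions."""
    def __init__(self, model, decay=0.9999, device=''):
        self.ema_model = deepcopy(model)
        self.ema_model.eval()
        self.decay = decay
        self.device = device
        if device:
            # Optionally keep the EMA copy on a separate device
            self.ema_model.to(device)

    @torch.no_grad()
    def update(self, model):
        """Call once per training step to update the EMA weights."""
        for ema_param, model_param in zip(
            self.ema_model.parameters(), model.parameters()
        ):
            if model_param.requires_grad:
                # ema = decay * ema + (1 - decay) * param
                ema_param.data.mul_(self.decay).add_(
                    model_param.data.to(ema_param.device), alpha=1 - self.decay
                )

        # Buffers (e.g. BN running_mean/var) are copied, not averaged
        for ema_buffer, model_buffer in zip(
            self.ema_model.buffers(), model.buffers()
        ):
            ema_buffer.copy_(model_buffer)

# Usage example (epochs, train_loader, train_step and validate are assumed to be defined elsewhere)
ema = ModelEma(model, decay=0.9999)
for epoch in range(epochs):
    for x, y in train_loader:
        loss = train_step(model, x, y)
        ema.update(model)

# Use ema.ema_model for inference
acc = validate(ema.ema_model, val_loader)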
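
How quickly the average responds depends strongly on the decay. The scalar sketch below applies the same update rule to a constant target, starting from zero; the function name and step counts are illustrative. With decay d the EMA reaches 1 - d^n of the target after n steps, so the averaging window is on the order of 1 / (1 - d) steps.

```python
def ema_after(steps, decay, target=1.0, start=0.0):
    """Apply ema = decay * ema + (1 - decay) * target for the given number of steps."""
    ema = start
    for _ in range(steps):
        ema = decay * ema + (1 - decay) * target
    return ema


print(round(ema_after(10_000, 0.9999), 3))  # 0.632
print(round(ema_after(50_000, 0.9999), 3))  # 0.993
```

With decay = 0.9999, the average has covered only about 63% of the distance to the target after 10,000 steps, which is why a smaller decay is often preferred for short training runs.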

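2.4 Cosine Learning Rate Decay + Warmup

import math
import torch

def cosine_scheduler(optimizer, warmup_epochs, total_epochs,
                     base_lr, min_lr=0.0):
    """Cosine annealing with linear warmup.

    LambdaLR multiplies the optimizer's initial lr by the value returned from
    lr_lambda, so lr_lambda must return a factor, not an absolute learning rate.
    base_lr must equal the lr the optimizer was created with.
    """
    min_factor = min_lr / base_lr

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            # Linear warmup; +1 so the first epoch does not run with lr = 0
            return (epoch + 1) / warmup_epochs
        else:
            # Cosine annealing from 1 down to min_lr / base_lr
            progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
            return min_factor + (1 - min_factor) * 0.5 * (1 + math.cos(math.pi * progress))

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# The cosine scheduler in timm (finer control); step it with scheduler.step(epoch)
from timm.scheduler import CosineLRScheduler
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=total_epochs,
    lr_min=1e-6,
    warmup_t=5,
    warmup_lr_init=1e-6,
    warmup_prefix=True,  # warmup epochs are not counted against t_initial
)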
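
The schedule itself is easy to inspect without an optimizer. The sketch below computes the absolute learning rate per epoch for linear warmup followed by cosine decay. It assumes the warmup starts at base_lr / warmup_epochs rather than at zero; the function name and the example settings (5 warmup epochs, 90 epochs in total) are illustrative.

```python
import math


def lr_at_epoch(epoch, warmup_epochs, total_epochs, base_lr, min_lr=0.0):
    """Absolute learning rate for a given epoch (0-indexed)."""
    if epoch < warmup_epochs:
        # Linear warmup (assumption: starts at base_lr / warmup_epochs, not 0)
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to min_lr
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))


for e in (0, 4, 5, 47, 89):
    print(e, round(lr_at_epoch(e, 5, 90, base_lr=0.1, min_lr=1e-5), 5))
```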

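III. Object Detection

3.1 Two-stage detectors: evolution of the R-CNN family

R-CNN (2014):
1. Selective Search proposes ~2000 candidate regions
2. Each region is warped to a fixed size and passed through a CNN to extract features
3. SVM classification + bounding-box regression
Problem: ~2000 CNN forward passes per image, which is extremely slow (~13 s per image)

Fast R-CNN (2015):
1. The whole image passes through the CNN once to produce a feature map
2. Selective Search proposals are projected onto the feature map as RoIs
3. RoI Pooling converts RoIs of different sizes to a fixed size
4. FC + Softmax (classification) + bounding-box regression (localization)
Improvement: the CNN runs only once per image, and classification and box regression are trained jointly (proposals still come from Selective Search)

Faster R-CNN (2015):
Core innovation: the Region Proposal Network (RPN)
1. A CNN backbone extracts the feature map
2. The RPN slides a window over it → k anchors per location (3 scales × 3 aspect ratios)
3. RPN outputs: an objectness score + box offsets for each anchor
4. NMS keeps ~300 proposals
5. RoI Pooling → classification + regression
The RPN and the detection head share the backbone and can be trained jointly, end to end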
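
Step 4 relies on non-maximum suppression. Below is a minimal pure-Python sketch of greedy NMS on (x1, y1, x2, y2) boxes. The IoU threshold of 0.7 and the cap of 300 are assumed example settings, and the function names are illustrative; production code would normally use a vectorized library routine.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.7, top_n=300):
    """Greedy NMS: visit boxes by descending score, keep one only if it does
    not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) == top_n:
            break
    return keep


boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82 and is suppressed; the third does not overlap and is kept.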

RPN anchor design:
- scales: [128, 256, 512] (in pixels on the input image)
- ratios: [0.5, 1, 2] (aspect ratio)
- 9 anchors per location
- For a feature map of size H×W, there are H×W×9 anchors in total
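
The nine anchor shapes can be generated from the scales and ratios listed above. This sketch assumes each anchor keeps an area of scale^2 and that the ratio means width / height (implementations differ on this convention); the function names and the 38×50 feature map are illustrative.

```python
import math


def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1, 2)):
    """Return (width, height) for every scale/ratio pair, keeping area = scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = width / height (assumed convention)
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((round(w), round(h)))
    return anchors


def total_anchors(feat_h, feat_w, per_location=9):
    return feat_h * feat_w * per_location


print(make_anchors()[:3])     # [(91, 181), (128, 128), (181, 91)]
print(total_anchors(38, 50))  # 17100
```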

RPN loss:
L = L_cls(objectness) + λ * L_reg(bbox offsets for positive anchors)

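3.2 Single-stage detectors: YOLO

The core idea of YOLO: treat detection as a regression problem

YOLOv1:
- Divide the image into an S×S grid (e.g. 7×7)
- Each grid cell predicts B bounding boxes plus one set of C class probabilities
- Each bounding box: x, y, w, h, confidence
- Output tensor: S × S × (B*5 + C)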
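
As a concrete instance of the output shape, the sketch below uses S = 7, B = 2 and C = 20 (the 20 PASCAL VOC classes). B and C are assumed example values that the text above does not specify, and the function name is illustrative.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell holds B boxes of 5 numbers (x, y, w, h, confidence) plus C class scores."""
    return (S, S, B * 5 + C)


shape = yolo_v1_output_shape()
print(shape)                           # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])  # 1470
```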

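YOLOv3 improvements:
- Darknet-53 backbone (with ResNet-style residual blocks)
- Multi-scale prediction (an FPN-like structure: 13×13, 26×26, 52×52 for a 416×416 input)
- 3 anchors per scale (9 in total, obtained by K-means clustering)
- Sigmoid for class prediction (supports multi-label)

YOLOv8 (Ultralytics, 2023):
- CSP backbone (Cross Stage Partial connections)
- Anchor-free detection head
- Decoupled head (separate cls and reg branches)
- TaskAlignedAssigner for label assignment
- C2f module (Faster implementation of CSP Bottleneck with 2 convolutions)

YOLOv8 architecture components: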
1. Backbone: C2f + SPPF (Spatial Pyramid Pooling Fast)
2. Neck: PAN-FPN (Path Aggregation Network + Feature Pyramid Network)
3. Head: Decoupled Head (cls branch + reg branch)

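3.3 RetinaNet: Focal Loss for class imbalance

The core problem of single-stage detectors:
the huge number of easy negatives (background) dominates the loss,
making it hard for the model to focus on hard examples.

Focal Loss formula:
FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)

- p_t: predicted probability of the true class
- γ: focusing parameter (typically γ=2)
- α_t: class-balancing weight (typically α=0.25)

When p_t is large (easy sample) → (1-p_t)^γ is close to 0 → the loss becomes very small
When p_t is small (hard sample) → (1-p_t)^γ is close to 1 → the loss is almost unchanged

Intuition: down-weight easy samples so that the model focuses on hard ones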
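
Plugging numbers into the formula shows how strong the down-weighting is. This is a minimal scalar sketch with γ = 2 and α_t = 0.25; the function names are illustrative.

```python
import math


def cross_entropy(p_t):
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha_t=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)"""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p_t in (0.9, 0.5, 0.1):
    ce, fl = cross_entropy(p_t), focal_loss(p_t)
    print(f"p_t={p_t}: CE={ce:.4f}  FL={fl:.6f}  FL/CE={fl / ce:.4f}")
```

The ratio FL/CE equals α_t * (1 - p_t)^γ, so relative to cross-entropy a sample with p_t = 0.9 is scaled by 0.0025 and one with p_t = 0.1 by 0.2025, an 81-fold difference in relative weight.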

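IV. Semantic Segmentation

4.1 FCN (Fully Convolutional Network)

FCN is the pioneering work in deep-learning-based semantic segmentation:

1. Convert the FC layers of a classification network (e.g. VGG) into convolutions (fc6 becomes a 7x7 convolution, fc7 a 1x1 convolution)
2. Upsample back to the input size with transposed convolutions (also called deconvolution; ConvTranspose in PyTorch)
3. Fuse features from different scales with skip connections:
   FCN-32s: upsample conv7 directly by 32x (coarse)
   FCN-16s: upsample conv7 by 2x + pool4 → upsample by 16x
   FCN-8s:  + pool3 → upsample by 8x (finer)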

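4.2 U-Net: a symmetric encoder-decoder

U-Net architecture:

Encoder (contracting path):
- Each block: 2 × (Conv3 + ReLU) + MaxPool2x2
- Channels double at each level (64 → 128 → 256 → 512 → 1024)

Decoder (expanding path):
- Each block: UpConv2x2 + Concat(skip connection) + 2 × (Conv3 + ReLU)
- Channels halve at each level (1024 → 512 → 256 → 128 → 64)

Skip connections:
- Concatenate the encoder feature map of the same level with the upsampled decoder feature map along the channel dimension
- This helps recover fine spatial detail lost during downsampling

Typical use case: medical image segmentation (little data, precise boundaries required)

U-Net implementation in PyTorch (a common variant with padded convolutions and BatchNorm, so the output has the same spatial size as the input; the original paper uses unpadded convolutions and no BatchNorm):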

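import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    """Basic U-Net convolution block: 2 × (Conv3x3 + BatchNorm + ReLU)"""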
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class UNet(nn.Module):
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()
        # Encoder
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.enc4 = DoubleConv(256, 512)
        # Bottleneck
        self.bottleneck = DoubleConv(512, 1024)
        # Decoder
        self.up4 = nn.ConvTranspose2d(1024, 512, 2, stride=2)
        self.dec4 = DoubleConv(1024, 512)
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)
        # Output
        self.out = nn.Conv2d(64, out_channels, 1)

    def forward(self, x):
        # H and W must be divisible by 16 so that the skip connections line up
        e1 = self.enc1(x)                                        # (N, 64, H, W)
        e2 = self.enc2(F.max_pool2d(e1, 2))                     # (N, 128, H/2, W/2)
        e3 = self.enc3(F.max_pool2d(e2, 2))                     # (N, 256, H/4, W/4)
        e4 = self.enc4(F.max_pool2d(e3, 2))                     # (N, 512, H/8, W/8)

        b = self.bottleneck(F.max_pool2d(e4, 2))                 # (N, 1024, H/16, W/16)

        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))      # (N, 512, H/8, W/8)
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))     # (N, 256, H/4, W/4)
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))     # (N, 128, H/2, W/2)
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))     # (N, 64, H, W)

        return self.out(d1)

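4.3 DeepLab: atrous convolution

Core techniques of DeepLab:

1. Atrous/dilated convolution:
   - Insert holes between kernel elements (dilation rate r)
   - r=1: standard convolution
   - r=2: one hole between adjacent kernel elements (a 3x3 kernel covers a 5x5 window)
   - Enlarges the receptive field without adding parameters or reducing resolution

2. Atrous Spatial Pyramid Pooling (ASPP):
   - Apply several atrous convolutions with different dilation rates in parallel
   - In DeepLabv3: one 1x1 convolution and three 3x3 convolutions with r = [6, 12, 18]
   - Plus an image-level branch (global average pooling)
   - Concatenate all branches, then reduce the channels with a 1x1 convolution

3. DeepLabv3+ improvements:
   - Adds a decoder module (with a U-Net-like skip connection)
   - Uses depthwise separable convolutions to reduce computation
   - Xception backbone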
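
The window covered by a dilated kernel follows from k_eff = k + (k - 1)(r - 1). A minimal sketch (the function name is illustrative) that evaluates it for a 3x3 kernel at several dilation rates:

```python
def effective_kernel(kernel, dilation):
    """Side length of the window covered by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


print([effective_kernel(3, r) for r in (1, 2, 6, 12, 18)])  # [3, 5, 13, 25, 37]
```

The number of weights stays at 9 in every case; only the spacing between the sampled positions changes.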

V. A Complete PyTorch Image Classification Pipeline

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler
import torchvision
from torchvision import transforms, datasets, models
from tqdm import tqdm
import numpy as np

# === Configuration ===
config = {
    'data_dir': './data',
    'num_classes': 100,
    'batch_size': 128,
    'epochs': 90,
    'lr': 0.1,
    'momentum': 0.9,
    'weight_decay': 1e-4,
    'num_workers': 8,
    'seed': 42,
}

torch.manual_seed(config['seed'])
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# === Data preprocessing ===
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

train_dataset = datasets.ImageFolder(f'{config["data_dir"]}/train', train_transform)
val_dataset = datasets.ImageFolder(f'{config["data_dir"]}/val', val_transform)

train_loader = DataLoader(
    train_dataset, batch_size=config['batch_size'], shuffle=True,
    num_workers=config['num_workers'], pin_memory=True, drop_last=True
)
val_loader = DataLoader(
    val_dataset, batch_size=config['batch_size'], shuffle=False,
    num_workers=config['num_workers'], pin_memory=True
)

# === Model ===
model = models.resnet50(weights=None, num_classes=config['num_classes'])
model = model.to(device)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.SGD(
    model.parameters(),
    lr=config['lr'],
    momentum=config['momentum'],
    weight_decay=config['weight_decay'],
    nesterov=True
)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=config['epochs']
)
scaler = GradScaler(enabled=(device.type == 'cuda'))  # loss scaling is a no-op on CPU

# === Training loop ===
def train_one_epoch(model, loader, criterion, optimizer, scaler, epoch):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    pbar = tqdm(loader, desc=f'Epoch {epoch}')
    for images, labels in pbar:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad(set_to_none=True)

        with autocast(enabled=(device.type == 'cuda')):  # mixed precision only on GPU
            outputs = model(images)
            loss = criterion(outputs, labels)

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)

        pbar.set_postfix({
            'loss': f'{loss.item():.3f}',
            'acc': f'{100.*correct/total:.1f}%'
        })

    return running_loss / len(loader), 100. * correct / total

@torch.no_grad()
def validate(model, loader, criterion):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in tqdm(loader, desc='Validation'):
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)

    return running_loss / len(loader), 100. * correct / total

# === Main loop ===
best_acc = 0.0
for epoch in range(config['epochs']):
    train_loss, train_acc = train_one_epoch(
        model, train_loader, criterion, optimizer, scaler, epoch
    )
    scheduler.step()

    val_loss, val_acc = validate(model, val_loader, criterion)

    print(f'Epoch {epoch}: '
          f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | '
          f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%')

    if val_acc > best_acc:
        best_acc = val_acc
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'best_acc': best_acc,
        }, 'best_model.pth')
        print(f'  Saved best model with accuracy {best_acc:.2f}%')

print(f'Training completed. Best accuracy: {best_acc:.2f}%')

# === Inference ===
@torch.no_grad()
def predict(image_path, model, transform, class_names, top_k=5):
    from PIL import Image

    model.eval()
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)

    output = model(input_tensor)
    probs = torch.nn.functional.softmax(output, dim=1)

    top_probs, top_indices = probs.topk(top_k, dim=1)

    results = []
    for prob, idx in zip(top_probs[0].cpu().numpy(), top_indices[0].cpu().numpy()):
        results.append({
            'class': class_names[idx],
            'probability': float(prob)
        })

    return results

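From AlexNet's ice-breaking debut, through ResNet's residual revolution, to EfficientNet's compound scaling and ConvNeXt's modernization of the CNN, image recognition over the past decade has shifted from "hand-designed features" to "automatic feature learning" and then to "automated architecture search". Understanding the design ideas behind these classic architectures, along with the training techniques, is an essential foundation for going deeper into computer vision. With modern frameworks such as PyTorch and TensorFlow, this theory can be turned into working applications efficiently.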