1.简介
Convolutional neural networks (CNNs) have proven to be effective models for tackling a variety of visual tasks [21, 27, 33, 45]. For each convolutional layer, a set of filters are learned to express local spatial connectivity patterns along input channels. In other words, convolutional filters are expected to be informative combinations by fusing spatial and channel-wise information together within local receptive fields. By stacking a series of convolutional layers interleaved with non-linearities and down sampling, CNNs are capable of capturing hierarchical patterns with global receptive fields as powerful image descriptions.
CNN被证明是用于各类视觉任务的有效模型[21, 27, 33, 45].。每一个卷积层通过输入通道来学习一组滤波器来表示局部空间连接模式。换句话说,卷积过滤器希望通过在局部感受野内融合空间信息和通道信息来实现幼小的信息融合。通过堆积一系列的交错的非线性和下采样的卷积层,CNN能够捕捉到具有全局感受野的分层模式作为图像最强有力的图像描述。
Recent work has demonstrated that the performance of networks can be improved by explicitly embedding learning mechanisms that help capture spatial correlations without requiring additional supervision. One such approach was popularised by the Inception architectures [16, 43], which showed that the network can achieve competitive accuracy by embedding multi-scale processes in its modules. More recent work has sought to better model spatial dependence [1, 31] and incorporate spatial attention [19].
近期的工作证明,网络性能的提升可以通过显式的嵌入式学习机制来帮助捕获空间的相关性而不需要额外的监督。其中一种方法流行的方法是Inception模块,这个方法展示了该网络通过在模块中嵌入多尺度处理的方法来达到有竞争力的准确度。最近更多的工作是去寻找最好的模型空间相关性和加入空间注意力。
In this paper, we investigate a different aspect of architectural design - the channel relationship, by introducing a new architectural unit, which we term the “Squeeze-and-Excitation” (SE) block. Our goal is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.
在这篇论文中,我们研究了不同的结构设计方式—— 通道关系,通过引入一个我们称之为“Squeeze-and-Excitation(SE)” block的新结构单。我们的目的是通过显式模型相关性和卷积特征的通道信息来提升网络的表示能力。要做到这一点,我们提出了一个允许网络去执行特征重新校正的机制,它可以通过学习使用全局信息,然后有选择性的去强调特征和抑制不太有用的特征。
The basic structure of the SE building block is illustrated in Fig. 1. For any given transformation \(F_{tr}:X \to U,X \in \mathbb{R}^{H' \times W' \times C'},U \in \mathbb{R}^{H \times W \times C}\),(e.g. a convolution or a set of convolutions), we can construct a corresponding SE block to perform feature recalibration as follows. The features \(U\) are first passed through a squeeze operation, which aggregates the feature maps across spatial dimensions \(H \times W\) to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. This is followed by an excitation peration, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps \(U\) are then reweighted to generate the output of the SE block which can then be fed directly into subsequent layers.
SE结构块的基础结构如图1所示。对于给定的所有转换\(F_{tr}:X \to U,X \in \mathbb{R}^{H' \times W' \times \C'},U \in \mathbb{R}^{H \times W \times C}\),(例如,一个卷积或者一个卷积组),我们允许构建一个类似的SE块去执行特征重新校正。特征\(U\)第一次通过挤压操作(squeeze operation),这个操作将会合并计算特征映射通过\(H \times W\)的维度空间所产生的通道信息描述块。这个信息描述块嵌入了通道特征响应的全局分布,能够是从网络的全局感受野的信息被低层网络所利用。接下来通过刺激操作(excitation operation),
====================
未完