14 Matching Annotations
- Feb 2021
-
arxiv.org
-
A recent extension [8] fuses the spatial and flow streams after the last network convolutional layer, showing some improvement on HMDB while requiring less test time augmentation (snapshot sampling). Our implementation follows this paper approximately using Inception-V1. The inputs to the network are 5 consecutive RGB frames sampled 10 frames apart, as well as the corresponding optical flow snippets. The spatial and motion features before the last average pooling layer of Inception-V1 (5 × 7 × 7 feature grids, corresponding to time, x and y dimensions) are passed through a 3 × 3 × 3 3D convolutional layer with 512 output channels, followed by a 3 × 3 × 3 3D max-pooling layer and through a final fully connected layer. The weights of these new layers are initialized with Gaussian noise.
Two-Stream Networks
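A minimal PyTorch sketch of the fusion head described in this excerpt. Assumptions not in the excerpt: each Inception-V1 stream contributes 1024 channels (2048 after concatenation), the output is Kinetics-sized (400 classes), and a global average reduces the pooled grid before the final fully connected layer.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Late-fusion head: concatenated spatial/flow 5 x 7 x 7 feature grids
    are mixed by a 3x3x3 conv (512 channels), max-pooled, and classified."""
    def __init__(self, in_channels=2048, num_classes=400):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, 512, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=3, stride=2)
        self.fc = nn.Linear(512, num_classes)
        # New layers are initialized with Gaussian noise, per the excerpt.
        for m in (self.conv, self.fc):
            nn.init.normal_(m.weight, std=0.01)
            nn.init.zeros_(m.bias)

    def forward(self, spatial_feats, flow_feats):
        x = torch.cat([spatial_feats, flow_feats], dim=1)  # fuse the streams
        x = self.pool(self.conv(x))
        x = x.mean(dim=(2, 3, 4))  # assumed global average over t, y, x
        return self.fc(x)
```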
-
For this paper we implemented a small variation of C3D [31], which has 8 convolutional layers, 5 pooling layers and 2 fully connected layers at the top. The inputs to the model are short 16-frame clips with 112 × 112-pixel crops as in the original implementation. Differently from [31] we used batch normalization after all convolutional and fully connected layers. Another difference to the original model is in the first pooling layer: we use a temporal stride of 2 instead of 1, which reduces the memory footprint and allows for bigger batches; this was important for batch normalization (especially after the fully connected layers, where there is no weight tying). Using this stride we were able to train with 15 videos per batch per GPU using standard K40 GPUs.
C3D
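A PyTorch sketch of this C3D variant, under stated assumptions: channel widths follow the original C3D, the first pool becomes 2 × 2 × 2 to realize the temporal stride of 2, and the head is two hidden fc layers plus a classifier output (the excerpt's count of "2 fully connected layers" is ambiguous about the classifier).

```python
import torch.nn as nn

def conv_bn(cin, cout):
    # The variant adds batch normalization after every conv layer.
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class C3DVariant(nn.Module):
    """8 conv layers, 5 pools, batch-normalized fc head; input is a
    16-frame clip of 112 x 112 crops, i.e. (B, 3, 16, 112, 112)."""
    def __init__(self, num_classes=400):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn(3, 64),
            nn.MaxPool3d(2),                 # temporal stride 2 (the variant)
            conv_bn(64, 128),
            nn.MaxPool3d(2),
            conv_bn(128, 256), conv_bn(256, 256),
            nn.MaxPool3d(2),
            conv_bn(256, 512), conv_bn(512, 512),
            nn.MaxPool3d(2),
            conv_bn(512, 512), conv_bn(512, 512),
            nn.MaxPool3d((1, 2, 2)),         # time is already down to 1 here
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # 512 x 1 x 3 x 3 = 4608 features
            nn.Linear(4608, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```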
-
The model is trained using cross-entropy losses on the outputs at all time steps. During testing we consider only the output on the last frame. Input video frames are subsampled by keeping one out of every 5, from an original 25 frames-per-second stream. The full temporal footprint of all models is given in Table 1.
ConvNet+LSTM
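A small sketch of this train/test asymmetry, assuming per-step logits of shape (batch, time, classes) from a hypothetical ConvNet+LSTM; the clip label is simply repeated across time steps for the training loss.

```python
import torch
import torch.nn.functional as F

def per_step_loss(logits, labels):
    """Cross-entropy on the outputs at all time steps (training)."""
    b, t, c = logits.shape
    step_labels = labels.unsqueeze(1).expand(b, t)  # same label at every step
    return F.cross_entropy(logits.reshape(b * t, c), step_labels.reshape(b * t))

def predict(logits):
    """Testing considers only the output on the last frame."""
    return logits[:, -1].argmax(dim=-1)

# Frames would be subsampled upstream, e.g. video[:, :, ::5] for 1-in-5.
logits = torch.randn(4, 10, 400, requires_grad=True)
labels = torch.randint(0, 400, (4,))
per_step_loss(logits, labels).backward()
print(predict(logits.detach()))
```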
-
In this paper we compare and study a subset of models that span most of this space. Among 2D ConvNet methods, we consider ConvNets with LSTMs on top [5, 37], and two-stream networks with two different types of stream fusion [8, 27]. We also consider a 3D ConvNet [14, 30]: C3D.
comparison
-
The model, termed a "Two-Stream Inflated 3D ConvNet" (I3D), builds upon state-of-the-art image classification architectures, but inflates their filters and pooling kernels (and optionally their parameters) into 3D, leading to very deep, naturally spatio-temporal classifiers. An I3D model based on Inception-v1 [13] obtains performance far exceeding the state of the art, after pre-training on Kinetics.
concept
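The inflation trick itself is compact: repeat a pretrained 2D kernel N times along a new temporal axis and rescale by 1/N, so a static ("boring") video reproduces the 2D model's activations. A minimal sketch:

```python
import torch

def inflate_conv2d_weight(w2d, time_depth):
    """Turn a 2D kernel (out, in, kh, kw) into a 3D kernel
    (out, in, time_depth, kh, kw), preserving activations on static video."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_depth, 1, 1)
    return w3d / time_depth

w2d = torch.randn(64, 3, 7, 7)       # e.g. an ImageNet-pretrained stem filter
print(inflate_conv2d_weight(w2d, 7).shape)  # torch.Size([64, 3, 7, 7, 7])
```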
-
Our experimental strategy is to reimplement a number of representative neural network architectures from the literature, and then analyze their transfer behavior by first pre-training each one on Kinetics and then fine-tuning each on HMDB-51 and UCF-101.
Initial goal of the experiment
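A minimal sketch of the transfer recipe. The `fc` attribute is an illustrative assumption about the model's layout, not something from the paper: after Kinetics pre-training, the classification head is replaced and the network fine-tuned on the smaller dataset.

```python
import torch.nn as nn

def prepare_for_finetuning(model, num_target_classes):
    """Swap the Kinetics head for a fresh one sized for the target dataset
    (HMDB-51: 51 classes, UCF-101: 101), keeping pretrained backbone weights."""
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)
    return model
```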
-
-
arxiv.org
-
we train and evaluate models with clips of 8 frames (T = 8) by skipping every other frame (all videos are pre-processed to 30 fps, so the newly-formed clips are effectively at 15 fps)
augmentation
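The sampling scheme reduces to strided frame indices; a trivial sketch with names of my own choosing:

```python
def sample_clip_indices(start, clip_len=8, stride=2):
    """Indices of a T=8 clip taken by skipping every other frame of a
    30 fps video, giving an effectively 15 fps clip."""
    return [start + i * stride for i in range(clip_len)]

print(sample_clip_indices(0))  # [0, 2, 4, 6, 8, 10, 12, 14]
```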
-
Base architecture. We use ResNet3D, presented in Table 1, as our base architecture for most of our ablation experiments in this section. More specifically, our model takes clips with a size of T × 224 × 224, where T = 8 is the number of frames and 224 is the height and width of the cropped frame. Two spatial downsampling layers (1 × 2 × 2) are applied at conv1 and at pool1, and three spatiotemporal downsampling layers (2 × 2 × 2) are applied at conv3_1, conv4_1 and conv5_1 via convolutional striding. A global spatiotemporal average pooling with kernel size T/8 × 7 × 7 is applied to the final convolutional tensor, followed by a fully-connected (fc) layer performing the final classification.
260K videos
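A shape-level sketch of this downsampling schedule; plain convolutions stand in for the residual stages, which the excerpt does not spell out.

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    # conv1: spatial-only 1x2x2 downsampling
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
    # pool1: spatial-only 1x2x2 downsampling
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
)
# Stand-ins for conv3_1, conv4_1, conv5_1 with 2x2x2 convolutional striding.
stages = nn.Sequential(
    nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.Conv3d(128, 256, kernel_size=3, stride=2, padding=1),
    nn.Conv3d(256, 512, kernel_size=3, stride=2, padding=1),
)
x = torch.randn(1, 3, 8, 224, 224)        # T = 8
feats = stages(stem(x))                   # (1, 512, T/8, 7, 7)
logits = nn.Linear(512, 400)(feats.mean(dim=(2, 3, 4)))  # global avg pool + fc
print(feats.shape, logits.shape)
```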
-
Interaction-reduced channel-separated bottleneck block is derived from the preserved bottleneck block by removing the extra 1 × 1 × 1 convolution. This yields the depthwise bottleneck block shown in Figure 2(c). Note that the initial and final 1 × 1 × 1 convolutions (usually interpreted respectively as projecting into a lower-dimensional subspace and then projecting back to the original dimensionality) are now the only mechanism left for channel interactions. This implies that the complete block shown in (c) has a reduced number of channel interactions compared with those shown in (a) or (b). We call this design an interaction-reduced channel-separated bottleneck block and the resulting architecture an interaction-reduced channel-separated network (ir-CSN).
interaction-reduced channel-separated block
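A sketch of the ir-CSN bottleneck as described, omitting the residual connection and normalization for brevity; all channel interaction now sits in the two 1 × 1 × 1 convolutions.

```python
import torch.nn as nn

class IrCSNBottleneck(nn.Module):
    """Depthwise bottleneck: 1x1x1 reduce, depthwise 3x3x3, 1x1x1 expand."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, bottleneck, kernel_size=1),
            nn.Conv3d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=bottleneck),  # depthwise: no channel mixing here
            nn.Conv3d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):
        return self.block(x)
```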
-
Interaction-preserved channel-separated bottleneck block is obtained from the standard bottleneck block (Figure 2(a)) by replacing the 3 × 3 × 3 convolution in (a) with a 1 × 1 × 1 traditional convolution and a 3 × 3 × 3 depthwise convolution (shown in Figure 2(b)). This block reduces the parameters and FLOPs of the traditional 3 × 3 × 3 convolution significantly, but preserves all channel interactions via a newly-added 1 × 1 × 1 convolution. We call this an interaction-preserved channel-separated bottleneck block and the resulting architecture an interaction-preserved channel-separated network (ip-CSN).
interaction-preserved channel-separated network
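A matching sketch of the ip-CSN bottleneck: the standard 3 × 3 × 3 convolution is split into a 1 × 1 × 1 convolution (which preserves the channel interactions) followed by a 3 × 3 × 3 depthwise convolution; residual connection and norms again omitted.

```python
import torch.nn as nn

class IpCSNBottleneck(nn.Module):
    """Interaction-preserved bottleneck: the added 1x1x1 conv keeps all
    channel interactions; the depthwise 3x3x3 does local filtering only."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, bottleneck, kernel_size=1),
            nn.Conv3d(bottleneck, bottleneck, kernel_size=1),  # interactions
            nn.Conv3d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=bottleneck),                      # depthwise
            nn.Conv3d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):
        return self.block(x)
```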
-
These reductions occur because each filter in a group receives input from only a fraction 1/G of the channels from the previous layer. In other words, channel grouping restricts feature interaction: only channels within a group can interact.
reductions by grouping
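The 1/G factor is easy to verify numerically via the `groups` argument of a convolution:

```python
import torch.nn as nn

c, g = 64, 4
dense = nn.Conv3d(c, c, kernel_size=3, bias=False)
grouped = nn.Conv3d(c, c, kernel_size=3, groups=g, bias=False)
n_dense = sum(p.numel() for p in dense.parameters())      # 64*64*27
n_grouped = sum(p.numel() for p in grouped.parameters())  # 64*(64/4)*27
print(n_dense / n_grouped)  # 4.0, i.e. exactly G
```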
-
Conventional convolution is implemented with dense connections, i.e., each convolutional filter receives input from all channels of its previous layer, as in Figure 1(a). However, in order to reduce the computational cost and model size, these connections can be sparsified by grouping convolutional filters into subsets.
conventional convolution
-
- Jan 2021
-
arxiv.org
-
ARTNet [34] decouples spatial and temporal modeling into two parallel branches. Similarly, 3D convolutions can also be decomposed into a Pseudo-3D convolutional block as in P3D [25] or factorized convolutions as in R(2+1)D [32] or S3D [40]. 3D group convolution was also applied to video classification in ResNeXt [16] and Multi-Fiber Networks [5] (MFNet).
decomposition of model
-
P3D [25], R(2+1)D [32], and S3D [40]. In these architectures, a 3D convolution is replaced with a 2D convolution (in space) followed by a 1D convolution (in time). This factorization can be leveraged to increase accuracy and/or to reduce computation.
3D convolution architectures
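A sketch of this factorization. The intermediate width `mid` is illustrative; R(2+1)D chooses it so the factorized block roughly matches the parameter count of the full 3D convolution.

```python
import torch.nn as nn

def factorized_3d_conv(cin, cout, mid):
    """Replace a 3x3x3 conv with a 1x3x3 spatial conv followed by a
    3x1x1 temporal conv, with a nonlinearity in between (R(2+1)D style)."""
    return nn.Sequential(
        nn.Conv3d(cin, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1)),   # space
        nn.ReLU(inplace=True),
        nn.Conv3d(mid, cout, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # time
    )
```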
-