    1. A recent extension [8] fuses the spatial and flow streams after the last network convolutional layer, showing some improvement on HMDB while requiring less test-time augmentation (snapshot sampling). Our implementation follows this paper approximately using Inception-V1. The inputs to the network are 5 consecutive RGB frames sampled 10 frames apart, as well as the corresponding optical flow snippets. The spatial and motion features before the last average pooling layer of Inception-V1 (5×7×7 feature grids, corresponding to time, x and y dimensions) are passed through a 3×3×3 3D convolutional layer with 512 output channels, followed by a 3×3×3 3D max-pooling layer and through a final fully connected layer. The weights of these new layers are initialized with Gaussian noise.

      Two-Stream Networks

    2. For this paper we implemented a small variation of C3D [31], which has 8 convolutional layers, 5 pooling layers and 2 fully connected layers at the top. The inputs to the model are short 16-frame clips with 112×112-pixel crops, as in the original implementation. Unlike [31], we used batch normalization after all convolutional and fully connected layers. Another difference from the original model is in the first pooling layer: we use a temporal stride of 2 instead of 1, which reduces the memory footprint and allows for bigger batches. This was important for batch normalization (especially after the fully connected layers, where there is no weight tying). Using this stride we were able to train with 15 videos per batch per GPU using standard K40 GPUs.


    3. The model is trained using cross-entropy losses on the outputs at all time steps. During testing we consider only the output on the last frame. Input video frames are subsampled by keeping one out of every 5, from an original 25 frames-per-second stream. The full temporal footprint of all models is given in Table 1.

      ConvNet + LSTM

    4. In this paper we compare and study a subset of models that span most of this space. Among 2D ConvNet methods, we consider ConvNets with LSTMs on top [5, 37], and two-stream networks with two different types of stream fusion [8, 27]. We also consider a 3D ConvNet [14, 30]: C3D


    5. The model, termed a “Two-Stream Inflated 3D ConvNet” (I3D), builds upon state-of-the-art image classification architectures, but inflates their filters and pooling kernels (and optionally their parameters) into 3D, leading to very deep, naturally spatio-temporal classifiers. An I3D model based on Inception-v1 [13] obtains performance far exceeding the state-of-the-art, after pre-training on Kinetics.


    6. Our experimental strategy is to reimplement a number of representative neural network architectures from the literature, and then analyze their transfer behavior by first pre-training each one on Kinetics and then fine-tuning each on HMDB-51 and UCF-101.

      Initial goal of the experiment
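
The frame sampling described in items 1 and 3 above can be made concrete with a small sketch. The helper names `sampled_indices` and `effective_fps`, and the zero-based frame indexing, are assumptions made for illustration; they are not taken from the original implementation.

```python
def sampled_indices(num_frames, step, start=0):
    """Return source-frame indices of `num_frames` frames taken `step` apart."""
    if num_frames <= 0 or step <= 0:
        raise ValueError("num_frames and step must be positive")
    return [start + i * step for i in range(num_frames)]


def effective_fps(source_fps, keep_every):
    """Frame rate that remains after keeping one frame out of every `keep_every`."""
    if keep_every <= 0:
        raise ValueError("keep_every must be positive")
    return source_fps / keep_every


# Item 1: 5 RGB frames sampled 10 frames apart (indexing from 0 is assumed).
fused_indices = sampled_indices(num_frames=5, step=10)
# Number of source frames between the first and last sampled frame, inclusive.
fused_span = fused_indices[-1] - fused_indices[0] + 1

# Item 3: keep one frame out of every 5 from a 25 frames-per-second stream.
lstm_fps = effective_fps(source_fps=25, keep_every=5)

print(fused_indices, fused_span, lstm_fps)
```

Under these assumptions the five RGB frames cover a window of 41 source frames, and the LSTM input runs at 5 frames per second.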

    1. We train and evaluate models with clips of 8 frames (T = 8) by skipping every other frame (all videos are pre-processed to 30 fps, so the newly formed clips are effectively at 15 fps).


    2. Base architecture. We use ResNet3D, presented in Table 1, as our base architecture for most of our ablation experiments in this section. More specifically, our model takes clips with a size of T×224×224, where T = 8 is the number of frames and 224 is the height and width of the cropped frame. Two spatial downsampling layers (1×2×2) are applied at conv1 and at pool1, and three spatiotemporal downsampling layers (2×2×2) are applied at conv3_1, conv4_1 and conv5_1 via convolutional striding. A global spatiotemporal average pooling with kernel size T/8×7×7 is applied to the final convolutional tensor, followed by a fully connected (fc) layer performing the final classification.

      260K videos

    3. Interaction-reduced channel-separated bottleneck block is derived from the preserved bottleneck block by removing the extra 1×1×1 convolution. This yields the depthwise bottleneck block shown in Figure 2(c). Note that the initial and final 1×1×1 convolutions (usually interpreted respectively as projecting into a lower-dimensional subspace and then projecting back to the original dimensionality) are now the only mechanism left for channel interactions. This implies that the complete block shown in (c) has a reduced number of channel interactions compared with those shown in (a) or (b). We call this design an interaction-reduced channel-separated bottleneck block and the resulting architecture an interaction-reduced channel-separated network (ir-CSN).

      interaction-reduced channel-separated block

    4. Interaction-preserved channel-separated bottleneck block is obtained from the standard bottleneck block (Figure 2(a)) by replacing the 3×3×3 convolution in (a) with a 1×1×1 traditional convolution and a 3×3×3 depthwise convolution (shown in Figure 2(b)). This block reduces parameters and FLOPs of the traditional 3×3×3 convolution significantly, but preserves all channel interactions via a newly added 1×1×1 convolution. We call this an interaction-preserved channel-separated bottleneck block and the resulting architecture an interaction-preserved channel-separated network (ip-CSN).

      interaction-preserved channel-separated network

    5. These reductions occur because each filter in a group receives input from only a fraction 1/G of the channels from the previous layer. In other words, channel grouping restricts feature interaction: only channels within a group can interact.

      reductions by grouping

    6. Conventional convolution is implemented with dense connections, i.e., each convolutional filter receives input from all channels of its previous layer, as in Figure 1(a). However, in order to reduce the computational cost and model size, these connections can be sparsified by grouping convolutional filters into subsets.

      conventional convolution
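
The effect of grouping and of the two channel-separated bottleneck blocks (items 3 to 6 above) can be illustrated by counting weights. This is a minimal sketch, not the authors' code: the function names, the omission of biases and batch normalization, and the block widths (256 outer channels, 64 inner channels) are all assumptions chosen for illustration.

```python
def conv3d_weights(c_in, c_out, kernel=(3, 3, 3), groups=1):
    """Number of weights in a 3D convolution, ignoring biases.

    With `groups` = G, each filter sees only c_in / G input channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    k_t, k_h, k_w = kernel
    return (c_in // groups) * c_out * k_t * k_h * k_w


def bottleneck_weights(wide, mid, variant):
    """Weights of a bottleneck block: 1x1x1 in, middle stage, 1x1x1 out."""
    project_in = conv3d_weights(wide, mid, kernel=(1, 1, 1))
    project_out = conv3d_weights(mid, wide, kernel=(1, 1, 1))
    # Depthwise convolution: one group per channel.
    depthwise = conv3d_weights(mid, mid, groups=mid)
    if variant == "standard":
        middle = conv3d_weights(mid, mid)  # dense 3x3x3 convolution
    elif variant == "ip":
        # Interaction-preserved: extra 1x1x1 convolution plus 3x3x3 depthwise.
        middle = conv3d_weights(mid, mid, kernel=(1, 1, 1)) + depthwise
    elif variant == "ir":
        # Interaction-reduced: 3x3x3 depthwise only.
        middle = depthwise
    else:
        raise ValueError("variant must be 'standard', 'ip' or 'ir'")
    return project_in + middle + project_out


# Hypothetical widths, chosen only for illustration.
WIDE, MID = 256, 64

dense = conv3d_weights(MID, MID)              # G = 1
grouped = conv3d_weights(MID, MID, groups=4)  # G = 4, a quarter of the weights
counts = {v: bottleneck_weights(WIDE, MID, v) for v in ("standard", "ip", "ir")}

print(dense, grouped, counts)
```

With these assumed widths, the dense 3×3×3 convolution dominates the standard block, so both channel-separated variants are several times smaller, and the ir variant differs from the ip variant only by the weights of the extra 1×1×1 convolution.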

    1. ARTNet [34] decouples spatial and temporal modeling into two parallel branches. Similarly, 3D convolutions can also be decomposed into a Pseudo-3D convolutional block as in P3D [25] or factorized convolutions as in R(2+1)D [32] or S3D [40]. 3D group convolution was also applied to video classification in ResNeXt [16] and Multi-Fiber Networks [5] (MFNet).

      decomposition of model

    2. P3D [25], R(2+1)D [32], and S3D [40]. In these architectures, a 3D convolution is replaced with a 2D convolution (in space) followed by a 1D convolution (in time). This factorization can be leveraged to increase accuracy and/or to reduce computation.

      3D convolution architectures
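
The space-time factorization in item 2 above can be sketched as a weight count. The function names and the intermediate channel count `mid` are hypothetical; the cited architectures each choose the intermediate width in their own way, and this sketch simply sets it equal to the output width.

```python
def conv3d_weights(c_in, c_out, kernel):
    """Number of weights in a dense 3D convolution, ignoring biases."""
    k_t, k_h, k_w = kernel
    return c_in * c_out * k_t * k_h * k_w


def factorized_weights(c_in, c_out, mid, k_t=3, k=3):
    """Weights of a 2D spatial convolution followed by a 1D temporal one."""
    spatial = conv3d_weights(c_in, mid, (1, k, k))       # 1 x k x k, in space
    temporal = conv3d_weights(mid, c_out, (k_t, 1, 1))   # k_t x 1 x 1, in time
    return spatial + temporal


# Hypothetical layer: 64 input and output channels, 3x3x3 receptive field.
full = conv3d_weights(64, 64, (3, 3, 3))
factorized = factorized_weights(64, 64, mid=64)  # `mid` = 64 is an assumption

print(full, factorized)
```

With `mid` equal to the output width, the factorized pair uses fewer weights than the full 3×3×3 convolution. A larger `mid` would trade that saving for capacity, which is one way the factorization can be used for accuracy rather than for reduced computation.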