Human Action Recognition is one of the key tasks in video understanding, and deep Convolutional Neural Networks (CNNs) are often used for this purpose. Although they usually perform impressively, interpreting their decisions remains challenging. We propose a novel technique for the visual understanding of CNN features. Its objective is to find the salient features that played a key role in the network's decision making. The technique uses only the features from the last convolutional layer before the fully connected layers of a trained model and builds an importance map of these features. The map is propagated back to the original frames, highlighting the regions that contribute to the final decision. The method is fast, as it does not require the gradient computation that many state-of-the-art methods rely on. The proposed technique is applied to the Twin Spatio-Temporal 3D Convolutional Neural Network (TSTCNN), designed for table tennis action recognition. Feature visualization is performed on the RGB and optical flow branches of the network. The obtained results are compared to other visualization techniques, both in terms of human understanding and similarity metrics. The metrics show that the generated maps are similar to those obtained with the well-known Grad-CAM method: the Pearson correlation coefficient between the maps produced by Grad-CAM and our method is 0.7 ± 0.05 on RGB data and 0.72 ± 0.06 on optical flow data.
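The abstract describes the pipeline only at a high level, so the sketch below illustrates one plausible gradient-free reading of it in PyTorch: aggregate the last-conv-layer activations of one branch into an importance map, upsample it to frame resolution, and compare maps with the Pearson correlation coefficient. The channel aggregation (mean of absolute activations) and the trilinear upsampling are assumptions for illustration, not the paper's confirmed operations.

```python
import torch
import torch.nn.functional as F

def feature_importance_map(activations: torch.Tensor,
                           frame_size: tuple) -> torch.Tensor:
    """Build a saliency map from last-conv-layer activations.

    activations: (C, T, H, W) feature volume from the last convolutional
    layer of one branch (RGB or optical flow) of a 3D CNN such as TSTCNN.
    frame_size:  (height, width) of the original input frames.

    NOTE: the exact aggregation is an assumption here; averaging the
    absolute activations over channels is one plausible gradient-free choice.
    """
    # Channel-wise aggregation: no backward pass needed, unlike Grad-CAM.
    importance = activations.abs().mean(dim=0)           # (T, H, W)

    # Upsample every temporal slice to the original frame resolution.
    importance = F.interpolate(
        importance.unsqueeze(0).unsqueeze(0),            # (1, 1, T, H, W)
        size=(importance.shape[0], *frame_size),
        mode="trilinear",
        align_corners=False,
    ).squeeze(0).squeeze(0)                              # (T, H', W')

    # Normalize to [0, 1] so the map can be overlaid on the frames.
    importance -= importance.amin()
    importance /= importance.amax().clamp_min(1e-8)
    return importance

def pearson_similarity(map_a: torch.Tensor, map_b: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between two flattened saliency maps,
    the similarity metric quoted in the abstract."""
    stacked = torch.stack([map_a.flatten(), map_b.flatten()])
    return torch.corrcoef(stacked)[0, 1]
```

For example, given `feats = rgb_branch(clip).squeeze(0)` of shape (C, T, H, W), `feature_importance_map(feats, (224, 224))` yields one heatmap per frame that can be overlaid on the input and compared against a Grad-CAM map with `pearson_similarity`.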
