Abstract: Breakthroughs in deep learning for image analysis have driven rapid progress in feature learning. To exploit the temporal correlation between consecutive frames in video sequences, an attention-based residual 3D convolutional network model is proposed for human action recognition. First, a residual 3D convolutional network learns the temporal correlation between consecutive frames of a video sequence. Then, a channel attention network, extended to three dimensions, assigns a different weight to each feature channel learned by the residual 3D convolutional structure. Finally, the reweighted features are fed into a classifier to produce the final classification. Experiments on the UCF-101 and HMDB-51 datasets yield accuracies of 95.8% and 69.7%, respectively. These results show that the proposed model achieves high recognition accuracy for human action recognition in video.
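
The channel reweighting step can be illustrated with a minimal NumPy sketch. It assumes a squeeze-and-excitation style design (global average pooling over time, height, and width, followed by a two-layer bottleneck and a sigmoid), which the abstract does not confirm. The function name `channel_attention_3d`, the reduction ratio `r`, the tensor sizes, and the random weights are all hypothetical and chosen only for illustration.

```python
import numpy as np

def channel_attention_3d(x, w1, w2):
    """Reweight the channels of a 5D feature map (assumed SE-style design).

    x  : features of shape (N, C, T, H, W)
    w1 : bottleneck weights of shape (C, C // r)   (hypothetical)
    w2 : expansion weights of shape (C // r, C)    (hypothetical)
    """
    # Squeeze: average each channel over time, height, and width -> (N, C)
    s = x.mean(axis=(2, 3, 4))
    # Excitation: ReLU bottleneck, then sigmoid so each weight lies in (0, 1)
    h = np.maximum(s @ w1, 0.0)
    a = 1.0 / (1.0 + np.exp(-(h @ w2)))
    # Reweight: broadcast one scalar per channel over (T, H, W)
    return x * a[:, :, None, None, None], a

# Toy input standing in for the output of a residual 3D convolution block
rng = np.random.default_rng(0)
N, C, T, H, W, r = 2, 8, 4, 6, 6, 4
x = rng.standard_normal((N, C, T, H, W))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1

y, a = channel_attention_3d(x, w1, w2)
print(y.shape, a.shape)
```

In a trained model the two weight matrices would be learned jointly with the convolutional layers; here they are random, so the resulting channel weights carry no meaning beyond showing the shapes and the broadcasting.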