Abstract: Human action recognition is an active research topic in computer vision, with important applications in human-computer interaction, video surveillance, and related areas. A 2D CNN alone cannot effectively capture temporal relationships between frames. To address this, we exploit the Transformer's strength in modeling long-term dependencies and combine a Transformer with a 2D CNN for human action recognition, so that temporal context is captured more effectively. First, a 2D CNN with an integrated channel-spatial attention module extracts spatial features from each frame. Then, a Transformer captures the temporal features between frames. Finally, an MLP head performs action classification. Experimental results show that the method achieves recognition accuracies of 69.4% and 95.5% on the HMDB-51 and UCF-101 datasets, respectively.
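The three-stage data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The CNN backbone is omitted, so `fmap` stands in for per-frame feature maps that a 2D CNN would produce. The simple sigmoid gating, the single-head attention, the mean pooling over frames, and all names and dimensions (`init_params`, `classify_clip`, 16 channels, 32 hidden units) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(channels, hidden, num_classes, seed=0):
    # Hypothetical random weights; a real model would learn these.
    rng = np.random.default_rng(seed)

    def w(*shape):
        return 0.1 * rng.standard_normal(shape)

    return {
        "w_channel": w(channels, channels),
        "wq": w(channels, channels),
        "wk": w(channels, channels),
        "wv": w(channels, channels),
        "w1": w(channels, hidden),
        "w2": w(hidden, num_classes),
    }


def channel_spatial_attention(fmap, w_channel):
    # fmap: (T, C, H, W) feature maps, one per frame.
    # Channel gate: global average pool -> linear -> sigmoid (assumed form).
    pooled = fmap.mean(axis=(2, 3))                       # (T, C)
    ch_gate = sigmoid(pooled @ w_channel)                 # (T, C)
    fmap = fmap * ch_gate[:, :, None, None]
    # Spatial gate: sigmoid of the channel-averaged map (assumed form).
    sp_gate = sigmoid(fmap.mean(axis=1, keepdims=True))   # (T, 1, H, W)
    return fmap * sp_gate


def temporal_self_attention(x, wq, wk, wv):
    # x: (T, D). Single-head scaled dot-product attention across frames,
    # with a residual connection.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])               # (T, T)
    return x + softmax(scores, axis=-1) @ v


def classify_clip(fmap, params):
    # Stage 1: attention-refined spatial features, pooled per frame.
    attended = channel_spatial_attention(fmap, params["w_channel"])
    frame_feats = attended.mean(axis=(2, 3))              # (T, C)
    # Stage 2: temporal modeling between frames.
    temporal = temporal_self_attention(
        frame_feats, params["wq"], params["wk"], params["wv"]
    )
    # Stage 3: MLP head on the clip-level feature (mean over frames, assumed).
    clip_feat = temporal.mean(axis=0)                     # (C,)
    hidden = np.maximum(clip_feat @ params["w1"], 0.0)    # ReLU
    return softmax(hidden @ params["w2"])                 # class probabilities
```

For example, calling `classify_clip` on an array of shape `(8, 16, 7, 7)` (8 frames, 16 channels, 7x7 maps) with `init_params(16, 32, 51)` returns a probability vector over 51 classes, matching the number of classes in HMDB-51.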