Abstract: To address the low accuracy and low efficiency of complex human action recognition in video, a densely connected network model for spatio-temporal feature extraction is proposed. First, two densely connected networks extract spatial and temporal features, respectively. Second, dense connections are built between the spatial and temporal networks: the features extracted by the temporal network are fed layer by layer into the spatial-stream network, strengthening the interaction between the two streams. Then, an LSTM network processes the features of each stream separately to produce per-stream predictions. Finally, the predictions of the two streams are fused to recognize complex actions in video. In comparative experiments on the UCF101 and HMDB51 benchmark datasets, the model achieves accuracies of 94.69% and 68.87%, respectively, outperforming the compared algorithms. The experiments show that the model increases the interaction between the spatial and temporal networks and benefits the recognition of complex human actions.
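The final fusion step can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the weighted averaging of per-class softmax scores shown here is an assumption, a common choice in two-stream models rather than the authors' confirmed method. The names `fuse_predictions` and `spatial_weight` are hypothetical and introduced only for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(spatial_logits, temporal_logits, spatial_weight=0.5):
    """Late fusion of two per-stream predictions (assumed rule, not from the paper).

    Each stream's class scores are turned into probabilities, then combined
    as a weighted average. Returns the predicted class index and the fused
    probability vector.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= spatial_weight <= 1.0:
        raise ValueError("spatial_weight must lie in [0, 1]")

    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    fused = [
        spatial_weight * ps + (1.0 - spatial_weight) * pt
        for ps, pt in zip(p_spatial, p_temporal)
    ]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


if __name__ == "__main__":
    # Toy scores for three action classes; the values are made up.
    label, fused = fuse_predictions([2.0, 0.5, 0.1], [0.2, 3.0, 0.1])
    print(label, [round(p, 3) for p in fused])
```

With equal weights, a class that one stream scores confidently can outweigh a weaker preference from the other stream. In practice the weight would be tuned on validation data.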