Abstract: Graph-convolution-based methods for skeleton action recognition rely heavily on hand-designed graph topologies when modelling joint features and lack the ability to model global joint dependencies. To address this issue, we propose a spatio-temporal convolutional Transformer network that models both spatial and temporal joint features. For spatial joint feature modelling, we propose a dynamic grouping decoupling Transformer that splits the input skeleton sequence into groups along the channel dimension and dynamically generates a different attention matrix for each group, establishing global dependencies between joints without requiring prior knowledge of the human topology. For temporal joint feature modelling, multi-scale temporal convolution extracts features of target behaviours at different scales. Finally, we propose a spatio-temporal channel joint attention module to further refine the extracted spatio-temporal features. The proposed method achieves Top-1 recognition accuracies of 92.5% and 89.3% under the cross-subject evaluation protocol on the NTU-RGB+D and NTU-RGB+D 120 datasets, respectively, demonstrating its effectiveness.
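The grouped dynamic attention described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the random stand-ins for learned projections, and the dimensions are all assumptions; it only shows the core idea of splitting channels into groups and computing one data-dependent joint-to-joint attention matrix per group, with no fixed skeleton graph involved.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_dynamic_attention(x, num_groups, rng):
    """Hypothetical sketch: split channels into groups and compute a
    separate data-dependent attention matrix over joints for each group."""
    num_joints, channels = x.shape
    d = channels // num_groups  # per-group channel width
    outputs = []
    for g in range(num_groups):
        xg = x[:, g * d:(g + 1) * d]          # (joints, d) slice for this group
        # stand-ins for learned projections, drawn randomly for illustration
        Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
        q, k, v = xg @ Wq, xg @ Wk, xg @ Wv
        attn = softmax(q @ k.T / np.sqrt(d))  # (joints, joints): one matrix per group
        outputs.append(attn @ v)              # every joint attends to every other joint
    return np.concatenate(outputs, axis=1)    # re-assemble the channel groups

rng = np.random.default_rng(0)
x = rng.standard_normal((25, 64))             # e.g. 25 joints, 64 channels per joint
y = grouped_dynamic_attention(x, num_groups=4, rng=rng)
print(y.shape)  # (25, 64)
```

Because each group's attention matrix is computed from the input itself, the joint dependencies adapt to the sequence rather than being fixed by a hand-designed adjacency matrix.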