Abstract: Existing human pose estimation algorithms extract insufficient features in the backbone network, which causes keypoint feature information to be lost. To address this problem, this paper proposes GLF-Net, a human pose estimation network model built around a global-local feature fusion module. To obtain high-quality feature maps at the feature extraction stage, the algorithm improves the ResNet-50 backbone from both the global and the local perspective by designing a global polarized self-attention module and a local depthwise separable convolution module. The two modules are arranged in parallel, and the resulting module, which fuses global position information with local semantic information, is embedded into the Bottleneck layers of the backbone. This strengthens the feature extraction ability of the original backbone and also supplies effective global and local features to the subsequent Transformer network, thereby improving keypoint detection performance. The model is evaluated on two public human pose estimation datasets, COCO 2017 and MPII. Compared with the baseline algorithm (Poseur), the average precision (AP) of the pose keypoints increases by 2.1%, the average recall (AR) increases by 1.5%, and the percentage of correctly estimated keypoints (PCKh@0.5) reaches 90.6. The experimental results show that the proposed algorithm outperforms existing comparable methods in pose estimation accuracy and significantly improves the localization accuracy of human pose keypoints.
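
For readers unfamiliar with the PCKh@0.5 metric quoted above, the following is a minimal sketch of how such a score is commonly computed: a predicted keypoint counts as correct when its distance to the ground truth is at most 0.5 times the head size of that person. The function name `pckh`, its parameters, and the toy coordinates are illustrative assumptions and are not taken from the paper's evaluation code; joint visibility masks, which a full evaluation would normally apply, are omitted here.

```python
import math


def pckh(pred, gt, head_sizes, threshold=0.5):
    """Percentage of keypoints within threshold * head size of the ground truth.

    pred, gt   : one list of (x, y) keypoints per person (hypothetical layout)
    head_sizes : one head-size value per person, in pixels
    threshold  : fraction of the head size; 0.5 gives PCKh@0.5
    """
    correct = 0
    total = 0
    for pred_kpts, gt_kpts, head in zip(pred, gt, head_sizes):
        for (px, py), (gx, gy) in zip(pred_kpts, gt_kpts):
            # Euclidean distance between prediction and ground truth
            dist = math.hypot(px - gx, py - gy)
            if dist <= threshold * head:
                correct += 1
            total += 1
    # Return a percentage; an empty input yields 0.0 instead of dividing by zero
    return 100.0 * correct / total if total else 0.0


# Toy data (assumed, not from any dataset): one person, four keypoints,
# head size 10 px, so the tolerance is 5 px.
example_pred = [[(10, 10), (20, 20), (30, 30), (40, 40)]]
example_gt = [[(10, 12), (20, 26), (30, 30), (47, 40)]]
example_heads = [10.0]
score = pckh(example_pred, example_gt, example_heads)
```

In this toy case the four distances are 2, 6, 0, and 7 pixels, so two of the four keypoints fall within the 5-pixel tolerance and `score` is 50.0.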