Abstract: Pedestrian detection is an important branch of object detection, and pedestrian detection algorithms are by now well developed; however, severe occlusion between pedestrians in crowded scenes still poses a great challenge to the detection task. To alleviate this problem, this paper improves on YOLOv3 and proposes a single-stage dense pedestrian detection algorithm, Crowd-YOLO. First, visible-box annotations are added to the network to assist training, so that the network predicts both the full-body box and the visible box and thereby improves detection performance. Second, a time-frequency domain fused attention module (TFFAM) is proposed, which introduces frequency-domain channel attention and spatial attention into the network to redistribute features. Third, data-dependent upsampling replaces traditional bilinear interpolation to obtain a richer representation of deep feature maps. The method is trained and tested on CrowdHuman, a very challenging large-scale crowded-scene dataset. Experimental results show that the proposed method improves AP50 by about 3.7% and recall by 3.4% over the baseline, with the TFFAM alone contributing a 2.3% AP gain, verifying the effectiveness of the proposed method in crowded scenes.
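The TFFAM mentioned above fuses frequency-domain channel attention with spatial attention. A minimal NumPy sketch of that general idea follows; it is an illustration under stated assumptions, not the paper's exact module — the choice of DCT frequencies, the averaging of the frequency descriptors, and the sigmoid gating are all assumptions made here for clarity.

```python
import numpy as np

def dct_basis(H, W, u, v):
    # 2D DCT-II basis function at frequency (u, v); (0, 0) is the constant
    # basis, so projecting onto it reduces to global average pooling.
    i = np.arange(H)[:, None]
    j = np.arange(W)[None, :]
    return np.cos((2 * i + 1) * u * np.pi / (2 * H)) * \
           np.cos((2 * j + 1) * v * np.pi / (2 * W))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tffam(x, freqs=((0, 0), (0, 1), (1, 0))):
    """Toy time-frequency fused attention over a feature map x of shape (C, H, W).

    Step 1: frequency-domain channel attention -- each channel is summarized
    by a few 2D-DCT coefficients (hypothetical frequency set `freqs`) instead
    of a single global average, then squashed into per-channel weights.
    Step 2: spatial attention -- each location is weighted by a squashed
    channel-mean response.
    """
    C, H, W = x.shape
    # (C, len(freqs)) matrix of DCT coefficients per channel
    desc = np.stack(
        [(x * dct_basis(H, W, u, v)).sum(axis=(1, 2)) for u, v in freqs],
        axis=1,
    )
    chan_w = sigmoid(desc.mean(axis=1))       # (C,) channel weights in (0, 1)
    x = x * chan_w[:, None, None]             # redistribute channel features
    spat_w = sigmoid(x.mean(axis=0))          # (H, W) spatial weights in (0, 1)
    return x * spat_w[None, :, :]             # redistribute spatial features
```

Because both gates lie in (0, 1), the module only rescales features; output shape matches the input, so it can be dropped between backbone stages.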