Abstract: Transformer-based object detectors proposed in recent years simplify the model structure and achieve competitive performance. However, most of these models converge slowly and detect small objects poorly because of the way the Transformer attention module processes feature maps. To address these issues, this study proposes a Transformer detection model built on a pre-filtered attention module. Taking the target point as a reference, the module samples only a subset of feature points near that point, which shortens training and improves detection accuracy. A newly defined directional relative position encoding is also integrated into the module; it compensates for the relative position information lost in the module's weight calculation and is particularly helpful for detecting small objects. Experiments on the COCO 2017 dataset show that our model reduces training time by a factor of 10 and improves detection accuracy, especially for small objects (26.8 AP_S).
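The following is a minimal sketch, not the authors' implementation, of the pre-filtered attention idea summarized above: for each query, attention is restricted to the K feature points nearest its reference point, and a directional relative-position bias is added to the attention logits. All names and hyperparameters (PreFilteredAttention, num_samples, num_dir_bins, the angle-binning form of the encoding) are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreFilteredAttention(nn.Module):
    """Sketch of attention restricted to feature points near a reference point."""

    def __init__(self, dim, num_samples=16, num_dir_bins=8):
        super().__init__()
        self.num_samples = num_samples            # K feature points kept per query
        self.num_dir_bins = num_dir_bins
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learned bias per quantized direction: one assumed form of a
        # "directional relative position encoding".
        self.dir_bias = nn.Parameter(torch.zeros(num_dir_bins))

    def forward(self, queries, ref_points, feats, feat_coords):
        # queries:     (B, Nq, C)  query embeddings
        # ref_points:  (B, Nq, 2)  normalized (x, y) reference points
        # feats:       (B, Nf, C)  flattened feature-map tokens
        # feat_coords: (B, Nf, 2)  normalized (x, y) of each feature token
        B, Nq, C = queries.shape
        K = self.num_samples

        # 1) Pre-filtering: keep only the K feature points nearest each reference point.
        dists = torch.cdist(ref_points, feat_coords)      # (B, Nq, Nf)
        _, idx = dists.topk(K, dim=-1, largest=False)     # (B, Nq, K)
        batch_idx = torch.arange(B, device=feats.device)[:, None, None]
        k = self.k_proj(feats)[batch_idx, idx]            # (B, Nq, K, C)
        v = self.v_proj(feats)[batch_idx, idx]            # (B, Nq, K, C)
        coords = feat_coords[batch_idx, idx]              # (B, Nq, K, 2)

        # 2) Directional bias: quantize the angle from the reference point
        #    to each sampled point into a small number of direction bins.
        rel = coords - ref_points.unsqueeze(2)            # (B, Nq, K, 2)
        angle = torch.atan2(rel[..., 1], rel[..., 0])     # in (-pi, pi]
        bins = ((angle + math.pi) / (2 * math.pi) * self.num_dir_bins).long()
        bins = bins.clamp_(0, self.num_dir_bins - 1)
        bias = self.dir_bias[bins]                        # (B, Nq, K)

        # 3) Attention computed only over the sampled points.
        q = self.q_proj(queries).unsqueeze(2)             # (B, Nq, 1, C)
        logits = (q * k).sum(-1) / C ** 0.5 + bias        # (B, Nq, K)
        attn = F.softmax(logits, dim=-1)
        out = (attn.unsqueeze(-1) * v).sum(dim=2)         # (B, Nq, C)
        return self.out_proj(out)
```

Under these assumptions, restricting each query to K nearby feature points shrinks the attention computation from O(Nq x Nf) to O(Nq x K), which is the kind of reduction the abstract attributes the faster convergence to; the directional bias reinjects coarse relative-position information that the sparse weight calculation would otherwise discard.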