Abstract: To address the problems of a single detection category, low detection accuracy, and difficulty in detecting complex objects, this paper proposes CT-CBDet, a cascaded bag detection method that integrates convolution and transformer. First, a deformable conformer is designed as the backbone network for feature extraction; building on a dual network that fuses transformer and convolution, it uses deformable convolution and spatial pyramid pooling modules to achieve geometric feature transformation and multi-scale feature fusion, which strengthens its feature modeling ability. Then, a region proposal network with adaptive positive and negative sample selection based on anchor statistical features is proposed, to balance the selection of positive and negative samples for objects at different scales and to improve training stability. Finally, the cascade detection component of the model is trained end-to-end with a multi-stage loss. The results show that the method improves mAP by 5.8% and small-scale object detection accuracy by 10.9% compared with the baseline method, Cascade RCNN, indicating that CT-CBDet can effectively perform the bag detection task in complex scenes.
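
To illustrate the idea of adaptive positive and negative sample selection from anchor statistics, the following is a minimal sketch and not the implementation used in CT-CBDet. It assumes, hypothetically, that the threshold for one ground-truth box is the mean plus the standard deviation of the IoUs between that box and its candidate anchors, in the style of ATSS; the abstract does not state the exact rule. The names `iou` and `select_positive_anchors` are illustrative, and candidate pre-selection per pyramid level is omitted.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_positive_anchors(anchors, gt_box):
    """Return (indices of positive anchors, adaptive IoU threshold).

    Assumption: threshold = mean(IoU) + std(IoU) over the candidate anchors,
    so it adapts to each object instead of using one fixed value.
    """
    ious = [iou(anchor, gt_box) for anchor in anchors]
    threshold = mean(ious) + pstdev(ious)
    positives = [i for i, v in enumerate(ious) if v > 0 and v >= threshold]
    return positives, threshold


# Hypothetical candidate anchors and one ground-truth box.
anchors = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30), (1, 1, 11, 11)]
gt_box = (0, 0, 10, 10)
positives, threshold = select_positive_anchors(anchors, gt_box)
print(positives, round(threshold, 3))
```

Because the threshold is computed per object from its own candidates, a small object whose anchors all have low IoU still receives positive samples, which is the fairness across scales that the abstract refers to.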