Abstract: Fine-grained images are difficult to distinguish because inter-class differences are small. To address this problem, this paper proposes an improved Transformer-based fine-grained recognition algorithm that strengthens the network's ability to represent image detail features. First, a deformable convolutional token embedding adaptively adjusts the sampling points, modifying the operating range and shape of the convolution kernel; this improves the network's perception of spatial information and yields more accurate spatial details. Second, an efficient correlation channel attention mechanism shifts the computation from neighboring channels to automatically selected, semantically similar channels, capturing semantically related channel information. Together, the precise spatial information and the semantically related channel information enhance the network's perception of local features. Experimental results show that, compared with the baseline algorithms, the proposed method improves recognition accuracy by 1.5%, 2.4%, and 1.5% on the CUB-200-2011, Stanford Cars, and Stanford Dogs datasets, respectively. These results indicate that the proposed approach improves fine-grained image recognition by strengthening the representation of image detail features.
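The channel attention described above can be illustrated with a minimal sketch. The code below is an assumption-laden toy version, not the paper's exact formulation: channel descriptors are obtained by flattening each channel, cosine similarity selects the k most similar channels for each channel (rather than its k physically adjacent channels, as in plain ECA-style attention), and a sigmoid of their pooled responses produces the per-channel weight. The function name, the top-k rule, and the aggregation are all illustrative choices.

```python
import numpy as np

def correlation_channel_attention(x, k=3):
    """Toy similarity-based channel attention on a feature map x of shape (C, H, W).

    For each channel, the attention weight is computed from the k channels whose
    spatial responses are most similar to it (semantic neighbors), instead of the
    k channels adjacent in channel index. Illustrative sketch only.
    """
    C, H, W = x.shape
    flat = x.reshape(C, -1)                                   # (C, H*W) per-channel responses
    norm = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T                                       # (C, C) cosine similarity
    gap = flat.mean(axis=1)                                   # (C,) global-average-pooled descriptors
    weights = np.empty(C)
    for c in range(C):
        topk = np.argsort(sim[c])[::-1][:k]                   # indices of the k most similar channels
        weights[c] = 1.0 / (1.0 + np.exp(-gap[topk].mean()))  # sigmoid over pooled similar channels
    return x * weights[:, None, None]                         # reweight each channel

x = np.random.rand(8, 4, 4)
y = correlation_channel_attention(x, k=3)
```

Because the weights lie in (0, 1), the output is an elementwise reweighting of the input feature map, with each channel's weight driven by its semantically nearest channels.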