Abstract: To address the challenges of accurately locating target regions and identifying fine-grained features in fine-grained image classification, we propose a classification method based on improved multi-scale deformable convolution (MMAL). First, exploiting the variable receptive field of deformable convolution, the method adapts dynamically to target regions of different scales and shapes in the sample images, which enhances the network's ability to perceive where these regions lie. Second, we use Grad-CAM gradient backpropagation to generate network attention heatmaps, which suppress interference from background noise and localize the image target regions precisely. Finally, we introduce a position-aware spatial attention module that integrates coordinate positions with dual-scale spatial information, significantly improving the network's ability to extract fine-grained features from the target regions. Experimental results show that, compared with the baseline methods, our approach improves classification accuracy by 1.4%, 1.5%, and 1.9% on the CUB-200-2011, Stanford Cars, and FGVC-Aircraft datasets, respectively, validating the effectiveness of the proposed method.
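The Grad-CAM localization step can be illustrated with a minimal NumPy sketch. It follows the standard Grad-CAM formulation (channel weights taken as spatially averaged gradients, followed by a ReLU over the weighted sum of feature maps) and then crops a target region by thresholding the heatmap. The function names `grad_cam` and `bounding_box`, the threshold of 0.5, and the thresholding rule itself are assumptions made for illustration; the abstract does not state how the target region is derived from the heatmap.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Standard Grad-CAM heatmap from one convolutional layer.

    activations: feature maps of shape (C, H, W).
    gradients:   gradients of the class score with respect to those
                 feature maps, same shape (C, H, W).
    Returns a heatmap of shape (H, W) scaled to [0, 1].
    """
    # Channel importance: global average of the gradients per channel.
    weights = gradients.mean(axis=(1, 2))                # shape (C,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)     # shape (H, W)
    # Keep only evidence that supports the class.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def bounding_box(cam, threshold=0.5):
    """Hypothetical localization rule: tightest box around all cells
    whose heatmap value reaches `threshold`.

    Returns (top, left, bottom, right), inclusive, or None if no cell
    passes the threshold.
    """
    ys, xs = np.nonzero(cam >= threshold)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())
```

In a real pipeline the activations and gradients would come from a hook on the chosen convolutional layer, and the resulting box would be rescaled from feature-map coordinates to image coordinates before cropping.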