Abstract: Image retrieval methods based on deep hashing typically rely on convolution and pooling to extract local information from images, and must deepen the network to capture global long-range dependencies, which generally leads to high model complexity and computational cost. This paper proposes an attention-enhanced image retrieval algorithm based on the vision Transformer, which uses a pre-trained vision Transformer as the backbone to speed up model convergence and achieves efficient image retrieval through improvements to the backbone network and the hash function design. On the one hand, an attention enhancement module is designed to capture locally salient information and visual details in the input feature map; it learns corresponding weights to highlight important features, enhancing the representativeness of the image features fed to the Transformer encoder. On the other hand, to generate discriminative hash codes, a contrastive hash loss is designed to further ensure retrieval accuracy. Experimental results on the CIFAR-10 and NUS-WIDE datasets show that the proposed method achieves an average precision of 96.8% and 86.8%, respectively, across different hash code lengths, outperforming several classic deep hashing algorithms and two other Transformer-based image retrieval algorithms.
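The abstract does not give the exact form of the proposed contrastive hash loss, but the general idea of a contrastive loss over hash outputs can be sketched as follows. This is an illustrative toy in pure Python, not the paper's formulation: `contrastive_hash_loss`, the `margin` value, and the squared-Euclidean distance are all assumptions for demonstration. Similar pairs are pulled together by penalizing their distance; dissimilar pairs are pushed apart until they exceed a margin.

```python
import math

def contrastive_hash_loss(h1, h2, similar, margin=2.0):
    """Toy contrastive loss over real-valued hash outputs (illustrative only,
    not the paper's exact loss).

    similar=True  -> penalize squared distance between the two codes
    similar=False -> penalize only if the codes are closer than `margin`
    """
    d2 = sum((a - b) ** 2 for a, b in zip(h1, h2))  # squared Euclidean distance
    if similar:
        return d2
    # Hinge on the distance: zero loss once the pair is margin apart.
    return max(0.0, margin - math.sqrt(d2)) ** 2

# Identical codes: no loss for a similar pair, full margin penalty otherwise.
print(contrastive_hash_loss([1, 1, -1], [1, 1, -1], similar=True))   # 0.0
print(contrastive_hash_loss([1, 1, -1], [1, 1, -1], similar=False))  # 4.0
# Well-separated dissimilar pair (distance 4 > margin 2): no loss.
print(contrastive_hash_loss([2, 0, 0], [-2, 0, 0], similar=False))   # 0.0
```

At retrieval time, real-valued outputs like these are typically binarized (e.g. by taking the sign of each component) to obtain the final hash codes.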