Tagami, R., Kobayashi, H., Akizuki, S., Hashimoto, M.
Abstract:
In this study, we propose a method for automatically generating high-quality CLIP training data to improve the performance of text-based image retrieval with CLIP. CLIP training uses two types of image-text pairs: compatible (correct) pairs and incompatible (incorrect) pairs. Correct pairs, in which the image and text content match, are traditionally collected by web scraping or similar methods, while incorrect pairs are formed by recombining elements of correct pairs. CLIP is trained contrastively to increase the similarity within correct pairs and decrease it within incorrect pairs. However, when the training data contain multiple images that are similar to one another, the texts attached to those images are also likely to be similar; pairs formed by swapping these texts should ideally be treated as correct, yet conventional training treats them as incorrect. Conversely, if two images drawn from the training data are dissimilar, the texts assigned to them should also be dissimilar, so exchanging their texts yields highly reliable incorrect pairs. We apply this idea by clustering the images and the texts in the training data separately, using the similarity between clusters to generate incorrect pairs, and training so that a pair's negative contribution to the loss increases as the similarity between the images decreases. In an experiment on the Amazon review dataset, which is commonly used in this field, the proposed method improved the Rank@1 score by 27.0% compared to vanilla CLIP.
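The following is a minimal PyTorch sketch of the weighting idea described above, not the authors' actual implementation: the exact loss form, the function name weighted_clip_loss, and the cluster_sim matrix (per-pair similarity between the image clusters of the two samples) are illustrative assumptions. Each negative term in the contrastive denominator is weighted by (1 - cluster similarity), so pairs whose images come from dissimilar clusters act as stronger negatives, while pairs from similar clusters are nearly ignored.

    import torch
    import torch.nn.functional as F

    def weighted_clip_loss(image_emb, text_emb, cluster_sim, temperature=0.07):
        """Similarity-weighted contrastive loss (sketch).

        image_emb, text_emb: (B, D) L2-normalized embeddings.
        cluster_sim: (B, B) similarity in [0, 1] between the image clusters of
                     samples i and j; the diagonal is 1.
        """
        logits = image_emb @ text_emb.t() / temperature          # (B, B) pairwise scores
        B = logits.size(0)
        eye = torch.eye(B, device=logits.device)

        # Positives keep weight 1; negatives are weighted by how dissimilar
        # the corresponding image clusters are.
        weights = eye + (1.0 - eye) * (1.0 - cluster_sim)

        # Weighted softmax denominator: sum_j w_ij * exp(logit_ij), done in log space.
        log_w = torch.log(weights.clamp_min(1e-8))
        loss_i2t = -(logits.diagonal()
                     - torch.logsumexp(logits + log_w, dim=1)).mean()
        loss_t2i = -(logits.diagonal()
                     - torch.logsumexp(logits.t() + log_w.t(), dim=1)).mean()
        return 0.5 * (loss_i2t + loss_t2i)

    # Example usage with random embeddings and a random cluster-similarity matrix.
    B, D = 8, 512
    image_emb = F.normalize(torch.randn(B, D), dim=1)
    text_emb = F.normalize(torch.randn(B, D), dim=1)
    cluster_sim = torch.rand(B, B)
    cluster_sim.fill_diagonal_(1.0)
    loss = weighted_clip_loss(image_emb, text_emb, cluster_sim)

With cluster_sim fixed to all zeros off the diagonal, this reduces to the standard symmetric CLIP (InfoNCE) loss; raising an entry toward 1 removes that swapped pair's influence as a negative.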