Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Best AI papers explained - A podcast by Enoch H. Kang

This paper introduces COCO-FACET, a new benchmark designed to evaluate text-to-image retrieval models on attribute-focused queries, which differ from traditional general caption-style queries. The researchers demonstrate that existing models, both CLIP-like and MLLM-based, struggle with these queries, especially when the target attributes, such as time of day or weather, are visually subtle or underrepresented in training data. To address this, they propose promptable image embeddings built with multimodal large language models (MLLMs), in which a prompt steers the embedding toward the attribute of interest, and show that this significantly improves retrieval performance on attribute-focused queries. The paper also explores acceleration strategies to make the approach practical at scale.
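
As a rough illustration of the retrieval idea (a minimal sketch, not the paper's actual code), the example below shows how prompt-conditioned image embeddings might be used: each image is embedded together with an attribute prompt (e.g. about weather or time of day), and candidates are ranked by cosine similarity against the query embedding. The `embed_text` and `embed_image_with_prompt` functions are hypothetical stand-ins for an MLLM-based encoder; only the ranking logic is meant literally.

```python
import numpy as np

def embed_text(text: str, dim: int = 256) -> np.ndarray:
    """Hypothetical stand-in for a text encoder; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def embed_image_with_prompt(image_id: str, attribute_prompt: str, dim: int = 256) -> np.ndarray:
    """Hypothetical stand-in for a promptable MLLM image encoder:
    the attribute prompt steers which aspects of the image the embedding highlights."""
    rng = np.random.default_rng(abs(hash((image_id, attribute_prompt))) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, attribute_prompt: str, image_ids: list[str], top_k: int = 3):
    """Rank images by cosine similarity between the query embedding and
    prompt-conditioned image embeddings."""
    q = embed_text(query)
    scored = []
    for img in image_ids:
        v = embed_image_with_prompt(img, attribute_prompt)
        scored.append((img, float(q @ v)))  # unit vectors, so dot product == cosine similarity
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

if __name__ == "__main__":
    results = retrieve(
        query="a street photo taken on a foggy morning",
        attribute_prompt="Focus on the time of day and weather conditions in the image.",
        image_ids=["img_001", "img_002", "img_003", "img_004"],
    )
    for image_id, score in results:
        print(f"{image_id}: {score:.3f}")
```

In a real system the stand-in encoders would be replaced by the MLLM described in the paper, and the prompt-conditioned image embeddings could be precomputed per attribute, which is where the acceleration strategies mentioned above come in.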