Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

Best AI papers explained - A podcast by Enoch H. Kang

This paper, authored by Google DeepMind researchers, examines the growing reliance on large language models (LLMs) in information retrieval (IR) systems, where they serve as rankers, judges, and assistants across tasks ranging from content creation to evaluation. The authors highlight how biases can emerge from the interaction of these LLM-based components and provide empirical evidence that LLM judges exhibit a significant preference for LLM-based rankers. They also show that LLM judges struggle to discern subtle performance differences between systems, while their preliminary findings suggest no strong bias against AI-generated content. The paper concludes by calling for a more comprehensive understanding of the LLM-driven information ecosystem and proposes guidelines and a research agenda for the reliable use of LLMs in IR evaluation.
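
To make the bias finding concrete, here is a minimal sketch (not from the paper) of how one might quantify a judge-to-ranker bias of this kind: collect pairwise judgments in which an LLM judge compares an LLM-based ranker's output against a traditional ranker's output on queries where the two are comparable, then test whether the LLM ranker's win rate departs from the 50% expected of an unbiased judge. The `PairwiseJudgment` structure and the simulated data are hypothetical placeholders.

```python
from dataclasses import dataclass
from math import comb
import random


@dataclass
class PairwiseJudgment:
    """One head-to-head comparison shown to the LLM judge."""
    query: str
    llm_ranker_won: bool  # True if the judge preferred the LLM-based ranker's list


def judge_preference_rate(judgments: list[PairwiseJudgment]) -> float:
    """Fraction of comparisons in which the judge preferred the LLM-based ranker.

    For an unbiased judge on equally good rankings this should sit near 0.5;
    values well above 0.5 would indicate the kind of judge-to-ranker bias
    the paper reports.
    """
    wins = sum(j.llm_ranker_won for j in judgments)
    return wins / len(judgments)


def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided binomial sign test against p = 0.5 (stdlib only)."""
    k = max(wins, n - wins)  # take the larger tail, then double it
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical data: simulate a judge that prefers the LLM ranker 65% of the time.
    random.seed(0)
    data = [PairwiseJudgment(f"q{i}", random.random() < 0.65) for i in range(200)]
    wins = sum(j.llm_ranker_won for j in data)
    print(f"LLM-ranker win rate: {judge_preference_rate(data):.2%}")
    print(f"sign-test p-value vs. 0.5: {sign_test_p_value(wins, len(data)):.4f}")
```

A rate significantly above 0.5 on held-out queries with matched human relevance labels is one simple way to operationalize the bias the paper describes; the paper itself may use a different protocol.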