ShiQ: Bringing back Bellman to LLMs
Best AI papers explained - A podcast by Enoch H. Kang

This paper introduces ShiQ, an offline reinforcement learning algorithm that adapts traditional Q-learning to the fine-tuning of large language models (LLMs). The authors address the main obstacles to applying Q-learning in this setting, notably computational cost and poor initialization, by deriving theoretically grounded loss functions directly from the Bellman equations. The resulting objective supports off-policy, token-wise learning. ShiQ is evaluated on both synthetic and real-world benchmarks, including multi-turn settings, where it performs favorably against existing methods such as DPO and CoPG, and the paper details the theoretical basis behind these results.
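For intuition, here is a minimal PyTorch sketch of a one-step soft Bellman residual computed over token logits, the kind of token-level objective this family of methods builds on. This is an illustrative simplification, not ShiQ's actual loss: the function name, tensor shapes, and hyperparameters (`tau`, `gamma`) are assumptions made for the example, and the paper's shifted, multi-step formulation differs.

```python
import torch

def soft_bellman_loss(logits, actions, rewards, mask, tau=1.0, gamma=1.0):
    """Illustrative token-level soft Bellman residual (NOT the exact ShiQ loss).

    logits:  (B, T, V) model logits, interpreted here as per-token Q-values
    actions: (B, T)    token ids actually taken in the offline sequence
    rewards: (B, T)    per-token rewards (often zero except at the final token)
    mask:    (B, T)    1.0 for valid tokens, 0.0 for padding
    """
    # Q(s_t, a_t): the logit of the token that was actually taken
    q_taken = logits.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Soft state value: V(s_t) = tau * logsumexp(Q(s_t, .) / tau)
    v = tau * torch.logsumexp(logits / tau, dim=-1)  # (B, T)

    # Shift left to get V(s_{t+1}); assume terminal value 0 at the last step
    v_next = torch.cat([v[:, 1:], torch.zeros_like(v[:, :1])], dim=1)

    # One-step Bellman target; detach so gradients do not flow into the target
    target = rewards + gamma * v_next.detach()

    # Squared Bellman residual, averaged over valid (unmasked) tokens
    return ((q_taken - target) ** 2 * mask).sum() / mask.sum()
```

In this view, the LLM's logits double as Q-values, and a sequence-level reward placed on the final token propagates backward through the logsumexp value terms; ShiQ's contribution is to reshape this kind of objective so that an LLM initialized from its pretrained weights is already well-conditioned for learning.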