Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Best AI papers explained - A podcast by Enoch H. Kang

This paper introduces Planning with a Natural Language Critic (PNLC), a novel approach for improving the planning capabilities of large language models (LLMs) in complex interactive tasks without relying on computationally expensive reinforcement learning (RL) fine-tuning or extensive inference-time search. PNLC trains a lightweight, goal-conditioned value function offline that predicts the likelihood of various future outcomes given a thought or strategy proposed by the LLM agent. At inference time, this value function acts as a natural language critic, giving the LLM feedback on the potential positive and negative consequences of its thoughts so that it can refine its reasoning and actions efficiently. Experiments on interactive tasks such as web shopping, social deduction, and persuasion show that PNLC outperforms existing RL and prompting methods in both task performance and efficiency, and that it scales to larger LLMs.
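To make the mechanism concrete, below is a minimal Python sketch of the inference-time loop described above: the agent proposes a thought, a goal-conditioned value function scores candidate future outcomes, and the scores are rendered as natural-language feedback. The scoring heuristic, function names, and example strings are all illustrative assumptions for this sketch; the paper trains the value function offline on logged interactions rather than using a placeholder like this.

```python
# Minimal sketch of PNLC-style inference, under assumed names and a
# placeholder value function (not the authors' implementation).

from dataclasses import dataclass

@dataclass
class OutcomeEstimate:
    outcome: str       # natural-language description of a possible future outcome
    likelihood: float  # predicted probability of reaching that outcome

def value_function(thought: str, outcome: str) -> float:
    """Stand-in for the lightweight goal-conditioned value function.
    In the paper this is trained offline; here we use a hypothetical
    token-overlap heuristic purely so the sketch runs end to end."""
    t, o = set(thought.lower().split()), set(outcome.lower().split())
    return len(t & o) / max(len(o), 1)

def natural_language_critic(thought: str, outcomes: list[str], k: int = 2) -> str:
    """Turn value estimates into natural-language feedback for the LLM."""
    scored = sorted(
        (OutcomeEstimate(o, value_function(thought, o)) for o in outcomes),
        key=lambda e: e.likelihood,
        reverse=True,
    )
    positives = [e for e in scored if e.likelihood >= 0.5][:k]
    negatives = [e for e in scored if e.likelihood < 0.5][-k:]
    lines = [f"Likely if you proceed: {e.outcome} (p~{e.likelihood:.2f})" for e in positives]
    lines += [f"Unlikely or risky: {e.outcome} (p~{e.likelihood:.2f})" for e in negatives]
    return "\n".join(lines)

# Example: a web-shopping-style thought and candidate outcomes.
thought = "search for a cheap waterproof jacket and compare prices"
outcomes = [
    "find a waterproof jacket under budget",
    "purchase an item that does not match the requirements",
    "compare prices across listings",
]
print(natural_language_critic(thought, outcomes))
```

In the full system, this feedback string would be fed back into the LLM's prompt so the agent can revise its thought before committing to an action, which is what lets PNLC refine plans without any inference-time tree search.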