Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Best AI papers explained - A podcast by Enoch H. Kang

This research investigates how little training data Reinforcement Learning with Verifiable Reward (RLVR) needs to significantly boost the mathematical reasoning abilities of large language models (LLMs). Surprisingly, the authors demonstrate that training on even a single carefully chosen example can match the performance of datasets containing thousands, yielding substantial improvements on mathematical benchmarks. They explore the phenomena that emerge with such limited data, including post-saturation generalization, where test performance continues to improve after training accuracy plateaus; cross-domain generalization to different math topics; and an increase in self-reflection during problem-solving. The study identifies the policy gradient loss as the primary driver of this effectiveness, with entropy loss also contributing by promoting exploration.
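
To picture the finding that the policy gradient term drives the gains while the entropy term aids exploration, here is a minimal PyTorch-style sketch of a REINFORCE-style objective with an entropy bonus. This is an illustration, not the paper's implementation (which builds on a full RLVR pipeline); the function name rlvr_loss, the tensor shapes, and the entropy_coef default are assumptions.

```python
import torch

def rlvr_loss(logits, actions, rewards, entropy_coef=0.01):
    """Illustrative policy-gradient loss with an entropy bonus.

    logits:  (batch, seq_len, vocab) raw model outputs
    actions: (batch, seq_len) sampled token ids
    rewards: (batch,) verifiable reward per rollout, e.g. 1.0 if the
             final answer matches the reference solution, else 0.0
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Log-probability of each sampled token under the current policy.
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Policy gradient term: reward-weighted negative log-likelihood.
    pg_loss = -(rewards.unsqueeze(-1) * action_log_probs).mean()
    # Entropy bonus: penalizes overly peaked distributions, encouraging
    # the exploration the paper credits for part of the effect.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return pg_loss - entropy_coef * entropy
```

With a verifiable binary reward, only rollouts that reach a correct answer contribute gradient signal through the first term, while the entropy term keeps the policy from collapsing onto a single solution path.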