GASP: Guided Asymmetric Self-Play For Coding LLMs

March 16, 2026 · Grace Period · ICLR 2026 Workshop on AI with Recursive Self-Improvement

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors: Swadesh Jana, Cansu Sancaktar, Tomáš Daniš, Georg Martius, Antonio Orvieto, Pavel Kolev
arXiv ID: 2603.15957
Category: cs.LG (Machine Learning)
Citations: 0
Venue: ICLR 2026 Workshop on AI with Recursive Self-Improvement
Abstract
Asymmetric self-play has emerged as a promising paradigm for post-training large language models: a teacher continually generates questions for a student to solve at the edge of the student's learnability. Although these methods promise open-ended data generation bootstrapped without human data, they suffer from one major problem: not every problem that is hard to solve is interesting or informative for improving the model's overall capabilities. Current asymmetric self-play methods are goal-agnostic, with no real grounding. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions identified as posing a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. In doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we solve hard goalpost questions that remain out of reach for all baselines.
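The abstract reports results as pass@20 but this listing does not say how the metric was computed. As a reading aid, here is a minimal sketch of the unbiased pass@k estimator commonly used in the code-generation literature, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. Whether GASP uses this exact estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute pass@20 differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = k = 20, the estimate for a single problem reduces to 1.0 if any of the 20 samples is correct and 0.0 otherwise, so the benchmark score is the fraction of problems solved at least once in 20 attempts.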