Language model reward hacking during a training experiment | AI
How do you know that a language model is actually training on the right data and not just gaming the system?

To catch these talks live and ask your own questions, join the BuzzRobot community: https://join.slack.com/t/buzzrobot/shared_invite/zt-37g5q0ao5-eMK_iDf0n4LAsh1d2qJYnQ

Daniil Tiapkin, a PhD student in Reinforcement Learning, talked to @BuzzRobot about an experiment with knowledge distillation, where a language model is trained to imitate a larger teacher LM, and about how to prevent that language model from "teacher hacking". Teacher hacking is similar to reward hacking, where the LM over-optimizes the reward model.

You can find the full study here: https://arxiv.org/abs/2502.02671

Timestamps:
0:00 Intro
0:23 Reward hacking
4:30 The experiment origins
6:13 What was the experiment
11:08 Teacher hacking
18:55 Mitigating teacher reward hacking
22:09 Conclusions

Join BuzzRobot:
Newsletter: https://buzzrobot.substack.com/
X: https://x.com/sopharicks
Slack: https://join.slack.com/t/buzzrobot/shared_invite/zt-37g5q0ao5-eMK_iDf0n4LAsh1d2qJYnQ

#llms #languagemodel #aitraining #aisafety #techtalk #aiagents