Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores the intrinsic signals in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% f…

PAPER
Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
https://arxiv.org/abs/2605.27354

LISTEN
https://researchpod.app/episode/611e7f79-5e5b-4659-bdb9-99d8d696c41e
https://researchpod.app

AUTHORS
Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

TOPICS
SAERL, Sparse Autoencoder (SAE), Intrinsic Data Properties, Post-training Data Engineering, Sparse Autoencoders (SAEs), Intrinsic Data Engineering, Data Properties for RL, SAE (Sparse Autoencoder), Curriculum Construction, Data Filtering, ElasticNet/Ridge Regression, Diversity-driven Batching

ABOUT RESEARCHPOD
ResearchPod turns research papers into podcast episodes so you can keep up with science while listening.
https://researchpod.app

DISCLAIMER
This is an AI-generated podcast discussion of the paper and is not a substitute for reading the original work.