Can Interpretability Control Model Training?

Name: Can Interpretability Control Model Training?
Uploaded: Nov 23, 2025
Duration: 1349 s
Description: A talk I gave to my MATS 9.0 Training Program on using interpretability to steer finetuning. If this kind of research sounds interesting to you, apply to do research with me in MATS! Due 23 Dec: tinyurl.com/neel-mats-app
0:00:00 Introduction: Three Ways to Steer AI
0:01:45 Ablating Concepts With CAFT
0:07:45 Preventative Steering
0:13:10 Filtering Data with Attribution
0:17:30 Applying to RL?

Neel Nanda, 12.6K subscribers

1.4K views

Nov 23, 2025

22:29
