Can Interpretability Control Model Training?

1.4K views
Nov 23, 2025
22:29

A talk I gave to my MATS 9.0 Training Program on using interpretability to steer finetuning. If this kind of research sounds interesting to you, apply to do research with me in MATS! Due 23 Dec: tinyurl.com/neel-mats-app

0:00:00 Introduction: Three Ways to Steer AI
0:01:45 Ablating Concepts With CAFT
0:07:45 Preventative Steering
0:13:10 Filtering Data with Attribution
0:17:30 Applying to RL?

