A talk I gave to my MATS 9.0 Training Program on using interpretability to steer finetuning.
If this kind of research sounds interesting to you, apply to do research with me in MATS! Applications are due 23 Dec: tinyurl.com/neel-mats-app
0:00:00 Introduction: Three Ways to Steer AI
0:01:45 Ablating Concepts With CAFT
0:07:45 Preventative Steering
0:13:10 Filtering Data with Attribution
0:17:30 Applying to RL?
Can Interpretability Control Model Training?