I made a video about one of my favorite papers! I hope you enjoy :)
===Summary===
"Applying Sparse Autoencoders to Unlearn Knowledge in Language Models" investigates using SAEs—tools that peer into the inside of LLMs—to remove undesirable capabilities from language models. In this video, I walk through the motivation of this work, the methods used, and the interesting results the authors found.
I highly recommend you read it for yourself here: https://arxiv.org/pdf/2410.19278
===My other videos on Sparse Autoencoders===
Matryoshka SAEs: https://youtu.be/GLWlS4qYnag?feature=shared
SAEs from the Ground Up: https://youtu.be/TKozVZoXAYs?feature=shared
===Video Chapters===
0:00 Intro
0:14 Context/Motivation
0:46 SAE Negative Clamping
1:01 Feature Identification
1:35 Experimental Setup
1:49 Single-Feature Steering
2:24 Multi-Feature Steering
3:29 Investigating RMU Hypothesis