Sparse Autoencoders Unlearn Knowledge in LLMs | A Paper-Based Walkthrough

3.8K views
May 24, 2025
5:01

I made a video about one of my favorite papers! I hope you enjoy :)

===Summary===
"Applying Sparse Autoencoders to Unlearn Knowledge in Language Models" investigates using SAEs, interpretability tools that peer inside the internals of LLMs, to remove undesirable capabilities from language models. In this video, I walk through the motivation for this work, the methods used, and the interesting results the authors found. I highly recommend reading the paper for yourself here: https://arxiv.org/pdf/2410.19278

===My other videos on Sparse Autoencoders===
Matryoshka SAEs: https://youtu.be/GLWlS4qYnag?feature=shared
SAEs from the Ground Up: https://youtu.be/TKozVZoXAYs?feature=shared

===Video Chapters===
0:00 Intro
0:14 Context/Motivation
0:46 SAE Negative Clamping
1:01 Feature Identification
1:35 Experimental Setup
1:49 Single-Feature Steering
2:24 Multi-Feature Steering
3:29 Investigating RMU Hypothesis
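To give a rough sense of the "SAE negative clamping" idea covered in the video: an SAE encodes a model activation into sparse features, selected features associated with the unwanted knowledge are clamped to a negative value, and the result is decoded back into the activation stream. The sketch below is a minimal toy illustration, not the paper's implementation; the SAE weights are random placeholders and `clamp_ids` and `clamp_value` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy dimensions; real SAEs are far wider

# Placeholder SAE weights (a real SAE is trained to reconstruct activations).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))

def sae_negative_clamp(x, clamp_ids, clamp_value=-5.0):
    """Encode activation x, force selected features to a negative value, decode."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    f[clamp_ids] = clamp_value              # negative clamping of chosen features
    return f @ W_dec                        # steered activation fed back to the model

x = rng.normal(size=d_model)
x_steered = sae_negative_clamp(x, clamp_ids=[3, 7])
print(x_steered.shape)  # (16,)
```

The intuition is that pushing a knowledge-related feature strongly negative actively suppresses the corresponding direction in the residual stream, rather than merely zeroing it out.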

