
How Will Mech Interp Help Make AGI Safe?

2.1K views
Nov 15, 2025
48:59

This is a talk I gave to my MATS 9.0 trainee scholars about my theories of change for how mechanistic interpretability can help make AGI safe, and how this impacts what research should be done.

Notes: https://docs.google.com/document/d/1dKAjGPdKdyemy5rZUI96nYwNDonKfXM6H7p58FF5rcE/edit?usp=sharing

00:00 Why Interpretability? The North Star
05:13 How Interpretability Helps Make Aligned AGI
11:50 What Does 'AI Alignment' Mean?
20:32 Spotting Real Misalignment
28:35 What Happens After We Build AGI?
33:40 What Makes Basic Science Useful?
41:06 Precision vs. Completeness

