This is a talk I gave to my MATS 9.0 trainee scholars about my theories of change for how mechanistic interpretability can help make AGI safe, and how this impacts what research should be done.
Notes: https://docs.google.com/document/d/1dKAjGPdKdyemy5rZUI96nYwNDonKfXM6H7p58FF5rcE/edit?usp=sharing
00:00 Why Interpretability? The North Star
05:13 How Interpretability Helps Make Aligned AGI
11:50 What Does 'AI Alignment' Mean?
20:32 Spotting Real Misalignment
28:35 What Happens After We Build AGI?
33:40 What Makes Basic Science Useful?
41:06 Precision vs. Completeness
How Will Mech Interp Help Make AGI Safe?