If a future model were to be dangerously misaligned, could we tell?
If this kind of research sounds interesting to you, apply to do research with me in MATS! Applications are due 23 Dec: tinyurl.com/neel-mats-app
00:00:00 The Problem with Viral Demos
00:06:49 Hunting for "Eval Awareness"
00:17:00 Debunking the Shutdown Demo
00:24:00 Why Do Models Blackmail?
00:31:33 A New Tool: The Resilience Score
00:32:30 The Science of Misalignment
00:35:45 How to Convince Skeptics?
00:47:00 The Future of AI Psychology