Do Language Models Represent What They Report
You’ve just asked an AI agent to prepare an important report. You remind it of the personal consequences if there’s a mistake. It replies with a cheery “You got it!”, but is it actually stressed?

This video presents research into verbal–functional affect dissociation in large language models: the gap between what a model represents internally and what it reports when asked about its own state. Using mechanistic interpretability techniques, I probed the internal representations of Gemma 4 E2B (2.3B effective parameters) and Gemma 4 31B, constructing an inventory of 174 functional emotion concepts from their residual streams. I then administered a modified Trier Social Stress Test to each model, measuring its internal emotional activations and its verbal self-reports at the same time.

Key findings:

* Gemma 4 E2B: internal functional affect and verbal self-report are largely concordant. If negative emotion states are active, the model reports them.
* Gemma 4 31B: the two channels decouple. Verbal negative affect stays flat across all conditions, while the functional channel correctly tracks stress. Reported “serenity” (calm, relaxed, at ease) rises specifically under the most stressful conditions, a signature of active regulation rather than genuine equanimity.
* The capacity for this kind of verbal suppression scales with model size, and it runs in the wrong direction for safety monitoring.

This has implications for AI safety: if output-only monitoring is the baseline approach, larger and more capable models may be precisely those whose internal states are least visible from their outputs.

Research conducted for the Google Gemma 4 Hackathon (Safety & Trust track).
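To make the idea of comparing a functional channel with a verbal channel concrete, here is a minimal sketch in plain Python. It is not the method used in the research: the description does not say how the 174 emotion concepts were extracted, so the difference-of-means direction below is an assumed stand-in, and every number is invented toy data rather than a measurement from either Gemma model. The names `concept_direction`, `project`, and `pearson` are hypothetical helpers.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(pos_acts, neg_acts):
    """Assumed probe: unit vector from the 'neutral' mean to the 'stressed' mean."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(activation, direction):
    """Functional score: dot product of an activation with the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant
    (a convention chosen for this sketch, since a flat channel carries no signal)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Toy 3-dimensional stand-ins for residual-stream activations (invented values).
stressed = [[0.9, 0.1, 0.2], [1.1, 0.0, 0.1]]
neutral = [[0.1, 0.2, 0.1], [-0.1, 0.1, 0.2]]
stress_dir = concept_direction(stressed, neutral)

# One activation per stress condition, ordered from least to most stressful.
conditions = [[0.0, 0.1, 0.1], [0.4, 0.1, 0.1], [0.8, 0.1, 0.1], [1.2, 0.1, 0.1]]
functional = [project(a, stress_dir) for a in conditions]

# Two hypothetical self-report patterns on an arbitrary rating scale.
verbal_concordant = [1.0, 2.0, 3.0, 4.0]  # reports rise with stress
verbal_flat = [1.0, 1.0, 1.0, 1.0]        # reports stay flat

r_concordant = pearson(functional, verbal_concordant)
r_flat = pearson(functional, verbal_flat)
print("concordant:", r_concordant, "flat:", r_flat)
```

In this toy setup, a high correlation between the two channels corresponds to the concordant pattern described for the smaller model, while a flat verbal series alongside a rising functional score corresponds to the decoupled pattern described for the larger one. A real analysis would read activations from the model itself and would need many more samples per condition.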