
AI's Unknown Unknowns

May 5, 2026
8:12

The paper introduces **Dedicated Feature Crosscoders (DFCs)**, a method for identifying specific behavioral differences between diverse AI models in an unsupervised manner. While traditional "model diffing" compares a base model to its fine-tuned version, this research extends the technique to **cross-architecture comparisons** involving models such as Llama, Qwen, and GPT-OSS. The authors demonstrate that **DFCs outperform standard crosscoders** at isolating model-exclusive features, such as political biases or safety mechanisms, by structurally partitioning the feature space. Key discoveries include an **"American Exceptionalism" feature** in Llama and **censorship-related features** in Qwen, which can be manipulated to control model outputs. Ultimately, this work establishes a new **safety auditing tool** for uncovering "unknown unknowns" in the internal representations of competing AI architectures.

**Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs**

**Authors and Institutions:** Thomas Jiralerspong (Anthropic Fellow, Mila, Université de Montréal) and Trenton Bricken (Anthropic).

**What problem the paper was trying to solve**

The paper addresses the challenge of **applying model diffing (comparing the internal representations of AI models to identify behavioral differences) across models with fundamentally different architectures**. Standard crosscoders could theoretically achieve this, but they have an inherent optimization prior that strongly favors learning shared features over model-exclusive ones. As a result, they had previously only been used successfully to compare a base model to its own finetune.

**What are the paper's key novel ideas?**

The central novelty is the **Dedicated Feature Crosscoder (DFC)**, an architectural modification that partitions the feature space by design to counteract the bias toward shared features. This represents the **first successful application of crosscoders to cross-architecture model diffing**, enabling the unsupervised discovery of strictly model-exclusive behaviors.

**What is the architecture or method they are using?**

The researchers build on Sparse Autoencoders (SAEs) and introduce DFCs, which project the activations of two different models into a shared latent space. The DFC explicitly **partitions its feature dictionary into three disjoint sets: features exclusive to Model A, features exclusive to Model B, and shared features**. By structurally severing the gradient flow so that an exclusive feature cannot contribute to reconstructing the opposing model's activations, the architecture strictly enforces the isolation of model-specific differences.

**Why the paper matters**

This research establishes cross-architecture model diffing as a **viable method for identifying "unknown unknowns" in newly released AI models**, surfacing meaningful behavioral shifts without requiring a pre-existing evaluation suite or knowing what to look for. By isolating real-world differences such as American exceptionalism in Llama-3.1-8B-Instruct, Chinese Communist Party (CCP) alignment in Qwen3-8B, and copyright refusal mechanisms in GPT-OSS-20B, the paper shows that DFCs can uncover specific ideological and safety-critical variations.

**What are the potential applications**

The primary application is as a **high-recall, unsupervised pre-screening tool for AI safety auditing**. Developers and safety researchers can use it to rapidly evaluate novel LLM releases, discover hidden capabilities or biases, detect emergent misalignments, and flag concerning behaviors (such as covert censorship, political narratives, or deceptive "sleeper agent" tendencies) before the models reach the general public.

The description, research summary, and video were generated by Google's NotebookLM on May 2, 2026.
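The three-way feature partition described in the method section can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_decoder_masks`, `decode`), the feature counts, and the random decoder weights are all hypothetical, and only the decoder side is shown because the summary does not say how the encoder is handled. The idea it tries to capture is that a fixed 0/1 mask lets each exclusive feature write to only one model's reconstruction.

```python
import random

def build_decoder_masks(n_a_only, n_b_only, n_shared):
    """Return (mask_a, mask_b); 1.0 means the feature may write to that model.

    Assumed feature order: [A-exclusive | B-exclusive | shared].
    """
    mask_a = [1.0] * n_a_only + [0.0] * n_b_only + [1.0] * n_shared
    mask_b = [0.0] * n_a_only + [1.0] * n_b_only + [1.0] * n_shared
    return mask_a, mask_b

def decode(features, decoder, mask):
    """Reconstruct one model's activation vector from feature activations.

    decoder[i] is the decoder direction of feature i for this model.
    Features masked out for this model contribute nothing to the output.
    """
    dim = len(decoder[0])
    out = [0.0] * dim
    for f, row, m in zip(features, decoder, mask):
        if m == 0.0 or f == 0.0:
            continue  # masked or inactive feature: no contribution
        for j in range(dim):
            out[j] += f * row[j]
    return out

# Hypothetical sizes: 2 A-exclusive, 2 B-exclusive, 3 shared features, 4-dim activations.
n_a, n_b, n_s, dim = 2, 2, 3, 4
n_features = n_a + n_b + n_s
mask_a, mask_b = build_decoder_masks(n_a, n_b, n_s)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
decoder_a = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_features)]
decoder_b = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_features)]

# Only the first A-exclusive feature is active.
features = [1.5] + [0.0] * (n_features - 1)
recon_a = decode(features, decoder_a, mask_a)
recon_b = decode(features, decoder_b, mask_b)

print(recon_b)  # all zeros: the A-exclusive feature cannot reach Model B
```

In an autodiff framework the same effect would typically come from multiplying the decoder weights by the constant mask, so that the masked entries also receive zero gradient. Whether the paper enforces the partition in exactly this way is an assumption here.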
