The paper introduces DeepSeek-V4, a series of large language models designed to process contexts of up to one million tokens efficiently. The series comprises two primary models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, both of which use a Mixture-of-Experts architecture to balance high performance against computational cost. Key technical advances include a hybrid attention mechanism that significantly shrinks the memory footprint of the KV cache and a specialised optimiser called Muon to improve training stability. On benchmarks, the Pro-Max variant achieves state-of-the-art results among open models, particularly in complex reasoning, coding, and long-horizon tasks. The authors also detail a post-training pipeline that employs domain-specific experts and on-policy distillation to refine the models' agentic and mathematical capabilities. Overall, the release establishes a new foundation for test-time scaling and the routine handling of million-token contexts.
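To give a sense of why KV cache size matters at this context length, the following is a minimal sketch of the usual back-of-the-envelope estimate for a standard (uncompressed) attention cache. The function name and every dimension below are hypothetical placeholders chosen for illustration; they are not DeepSeek-V4's actual configuration, and the sketch does not model the paper's hybrid attention mechanism.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for one sequence
    under standard attention (no compression, no sharing across layers)."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model dimensions, NOT taken from the paper.
baseline = kv_cache_bytes(
    seq_len=1_000_000,  # one-million-token context
    n_layers=60,
    n_kv_heads=8,
    head_dim=128,
    bytes_per_elem=2,   # 16-bit values
)

print(f"Uncompressed KV cache: {baseline / 2**30:.1f} GiB")
```

Because the estimate grows linearly with sequence length, any mechanism that reduces the per-token cache entry reduces the total footprint by the same factor, which is why cache compression becomes central at million-token scale.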