Architecture of DeepSeek-V3

Learn about innovations in DeepSeek models’ architecture.

DeepSeek’s breakthrough isn’t just about making cutting‑edge AI accessible. The real innovation lies in its engine: a thoughtfully reimagined architecture that goes beyond simply adding more parameters. DeepSeek-V3 has a massive 671 billion parameters, but here’s the trick—it only uses 37 billion per token. This allows it to be incredibly powerful while keeping computations efficient and lightweight.

Rather than scaling up naively and running into massive computational costs, memory demands, and inefficiencies, DeepSeek employs optimization techniques to tackle these challenges. It leverages the Mixture-of-Experts (MoE) framework, which selectively activates only the most relevant parts of the model for each task, significantly reducing memory usage and computational overhead while maintaining high performance. Further enhancements, such as Multi-Head Latent Attention and Multi-Token Prediction, boost efficiency even more, allowing the model to handle long-context tasks and diverse data with ease.

This clever design makes DeepSeek not only smarter—with improved reasoning and understanding—but also cheaper and more efficient. The result is an AI system that delivers GPT‑4o‑level performance at a fraction of the cost, empowering researchers, startups, and enterprises to innovate without the typical prohibitive expenses.

In this lesson, we’ll understand the layers of DeepSeek’s architecture. We’ll begin by demystifying the Mixture of Experts paradigm—a key innovation underpinning DeepSeek’s efficiency—and then explore how DeepSeek has pushed the boundaries even further with its own unique enhancements.

What is a Mixture of Experts (MoE)?

DeepSeek models are built on the transformer architecture, which can be thought of as a bustling factory where every worker (or neuron) processes every task. Transformers use self‑attention and feed‑forward networks to understand context and generate language. What sets DeepSeek apart is the integration of the Mixture‑of‑Experts (MoE) approach—a strategy that brings in specialized teams of submodels only when needed, rather than having every part of the network work on every task.

Imagine you’re at a busy gourmet coffee shop that offers a wide range of specialty drinks, each crafted by a barista with unique expertise. Instead of having every barista make every drink, a smart ordering system directs your order to the barista best suited for that particular beverage. This is the essence of MoE: rather than processing each input through the entire network, a gating function acts like that smart ordering system—selecting only a small subset of experts (specialized submodels) for each input token.

In traditional MoE, there are two common approaches:

  • Dense MoE: All experts are activated for every input. While this can improve accuracy, it is computationally heavy.

  • Sparse MoE: Only the top‑k experts are activated for each input, dramatically reducing the computation required. Most large‑scale models, including those in DeepSeek’s family, use this sparse method.

How do these sparse MoE systems work? Instead of having every expert work on every task, a smart system called the gating function decides which few experts are best suited for each specific input—much like a manager who assigns tasks only to the most appropriate team members. However, if the manager always picks the same few experts, those specialists can become overworked while others sit idle, which isn’t efficient. Older models dealt with this by adding extra rules to force a fair distribution of tasks among all experts.
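To make the gating idea concrete, here is a minimal sketch of a sparse top-k gate in PyTorch. It scores every expert for each token, keeps only the k highest-scoring experts, and renormalizes their weights; the expert count, hidden size, and k below are illustrative assumptions rather than DeepSeek's actual configuration.

```python
import torch
import torch.nn.functional as F

def topk_gate(tokens, gate_weights, k=2):
    """Score every expert for each token and keep only the top-k.

    tokens:       (num_tokens, hidden_dim) token representations
    gate_weights: (hidden_dim, num_experts) learnable gating matrix
    Returns the chosen expert indices and their normalized weights.
    """
    scores = tokens @ gate_weights                     # (num_tokens, num_experts) affinities
    probs = F.softmax(scores, dim=-1)                  # distribution over experts per token
    top_vals, top_idx = torch.topk(probs, k, dim=-1)   # keep only the k best experts
    top_vals = top_vals / top_vals.sum(dim=-1, keepdim=True)  # renormalize over the chosen experts
    return top_idx, top_vals

# Illustrative sizes: 8 experts, 16-dim hidden states, 4 tokens
torch.manual_seed(0)
chosen, weights = topk_gate(torch.randn(4, 16), torch.randn(16, 8), k=2)
print(chosen)  # each row lists the 2 experts that will process that token
```

Only the experts listed in `chosen` run for a given token; the rest are skipped entirely, which is where the compute savings come from. Notice that nothing here stops the gate from favoring the same few experts, which is exactly the imbalance that older models countered with extra balancing rules.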

Dense vs. Sparse MoE architecture

DeepSeek‑V3 takes a smarter approach: It automatically adjusts how tasks are assigned so that every expert is used evenly without needing any extra balancing rules. It also breaks each expert into smaller parts, allowing the system to handle even more detailed or nuanced information within each expert’s knowledge.
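The idea can be illustrated with a rough sketch: each expert carries a bias that is added to its affinity score only when deciding which experts to route to, while the final output weighting still uses the original scores; after each batch, the bias is nudged down for overloaded experts and up for underused ones. The step size and the per-batch bookkeeping below are simplified assumptions, not DeepSeek's exact training procedure.

```python
import torch

def route_with_bias(scores, expert_bias, k=2):
    """Pick top-k experts from bias-adjusted scores, but weight outputs by the raw scores.

    scores: (num_tokens, num_experts) affinities in (0, 1), e.g. sigmoid of token-expert dot products
    """
    adjusted = scores + expert_bias                    # bias only influences *which* experts are chosen
    _, top_idx = torch.topk(adjusted, k, dim=-1)
    top_scores = torch.gather(scores, -1, top_idx)     # original scores still drive the weighting
    weights = top_scores / top_scores.sum(dim=-1, keepdim=True)
    return top_idx, weights

def update_bias(expert_bias, top_idx, num_experts, step=0.001):
    """Lower the bias of overloaded experts and raise it for underused ones."""
    load = torch.bincount(top_idx.flatten(), minlength=num_experts).float()
    expert_bias -= step * torch.sign(load - load.mean())
    return expert_bias

# Illustrative usage: 6 experts, affinities for 5 tokens
torch.manual_seed(0)
scores = torch.sigmoid(torch.randn(5, 6))
bias = torch.zeros(6)
idx, w = route_with_bias(scores, bias, k=2)
bias = update_bias(bias, idx, num_experts=6)
```

Because the bias never enters the final weighted sum, it steers traffic toward idle experts without distorting the model's outputs the way an auxiliary balancing loss can.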

Educative byte: While sparsely gated MoE layers were introduced for deep learning as early as 2017, the Google Switch Transformer, launched in 2021, showcased the practical power of scaling sparse MoE architectures for large-scale AI, demonstrating impressive efficiency and performance.

Now that we’ve explored the foundational ideas behind MoE and how DeepSeek enhances this approach, let’s see these innovative techniques in action and understand how they contribute to DeepSeek’s remarkable performance.

What are the key components of DeepSeek’s MoE?

The MoE framework is built from several fundamental components: the gating mechanism that dynamically selects the appropriate experts, the experts themselves (specialized submodels that process different aspects of the input), and the load balancing and routing strategies that ensure efficient, balanced use of these experts. The key components are summarized below:

  • Experts: Specialized submodels that process different aspects of the input. They form the building blocks of the MoE by partitioning the FFN layers.

  • Gating Mechanism: A dynamic “smart switch” that selects the most relevant experts for each input token, ensuring that only a subset of the model is engaged.

  • Load Balancing and Routing: Strategies that distribute tasks evenly among experts, preventing some from being overused while others remain idle.
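To see how these pieces fit together, here is a minimal sparse MoE layer sketch in PyTorch: a set of small expert FFNs, a gate that scores them, and a routing step that sends each token to its top-k experts and combines their outputs. The sizes and the simple per-token loop are illustrative; real systems batch tokens per expert for speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse MoE layer: experts + gating + top-k routing (illustrative sizes)."""

    def __init__(self, hidden_dim=16, ffn_dim=32, num_experts=4, k=2):
        super().__init__()
        self.k = k
        # Experts: independent feed-forward sub-networks that partition the FFN
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])
        # Gating mechanism: scores each expert for each token
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x):                              # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.gate(x), dim=-1)
        top_w, top_idx = torch.topk(probs, self.k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Routing: each token is processed only by its chosen experts
        for t in range(x.size(0)):
            for w, i in zip(top_w[t], top_idx[t]):
                out[t] = out[t] + w * self.experts[int(i)](x[t])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(5, 16)).shape)  # torch.Size([5, 16])
```

Even in this toy version, each token touches only 2 of the 4 experts, so only half of the expert parameters do any work for a given token.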

Now, let’s learn how DeepSeek takes these traditional components and enhances them for even greater efficiency and performance.

How are DeepSeek experts different?

In typical MoE designs, the transformer’s feed‑forward network (FFN) is split into several independent experts. Each expert is a sub‑network that, when activated, processes the input token. Traditionally, the model routes a token to a few experts (usually selected via top‑k from a computed score) without any further subdivision within an expert. DeepSeek‑V3 refines this basic idea in two significant ways.

DeepSeek's variation of mixture of experts

Firstly, DeepSeek‑V3 distinguishes between two types of experts. It divides the experts into:

  • Shared experts: These experts are always available, providing a stable processing backbone for every token.

  • Routed experts: These experts are conditionally activated based on the input token, allowing the model to leverage specialized, context‑dependent expertise.

This dual structure improves efficiency and enhances the model's ability to handle diverse tasks.

Secondly, DeepSeek-V3 applies fine-grained segmentation to its routed experts. Rather than treating each routed expert as a single monolithic block, DeepSeek-V3 partitions them into multiple smaller segments. This fine-grained expert segmentation enables the model to capture subtle nuances in the data, as each segment can focus on different aspects of the input. For those who want a more in-depth understanding, if we denote the FFN input for token t as $\mathbf{u}_t$, the DeepSeek-V3 FFN output $\mathbf{h}_t'$ is given by:

$$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i^{(r)}(\mathbf{u}_t)$$
Where:

  • $N_s$ is the number of shared experts that are always available

  • $N_r$ is the number of routed experts ...