2/18/2025

The Emergence of DeepSeek-R1 and What We Must Not Overlook – Part 1

The blog explores the limitations of AI pre-training, focusing on data copyright issues and the constraints of the scaling law. It highlights legal challenges restricting high-quality training data and diminishing returns from expanding model size. The article introduces DeepSeek-R1, an AI model optimizing inference and reasoning through innovative architectures like Multi-Latent Attention and Mixture of Experts, showcasing a shift towards reasoning-based advancements.

Executive Summary

AI development is reaching a turning point as traditional pre-training approaches face diminishing returns. The scaling law’s effectiveness is weakening, with performance gains slowing despite exponential increases in model size and data volume. Data scarcity and copyright restrictions further hinder AI advancements, forcing a shift toward inference-based strategies. DeepSeek-R1 exemplifies this shift, leveraging Multi-Latent Attention (MLA) and Mixture of Experts (MoE) to improve reasoning efficiency.

Key Takeaways:

1. Scaling Law Limitations – Performance gains are slowing despite increased resources. GPT-4 required 10× more computational power than GPT-3 for only a 30% improvement in accuracy, signaling diminishing returns.

2. Copyright Challenges – Legal restrictions are limiting access to training data. The NYT vs. OpenAI and Getty Images vs. Stability AI lawsuits reflect a growing trend, with over 100 copyright cases filed against AI companies in 2023 alone.

3. Inference Optimization – DeepSeek-R1's MoE and MLA architectures reduce memory usage by 40% and speed up inference by 30%, allowing smaller models to match or exceed larger competitors in reasoning tasks.

4. AI Efficiency Gains – Reinforcement learning-driven models are achieving state-of-the-art results. DeepSeek-R1 outperforms previous-generation models by 15% in mathematics, 20% in coding, and 25% in logical reasoning benchmarks, proving that reasoning-focused AI can surpass brute-force scaling.

This marks a paradigm shift in AI, where the focus is moving from scaling to efficient inference and enhanced reasoning capabilities to drive future advancements.



1. Limitations of AI Pre-training: Data Copyright Issues and Constraints of the Scaling Law

1.1 Development of LLMs Through Scaling Laws and Pre-training

[Figure: LLMs have grown exponentially in scale every year.]

Large Language Models (LLMs) have evolved based on the scaling law, which states that increasing model size and data volume improves performance. In fact, the latest models like GPT-4, Gemini, and Claude have demonstrated outstanding capabilities in tasks such as text generation, translation, and information retrieval by leveraging vast amounts of data and computational resources.

For example, GPT-4 has significantly outperformed previous-generation models in tasks like grammar correction and code generation, driving groundbreaking advancements across the AI field. These cases clearly illustrate that the scaling law has been a key principle in the development of LLMs.

1.2 Limitations of the Scaling Law and New Challenges

However, the AI industry is now facing a situation where simply expanding data and computational resources is no longer enough to ensure continuous performance improvements. The primary reasons cited are the scarcity of high-quality training data and restricted access to data, which are gradually weakening the effectiveness of the scaling law.

Curious about the next evolution of AI beyond scaling? Read our deep dive into DeepSeek-V3 and how it’s shaping the future of AI agents with Allganize’s LLM platform.

1.3 Data Copyright Issues: The Cause of High-Quality Data Shortage

For AI models, the diversity and quality of training data are crucial. However, as legal and ethical controversies grow, companies and content providers are becoming increasingly reluctant to allow their data to be used for AI training. As a result, AI companies are struggling to secure new high-quality data.

Cases:

- The New York Times vs. OpenAI: the NYT alleges that its articles were used without permission to train OpenAI's models.
- Getty Images vs. Stability AI: Getty claims that millions of its copyrighted images were used without a license to train image-generation models.

These lawsuits highlight the increasing restrictions on AI models’ ability to freely utilize data, posing a major challenge to further improving performance through pre-training approaches.

1.4 Limits of Expanding Data and Computational Resources

Due to the shortage of high-quality data, the effectiveness of the scaling law is reaching its limits. While increasing model size and data volume still improves performance, the rate of improvement is gradually diminishing.

Expert Opinions:

These perspectives suggest that new approaches are necessary for the continued advancement of AI models.

1.5 Performance Improvement Through Enhanced Reasoning: DeepSeek-R1

As the limitations of pre-training performance improvements become evident, the AI industry is increasingly focusing on enhancing reasoning abilities as an alternative. Models optimized for reasoning can outperform others of the same size, and in some cases, smaller models have even surpassed larger ones.

The Case of DeepSeek-R1

The Chinese AI startup DeepSeek has demonstrated this shift through its DeepSeek-R1 model, which combines reinforcement learning and inference optimization techniques. This model excels in mathematical reasoning, complex problem-solving, and coding tasks, highlighting the AI industry's transition toward inference-focused development.

Such examples reinforce the expectation that AI models will continue to advance by enhancing their reasoning capabilities.

[Figure: As training progresses, the model takes more time to think, going through longer reasoning processes and producing more thoughtful responses.]

[Figure: As training progresses, the model goes through more reasoning steps, and its accuracy steadily increases.]

2. DeepSeek-R1 Model

DeepSeek-R1 is a state-of-the-art reasoning AI model developed by the Chinese AI startup DeepSeek, demonstrating exceptional performance in mathematics, coding, and logical reasoning. This large-scale model features 671 billion parameters and has been trained using reinforcement learning to acquire various reasoning patterns, including Chain-of-Thought exploration, self-verification, and reflection.

DeepSeek-R1 has achieved performance comparable to or even surpassing OpenAI's o1 in benchmarks related to mathematics, coding, and logical reasoning.

2.1 DeepSeek-R1 Architecture: MLA & MoE

The architecture of DeepSeek-R1 is optimized for inference, combining Multi-Latent Attention (MLA) and a Mixture of Experts (MoE) structure. These two technologies reduce memory usage during inference and minimize response latency, ensuring efficient reasoning even for large-scale tasks.

[Figure: DeepSeek-R1 architecture (MLA + MoE)]

1) Multi-Latent Attention (MLA)

Modern large language models (LLMs) are primarily based on the Transformer architecture, where Multi-Head Attention (MHA) plays a crucial role. MHA utilizes a Query, Key, and Value (QKV) structure to capture complex data patterns.

To illustrate this with a library analogy: the Query is the question a visitor asks the librarian, the Keys are the catalog entries used to look books up, and the Values are the books themselves. Each new question is compared against the catalog to decide which books to retrieve.

In MHA, the Query constantly changes, while Key and Value remain relatively fixed. Because of this, LLMs store and utilize Key-Value (KV) caches in memory during inference. However, as model sizes grow, KV caches become excessively large, leading to increased memory usage and slower inference speeds.
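To get a feel for why the KV cache becomes a bottleneck, here is a rough back-of-the-envelope estimate. It is a minimal sketch with assumed, illustrative model dimensions, not DeepSeek-R1's actual configuration:

```python
# Rough KV-cache size estimate for standard Multi-Head Attention.
# All dimensions below are illustrative assumptions, not DeepSeek-R1's real configuration.

def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # Each layer caches one Key and one Value vector per head for every token (fp16 = 2 bytes).
    per_token_per_layer = 2 * n_heads * head_dim * bytes_per_value
    return n_layers * seq_len * batch_size * per_token_per_layer

# Hypothetical 7B-class model serving a 32k-token context for 8 concurrent requests.
size = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=32_768, batch_size=8)
print(f"KV cache: {size / 1e9:.0f} GB")  # ~137 GB for the cache alone, before model weights
```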

To address this, DeepSeek proposed Multi-Latent Attention (MLA). MLA improves memory efficiency and enhances inference speed by converting KV into smaller-dimensional latent vectors (DKV).

How MLA Works

Dimensionality Reduction: instead of caching the full Key and Value vectors, the model projects them down into a much smaller latent vector (DKV), and only this compressed vector is stored during inference.

Information Restoration: when attention is computed, the latent vector is projected back up into Key and Value representations, recovering most of the original information on the fly.

As a result, the KV cache (DKV) remains significantly smaller while minimizing information loss.
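The following is a minimal sketch of this down-projection/up-projection idea in PyTorch. The dimensions are made up, causal masking is omitted, and details such as DeepSeek's decoupled rotary position embeddings are left out; it illustrates the latent-compression concept rather than the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttentionSketch(nn.Module):
    """Toy illustration of MLA-style KV compression (all sizes are illustrative)."""

    def __init__(self, d_model=1024, d_latent=128, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # compress hidden states into a small latent (DKV)
        self.k_up = nn.Linear(d_latent, d_model)     # restore Keys from the latent at attention time
        self.v_up = nn.Linear(d_latent, d_model)     # restore Values from the latent at attention time

    def _split(self, t):
        b, s, _ = t.shape
        return t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, x, latent_cache=None):
        # x: (batch, seq, d_model). Only the small latent vector is cached, never the full K/V.
        latent = self.kv_down(x)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self._split(self.q_proj(x))
        k = self._split(self.k_up(latent))             # K/V are reconstructed on the fly
        v = self._split(self.v_up(latent))
        out = F.scaled_dot_product_attention(q, k, v)  # causal masking omitted for brevity
        return out.transpose(1, 2).flatten(2), latent  # the latent is the new, compact cache

x = torch.randn(1, 16, 1024)
y, cache = LatentKVAttentionSketch()(x)
print(y.shape, cache.shape)  # torch.Size([1, 16, 1024]) torch.Size([1, 16, 128])
```

The cache passed between decoding steps is the 128-dimensional latent rather than the full 1024-dimensional Keys and Values, which is where the memory savings come from.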

Advantages of MLA

- The KV cache is dramatically smaller, cutting memory usage during inference.
- Less cached data means less memory traffic, so decoding stays fast even at long context lengths.
- Because the down- and up-projections are learned jointly, the quality loss from compression remains small.

2) Mixture of Experts (MoE)

Expanding large language models (LLMs) to improve performance leads to a sharp increase in computational costs and memory usage. Traditional dense approaches activate all model parameters for every input, causing unnecessary computation and inefficiency. To address this, the Mixture of Experts (MoE) architecture was introduced.

MoE selectively activates only a subset of expert networks (Experts) for each task, maximizing computational and memory efficiency.

DeepSeekMoE further enhances the standard MoE structure by introducing expert specialization and redundancy reduction, improving overall efficiency.

How MoE Works

[Figure: DeepSeekMoE architecture]

Router-Based Expert Selection: a lightweight router scores each token against the available experts and activates only the top-k experts with the highest scores for that token; the rest stay idle.

Expert-Specific Feed-Forward Network (FFN): each expert is an independent feed-forward network, and a token's output is the weighted combination of the outputs of the experts it was routed to (a minimal routing sketch appears after the design features below).

Additional Design Features of DeepSeekMoE

1. Fine-Grained Expert Segmentation: each expert is split into several smaller experts, and more of them are activated per token, allowing individual experts to specialize more narrowly without increasing the total activated computation.

2. Shared Expert Isolation: a few experts are always active for every token and capture common, general-purpose knowledge, reducing redundant knowledge duplicated across the routed experts.
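Below is a minimal top-k routing sketch in PyTorch that combines routed experts with always-active shared experts. Expert counts and layer sizes are made up for illustration and are far smaller than in DeepSeek's models:

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy MoE layer: top-k routed experts plus always-on shared experts (all sizes illustrative)."""

    def __init__(self, d_model=256, d_ff=512, n_routed=8, n_shared=2, top_k=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed)  # one affinity score per routed expert
        self.top_k = top_k

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)      # routing probabilities for each token
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = sum(expert(x) for expert in self.shared)   # shared experts run for every token
        for slot in range(self.top_k):                   # only the selected routed experts run
            for e_id, expert in enumerate(self.routed):
                mask = idx[:, slot] == e_id
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 256)
print(TinyMoELayer()(tokens).shape)  # torch.Size([16, 256]); only 2 of 8 routed experts ran per token
```

In this sketch, fine-grained segmentation corresponds to using more, smaller routed experts with a larger top_k, while the shared experts hold knowledge common to all tokens so the routed experts can specialize.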

Advantages of MoE

- Only a small fraction of the total parameters is activated per token, so inference cost grows far more slowly than total model capacity.
- Capacity can be scaled by adding experts without a proportional increase in computation.
- Expert specialization allows different experts to handle different types of inputs, improving quality for the same activated compute.

3) Combining MLA and MoE

DeepSeek-R1 optimizes inference performance by combining MLA and MoE. MLA enhances memory efficiency by converting the KV cache into a lower-dimensional form, while MoE maximizes computational efficiency during inference by selectively activating only the necessary expert networks.

[Figure: MLA & MoE – before and after]

This allows DeepSeek-R1 to achieve both reduced memory usage and faster inference, while also ensuring high scalability and adaptability in large-scale data processing and various task environments.

2.2 DeepSeek-R1 Training Strategy: 4-Step Post-Training

Post-training is a concept that contrasts with pre-training, referring to the stage where a model, after large-scale pretraining, further learns human preferences and logical reasoning abilities. As mentioned earlier, the performance improvements from pre-training alone are reaching their limits, and post-training is being considered as a potential breakthrough.

The DeepSeek-R1 model is optimized through a four-stage post-training strategy based on the pre-trained DeepSeek-V3 model, leveraging the MLA and MoE architecture. Each stage is designed to enhance the model’s reasoning capabilities and user-friendliness.

Step 1: Cold Start

Objective:
Without human-annotated data in the initial training phase, the model's performance may be unstable or produce inconsistent responses. To address this, a small amount of high-quality cold-start data is used for fine-tuning.

Process:

- A small set of high-quality, long chain-of-thought examples is collected and curated as cold-start data.
- The pre-trained DeepSeek-V3 base model is fine-tuned on this data before reinforcement learning begins, establishing a readable and consistent response format.

Result:
The model fine-tuned with cold-start data demonstrated stable and consistent performance in the initial reinforcement learning phase.

Step 2: Reasoning-Oriented Reinforcement Learning

Objective:
Enhance the model’s performance in complex reasoning-based tasks such as mathematics, coding, and logic through reinforcement learning.

Process:

- Starting from the cold-start checkpoint, large-scale reinforcement learning is applied to reasoning-heavy prompts in mathematics, coding, and logic.
- Rewards are largely rule-based: outputs whose final answers can be verified automatically (e.g., a correct solution or code that passes test cases) receive higher rewards, and adhering to the required output format is also rewarded.
- Policy optimization uses Group Relative Policy Optimization (GRPO), described below.

GRPO Objective Function
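For reference, the GRPO objective introduced in DeepSeek's papers takes roughly the following form (notation simplified; this is a reconstruction, not a copy of the original figure):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\rho_i A_i,\ \mathrm{clip}(\rho_i,\,1-\varepsilon,\,1+\varepsilon)\,A_i\right) - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right)\right]
$$

$$
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}, \qquad
A_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}
$$

where, for each question $q$, a group of $G$ outputs $o_1,\dots,o_G$ is sampled from the previous policy and each output receives a reward $r_i$.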

This formula may look complex at first, but the key idea is simple: standardize the rewards $r_i$ generated by the model into advantages $A_i$ and maximize them. Here's a detailed breakdown:

1. Policy Reward ($r_i$): each of the $G$ sampled outputs for a question receives a scalar reward, for example based on whether its final answer is correct and whether it follows the required format.

2. Advantage ($A_i$): each output's reward is standardized against the other outputs in the same group, $A_i = (r_i - \mathrm{mean}(r_1,\dots,r_G)) / \mathrm{std}(r_1,\dots,r_G)$, so no separate value (critic) network is needed (a toy numeric sketch follows after the safety mechanisms).

3. Policy Update: the policy is updated to increase the probability of outputs with positive advantage and decrease the probability of those with negative advantage.

4. Safety Mechanisms

To ensure stable learning, two constraints are applied: the probability ratio between the new and old policies is clipped to the range $[1-\varepsilon,\,1+\varepsilon]$ (as in PPO), and a KL-divergence penalty keeps the updated policy close to a reference policy.
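As a concrete illustration of the group-relative advantage computation, here is a toy numeric sketch (not DeepSeek's training code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within one group of sampled outputs for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: 6 sampled answers to one question, reward 1.0 if the final answer is correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
# Correct answers get a positive advantage, incorrect ones a negative advantage,
# without needing a learned value network to estimate a baseline.
```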

Advantages of GRPO (Group Relative Policy Optimization)

- No separate critic (value) network is required, substantially reducing the memory and compute cost of reinforcement learning compared with PPO-style methods.
- Standardizing rewards within each group provides a per-prompt baseline, keeping advantage estimates stable.
- Clipping and the KL penalty prevent destructive policy updates during large-scale training.

Result: Achieved high performance in complex reasoning tasks.

Step 3: Rejection Sampling & Supervised Fine-Tuning (SFT)

Objective:
Further enhance the model’s diverse capabilities through additional supervised learning after reinforcement learning.

Process:

- The reinforcement-learning checkpoint generates many candidate responses per prompt, and rejection sampling keeps only the correct, well-formatted ones as new training data (a minimal sketch of this step follows).
- This reasoning data is combined with supervised data covering general capabilities such as writing, question answering, and translation.
- The model is then fine-tuned (SFT) on the combined dataset.
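A minimal sketch of the rejection-sampling step is shown below; generate and is_correct are hypothetical placeholders standing in for the RL checkpoint and an answer verifier, not DeepSeek's actual pipeline:

```python
from typing import Callable, List, Tuple

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     is_correct: Callable[[str, str], bool],
                     n_samples: int = 16) -> List[str]:
    """Sample many candidate responses and keep only those that pass verification."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    return [c for c in candidates if is_correct(prompt, c)]

def build_sft_pairs(prompts: List[str], generate, is_correct) -> List[Tuple[str, str]]:
    # Accepted responses become new supervised fine-tuning examples.
    return [(p, r) for p in prompts for r in rejection_sample(p, generate, is_correct)]
```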

Result:
The model achieved more balanced performance across various tasks.

Step 4: Reinforcement Learning for All Scenarios

Objective:
Enhance both usefulness and harmlessness so the model aligns with diverse tasks and user preferences.

Process:

- A second reinforcement-learning stage is run over prompts covering all scenarios, not just reasoning tasks.
- Rule-based rewards continue to be used for verifiable reasoning data, while reward models trained on human preference data score general prompts.

Evaluation Metrics:

- Helpfulness: evaluated primarily on the final answer presented to the user, focusing on relevance and usefulness.
- Harmlessness: evaluated on the entire response, including the reasoning process, to detect potential risks, bias, or harmful content.

Result:
The model successfully adapted to various data distributions, providing user-friendly and safe responses.

2.3 DeepSeek-R1 Performance

The DeepSeek-R1 model maximized its performance through a four-stage post-training strategy based on the MLA and MoE architecture. It achieved performance comparable to or exceeding competing models, including OpenAI's o1, across various benchmarks. Notably, it demonstrated exceptional results in mathematics, highlighting the effectiveness of large-scale reinforcement learning and reward optimization.

Don’t stop here! Continue reading Part 2 to dive deeper into the paradigm shifts in AI.