From Zero to Reasoning Hero: How DeepSeek-R1 Leverages Reinforcement Learning to Master Complex Reasoning [En/中]

千呼万唤始出来:DeepSeek-R1 如何通过强化学习实现复杂推理

Posted by Yihua Zhang on January 20, 2025

From Zero to Reasoning Hero: How DeepSeek-R1 Leverages Reinforcement Learning to Master Complex Reasoning

It is often said that 2024 was the year of the agent; 2025 is shaping up to be the year of reinforcement learning, and DeepSeek-R1 proves the point. It also underscores how a genuinely “open” AI company can contribute far more to the open-source community than OpenAI does.

1. Introduction

Since the blockbuster release of DeepSeek-V3, DeepSeek has been the shining star of the LLM community. Enthusiasts and experts alike have eagerly awaited an open release of the “DeepSeek-R1-Lite” preview. Now it has arrived, making a grand entrance in the first month of 2025 and redefining how we think about AI reasoning. DeepSeek-R1 breaks with the usual recipe: it relies on massive reinforcement learning (RL), sometimes without any supervised warm-up, to unlock emergent reasoning capabilities, including extended chain-of-thought (CoT), reflection, verification, and even “aha moments.”

In this post, we explore two groundbreaking models in the DeepSeek lineage:

  • DeepSeek-R1-Zero: A model that learns complex reasoning behaviors purely through reinforcement learning without any supervised fine-tuning, showing emergent abilities like extended chain-of-thought, reflection, and self-correction.
  • DeepSeek-R1: Building on R1-Zero, this version incorporates a small amount of high-quality “cold-start” data alongside iterative reinforcement learning and supervised fine-tuning to produce more coherent, user-friendly outputs while maintaining state-of-the-art reasoning performance.

By comparing these models, their training strategies, and the underlying mathematics, we highlight how reinforcement learning is transforming LLM capabilities.

In this post, we will delve into:

  • How DeepSeek-R1-Zero achieved near state-of-the-art reasoning performance without any supervised data.
  • Why DeepSeek-R1 combines a small “cold-start” dataset with iterative RL and supervised fine-tuning to achieve even better user-friendly outputs.
  • How distillation from DeepSeek-R1’s advanced reasoning patterns can transform smaller dense models into powerful mini “reasoning engines.”
  • Lessons learned from exploring different RL mechanisms and why certain approaches fell short in large-scale experiments.

Consider this blog a technical lens into the biggest leaps (and near misses) of the DeepSeek-R1 pipeline.


2. Motivations and Background

2.1. Why Pure RL for Reasoning?

Traditionally, major leaps in LLM reasoning have come from large amounts of carefully annotated data. DeepSeek-R1 challenges that assumption. The key hypothesis is simple yet bold: can we just reward the model for correctness and let it discover the best way to think on its own? By skipping supervised fine-tuning (SFT) entirely (in the DeepSeek-R1-Zero case), the research team lets the LLM find its own chain-of-thought patterns purely from reward signals.

The DeepSeek-R1-Zero approach uses the Group Relative Policy Optimization (GRPO) algorithm, which optimizes the policy without a critic model, saving computational resources. The core of GRPO’s update rule is as follows:

\[\begin{aligned} \mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \Bigg( & \min \Bigg(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{\text{old}}}\left(o_{i} \mid q\right)} A_{i},\ \text{clip}\Big(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{\text{old}}}\left(o_{i} \mid q\right)},\ 1-\varepsilon,\ 1+\varepsilon\Big) A_{i}\Bigg) \\ & -\ \beta\ \mathbb{D}_{KL}\left(\pi_{\theta}\ \|\ \pi_{\text{ref}}\right)\Bigg) \Bigg] \end{aligned}\]

Here, the advantage \(A_i\) for each sample in a group is calculated as:

\[A_{i}=\frac{r_{i}-\text{mean}\left(\left\{r_{1}, r_{2}, \cdots, r_{G}\right\}\right)}{\text{std}\left(\left\{r_{1}, r_{2}, \cdots, r_{G}\right\}\right)}\]

These equations encapsulate the mathematical backbone of how the model learns—optimizing its policy in groups and normalizing rewards to refine decision making without explicit step-by-step guidance.
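To make this concrete, here is a minimal PyTorch-style sketch of the group-relative advantage and the clipped objective for a single prompt's group of sampled outputs. This is not the authors' implementation: the tensor names, the use of sequence-level log-probabilities, and the hyperparameter defaults are illustrative assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_beta=0.04):
    """GRPO objective for one group of G sampled outputs (all tensors have shape (G,)).

    logp_new / logp_old / logp_ref: sequence log-probabilities under the current,
    old, and frozen reference policies. rewards: scalar reward per sampled output.
    """
    # Group-relative advantage: normalize rewards within the group (no critic needed).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current and old policy.
    ratio = torch.exp(logp_new - logp_old)

    # PPO-style clipped surrogate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.minimum(unclipped, clipped).mean()

    # Unbiased KL estimator against the reference policy, as in the GRPO formulation.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1.0).mean()

    # The objective is maximized; return its negative for a gradient-descent step.
    return -(surrogate - kl_beta * kl)
```

In a full trainer the advantage would be broadcast over each output's tokens and the loss averaged over many prompts per batch; the sketch keeps only the group-level logic that distinguishes GRPO from critic-based PPO.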

2.2. Emergent Behaviors: The “Aha Moment” Phenomenon

One of the fascinating outcomes of large-scale RL training on LLMs is the spontaneous emergence of complex, self-reflective behaviors. DeepSeek-R1-Zero shows that, with enough updates, the model starts to:

  • Extend its chain-of-thought length for difficult problems,
  • Re-evaluate steps if an early approach seems likely to fail,
  • Show an actual “aha moment,” where it steps back, spots mistakes, and corrects itself.

For experts used to conventional fine-tuning, it’s quite striking to see an LLM spontaneously “learn to think better” purely via RL signals. This finding alone points to major opportunities in RL-driven self-improvement.


3. DeepSeek-R1-Zero: Reinforcement Learning Without a Net

DeepSeek-R1-Zero starts from a base LLM and, crucially, does no supervised fine-tuning. The research team introduced:

  1. Accuracy Rewards: Checking if the model’s final answer is correct (for math, code, logic).
  2. Format Rewards: Incentivizing a structured chain-of-thought, e.g., wrapping the reasoning in <think> ... </think> tags (a toy version of both rewards is sketched after this list).
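Both signals are simple rule-based checks rather than a learned reward model. A toy combination might look like the following; the <think> tag convention comes from the paper, while the exact-match comparison and the weights are illustrative assumptions:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy rule-based reward: format bonus plus accuracy bonus."""
    reward = 0.0

    # Format reward: the chain of thought must sit inside <think> ... </think> tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1  # illustrative weight

    # Accuracy reward: compare whatever follows the closing tag to the gold answer.
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == gold_answer.strip():
        reward += 1.0

    return reward
```

In practice the accuracy check is task-specific: math answers are compared against a final answer given in a required format, and code is verified by compiling and running test cases.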

By optimizing these rewards, the model’s pass@1 on the AIME 2024 math benchmark skyrocketed from 15.6% to 71.0%—competitive with established top-tier models. Even more surprisingly, with majority-vote sampling, it reached 86.7%—overtaking OpenAI’s o1-0912 on the same dataset.
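The majority-vote (consensus) figure comes from sampling many answers per problem and keeping the most frequent final answer, which is simple to reproduce:

```python
from collections import Counter

def majority_vote(final_answers):
    """Consensus decoding: return the most common final answer among k samples."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. majority_vote(["42", "41", "42", "42"]) -> "42"
```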

Why it matters:

  • The model learned how to reason through a set of tasks with zero “handholding.”
  • The improvement trajectory suggests a self-discovery of problem-solving techniques (like reflection, verification, etc.) that many believed required curated data.

But there’s a drawback: the output was often tangled, mixing languages, lacking user-friendly structure, and occasionally showing bizarre rhetorical flourishes. Enter “cold-start” data for the next iteration.


4. DeepSeek-R1: Merging Cold Start with Large-Scale Reinforcement Learning

The next question was whether injecting a small supervised “cold-start” dataset (thousands of curated chain-of-thought samples) might fix the readability and language-mixing issues—and perhaps improve final performance. The team designed a multi-stage pipeline:

  1. Cold Start: Fine-tune a base model on a few thousand curated, human-friendly long CoTs.
  2. Reasoning-Focused RL: Scale up RL with math, coding, and logic tasks. This time, add language-consistency rewards to push the model into staying coherent in a single language.
  3. Rejection Sampling + SFT: Sample correct, well-structured chains-of-thought from the RL model, augment them with general-capabilities data (writing, Q&A, self-cognition), and train a new base checkpoint (a sketch of this filtering step follows the list).
  4. RL Across Scenarios: A second RL stage includes both reasoning tasks and general tasks for “helpfulness” and “harmlessness.”
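Stage 3 is essentially a filter over the RL checkpoint's own generations. A minimal sketch of that filtering step, assuming a hypothetical sample_fn that queries the RL model and a reward_fn like the rule-based check sketched earlier, might look like:

```python
def build_sft_dataset(problems, sample_fn, reward_fn, n_samples=16, threshold=1.0):
    """Rejection sampling: keep only correct, readable generations for the next SFT round."""
    dataset = []
    for prompt, gold in problems:
        candidates = [sample_fn(prompt) for _ in range(n_samples)]
        kept = [c for c in candidates if reward_fn(c, gold) >= threshold]
        if kept:
            # Prefer the shortest correct chain of thought as a crude readability proxy.
            dataset.append({"prompt": prompt, "response": min(kept, key=len)})
    return dataset
```

The actual pipeline additionally drops language-mixed or hard-to-read chains and mixes in non-reasoning data (writing, Q&A, self-cognition) before the next SFT round.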

Key Achievements:

  • The final model, DeepSeek-R1, now competes closely with OpenAI-o1-1217 on math and coding tasks.
  • It significantly improves upon its predecessor (DeepSeek-V3) in knowledge benchmarks such as MMLU and GPQA Diamond—especially in STEM-heavy topics.

Note: The synergy of minimal curated data + large-scale RL is a potent alternative to the heavy upfront SFT used by many leading LLM pipelines.


5. Distillation: Transferring Advanced Reasoning Patterns to Smaller Models

Why Distillation? Running large-scale RL on a model at DeepSeek-R1’s scale is expensive and often out of reach for smaller research labs or organizations. However, the final DeepSeek-R1 can generate a vast number of correct, well-reasoned solutions across a wide range of tasks. So the authors exploit a simple but powerful approach: fine-tune smaller dense models (1.5B, 7B, 8B, 14B, 32B, and 70B) directly on curated outputs from DeepSeek-R1.
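Concretely, distillation here is plain supervised fine-tuning of the student on teacher-generated sequences. A minimal training-step sketch, assuming an HF-style causal LM whose forward pass returns logits and batches whose prompt positions are masked with -100 in the labels, might look like:

```python
import torch.nn.functional as F

def distill_step(student, batch, optimizer):
    """One SFT step on teacher-generated (prompt, response) pairs."""
    logits = student(input_ids=batch["input_ids"]).logits  # (B, T, vocab)

    # Standard next-token prediction, shifted by one position; the loss only
    # covers the teacher's response tokens because prompt labels are set to -100.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Per the paper, the distilled students receive only this SFT stage, with no additional RL applied on top.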

Highlights:

  • Distilled Qwen-based 7B model beats some much larger open-source models on math and code tasks.
  • Distilled 14B sets new records on certain reasoning benchmarks—proving that, if you have a strong teacher, smaller dense students can replicate advanced reasoning with surprisingly high fidelity.

Takeaway: Reinforcement learning on smaller base models (like a 7B or 32B) from scratch simply cannot compete with distillation from a more capable teacher model. The smaller model, left to RL alone, plateaus much lower and at higher cost. Distillation emerges as the “secret weapon” to swiftly propagate advanced reasoning behaviors to new architectures or smaller footprints.


6. Pitfalls and Unsuccessful Attempts

The team also experimented with approaches that did not pan out:

  • Process Reward Models (PRM): It proved difficult to robustly define or train step-wise correctness signals at massive scale, and the learned reward model invited reward hacking.
  • Monte Carlo Tree Search (MCTS): Hierarchical solution exploration ran into a combinatorial explosion of the generation space and a fragile value model.

Neither method is necessarily doomed, but both proved too unwieldy in the large-scale RL context used for DeepSeek-R1.

For professionals considering internal RL pipelines, these experiences highlight the difficulty of applying search or step-wise reward systems to outputs as long and open-ended as LLM generations.


7. Broader Implications and Future Directions

7.1. General Capabilities vs. Specialized Reasoning

DeepSeek-R1 sometimes trails older siblings (like DeepSeek-V3) on complex dialogues, role-playing, or structured JSON outputs. How do we unify advanced chain-of-thought “brains” with full-fledged interactive features? The authors suggest the next wave of RL expansions could incorporate multi-turn tasks and advanced APIs directly into the chain-of-thought.

7.2. Language Mixing and Multi-lingual Support

DeepSeek-R1’s training optimizes specifically for English and Chinese, occasionally leading to “linguistic collisions.” Future expansions might incorporate fine-grained language-detection rewards or multi-lingual chain-of-thought alignment.
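The language-consistency reward added during the reasoning-focused RL stage is described only as the proportion of target-language words in the chain of thought, so a coarse stand-in for an English target could be as simple as the following (the regex heuristic is an illustrative assumption):

```python
import re

def language_consistency_reward(cot: str) -> float:
    """Fraction of word-like units in the CoT that are English words rather than CJK characters."""
    english_words = re.findall(r"[A-Za-z]+", cot)
    cjk_chars = re.findall(r"[\u4e00-\u9fff]", cot)
    total = len(english_words) + len(cjk_chars)
    return len(english_words) / total if total else 1.0
```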

7.3. Software Engineering Use Cases

While the coding results are strong, the authors note that engineering tasks requiring large contexts or specialized reasoning are still a big RL frontier. Speeding up the RL evaluation loop on code correctness is non-trivial but highly impactful. Asynchronous or more incremental reward mechanisms could be the next big leap.

7.4. Prompt Engineering Sensitivities

Unlike with older models, few-shot prompts tend to hurt DeepSeek-R1’s performance; leaner, zero-shot instructions work better. This is a curiosity for advanced users and worth exploring in your own environment if you adopt a chain-of-thought-based RL model.
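As an illustration, a lean zero-shot prompt in the spirit of this recommendation simply states the problem and the desired output format, with no worked examples; the exact wording below is an assumption, not the authors' template:

```python
# Zero-shot prompting: describe the problem and the output format directly.
prompt = (
    "Solve the problem below. Think step by step, then put your final answer "
    "on a new line beginning with 'Answer:'.\n\n"
    "Problem: If 3x + 7 = 22, what is x?"
)
```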


8. Concluding Thoughts

The DeepSeek-R1 family, in particular DeepSeek-R1-Zero, fundamentally proves that massive RL can organically nurture strong reasoning patterns—even without any supervised “crutch.” Yet, the final version of DeepSeek-R1 shows the practical synergy of a small curated dataset plus multi-stage RL to ensure both power and usability.

For experts researching LLM training pipelines, distillation from a thoroughly RL-optimized teacher is one of the most cost-effective ways to spread advanced reasoning across model sizes. At the same time, the experiences with reward hacking, MCTS complexities, and partial success with process-reward approaches are cautionary tales.

In short, DeepSeek-R1 is a milestone that invites us to rethink the role of reinforcement learning in shaping truly “intelligent” LLMs, and it underscores how a genuinely open AI company can contribute far more to the open-source community than OpenAI does.

千呼万唤始出来:DeepSeek-R1 如何通过强化学习实现复杂推理

大家都说,2024 年是「智能体」之年,而 2025 年注定是「强化学习」之年。DeepSeek-R1 就是最好的佐证!也凸显了真正 Open 的 AI 比 OpenAI 为 AGI 做出的更大的贡献。

1. 简介

自从 DeepSeek-V3 重磅发布以来,DeepSeek 一直都是大模型社区的耀眼明星。无数爱好者与专家都在翘首以盼 “DeepSeek-R1-Lite” 的开源预览版。如今,它终于在 2025 年的第一个月华丽登场,重新定义我们对 AI 推理的想象。DeepSeek-R1 颠覆了传统路线,采用大规模强化学习(RL)——有时甚至无需任何监督微调——来激发前所未有的推理能力,包括超长思维链(Chain-of-Thought, CoT)、自我反思、结果验证,甚至令人惊叹的“aha 时刻”。

在本篇文章中,我们将探讨 DeepSeek 家族的两大创新力作:

  • DeepSeek-R1-Zero:这款模型完全依赖强化学习来习得复杂推理技能,从未进行过任何监督微调,同时展现出了诸多有趣的特性,比如超长的思维链、自我检验以及自我纠错。
  • DeepSeek-R1:在 R1-Zero 的基础上引入了少量高质量“冷启动”数据,然后通过多轮强化学习和监督微调,使得输出更易读、更贴近用户需求,同时依然维持了在推理任务上的强悍表现。

我们将通过对比这些模型的训练策略和核心数学原理,来展示强化学习是如何彻底改变大模型推理能力的。

在本文中,你将看到:

  • DeepSeek-R1-Zero 如何在没有任何监督数据的条件下,依靠强化学习就能达到接近最先进推理水平的表现。
  • DeepSeek-R1 为什么能通过少量“冷启动”数据与多阶段强化学习与微调结合,实现更友好的输出格式并继续强化推理能力。
  • 蒸馏(Distillation) 技术如何将 DeepSeek-R1 发现的高阶推理模式迁移到更小的稠密模型上,打造功能强大的“迷你推理引擎”。
  • 针对大规模强化学习中各种方法的探索教训——包括哪些方案行之有效,哪些又在实验中出现了瓶颈。

把这篇博文当作一个技术“放大镜”,带你细看 DeepSeek-R1 项目那些最耀眼的进展和难以避免的挫折。


2. 动机与背景

2.1. 纯强化学习如何助力推理?

在大模型推理领域,大部分突破通常都依赖于大规模、精细标注的数据。然而 DeepSeek-R1 为这一常识带来了新的挑战。它的核心假设很简约,却不那么简单:我们能否只通过奖励信号来教会模型正确回答,从而让它自己摸索出最优的思考方式? 当我们完全取消监督微调(在 DeepSeek-R1-Zero 中),研究团队让模型只依赖强化学习奖励来探索并形成自己的思维链。

DeepSeek-R1-Zero 采用了 Group Relative Policy Optimization (GRPO) 算法,不需要与策略模型同规模的价值网络,大大节省了训练成本。GRPO 的关键更新公式如下:

\[\begin{aligned} \mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \Bigg( & \min \Bigg(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{\text{old}}}\left(o_{i} \mid q\right)} A_{i},\ \text{clip}\Big(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{\text{old}}}\left(o_{i} \mid q\right)},\ 1-\varepsilon,\ 1+\varepsilon\Big) A_{i}\Bigg) \\ & -\ \beta\ \mathbb{D}_{KL}\left(\pi_{\theta}\ \|\ \pi_{\text{ref}}\right)\Bigg) \Bigg] \end{aligned}\]

其中,每个样本的优势函数 (advantage) \(A_i\) 这样计算:

\[A_{i}=\frac{r_{i}-\text{mean}\left(\left\{r_{1}, r_{2}, \cdots, r_{G}\right\}\right)}{\text{std}\left(\left\{r_{1}, r_{2}, \cdots, r_{G}\right\}\right)}\]

这两条公式就是模型学习的数学核心:通过成组采样、对奖励进行标准化,DeepSeek-R1-Zero 在不依赖任何手动标注的情况下就能逐步完善自己的策略。

2.2. 自发性行为:当模型产生“aha 时刻”

大规模强化学习给 LLM 带来的最神奇的现象之一,莫过于其自动涌现的复杂且自我反思的行为。DeepSeek-R1-Zero 经过足够多的训练后,居然能:

  • 延长 处理复杂问题时的思维链;
  • 重评 解题思路,如果发现之前方法可能走不通,就会另辟蹊径;
  • 出现 真正的“aha 时刻”——模型会主动退回前面的推理步骤,找出并修正自己的错误。

对于那些习惯了传统监督微调的专家而言,眼见模型仅靠强化学习奖励就能“学会更好地思考”,着实令人惊艳。也因此,RL 赋予大模型自我进化的潜力,值得我们深入探索。


3. DeepSeek-R1-Zero:放手一搏的强化学习

DeepSeek-R1-Zero 是从基础大模型出发,完全不经过任何监督微调的数据集来训练的。研究团队主要引入了两类奖励信号:

  1. 准确度奖励 (Accuracy Rewards):根据模型是否在数学、编程或逻辑题上回答正确来打分。
  2. 格式奖励 (Format Rewards):鼓励生成具有固定格式,如 <think> ... </think> 这类更可读、更易于理解的思维链标记。

凭借这些奖励信号,DeepSeek-R1-Zero 在 AIME 2024 数学基准测试上的 pass@1 从 15.6% 飙升至 71.0%,达到与顶尖大模型不相上下的水平。更令人惊讶的是,借助多次投票(majority-vote),它竟然冲到了 86.7%,力压 OpenAI 的 o1-0912。

为什么这很重要?

  • 模型自学了如何应对各种任务,无需手把手式的监督标注。
  • 这个提升过程暗示了模型能自动摸索反思、验证等学习策略,而并不需要预先提供大样本数据。

然而,这也带来了一些问题: DeepSeek-R1-Zero 的输出可读性常常不佳,比如混合使用多种语言、格式混乱或出现奇怪的修饰。在这种情况下,引入“冷启动”数据就成了下一步的关键。


4. DeepSeek-R1:冷启动数据与大规模强化学习的融合

接下来的问题是:只要加一点点“冷启动”监督数据,能否解决可读性与语言混杂的问题,并且让模型在推理上继续精进?为此,研究团队制定了一个多阶段的训练流程:

  1. 冷启动 (Cold Start):先用少量高质量、人工精心整理的思维链数据对基础模型进行微调。
  2. 面向推理的强化学习:在数学、编程和逻辑任务上大规模强化学习。这一次,还加入了“语言一致性”奖励,强制模型用单一语言进行推理,避免中英文夹杂。
  3. 重采样 + 监督微调 (Rejection Sampling + SFT):对已经强化学习的模型进行重采样,筛选出正确且可读的思维链,再结合写作、问答、自我认知等通用场景数据,重训一个新的基线模型。
  4. 全场景强化学习:再一次强化学习,覆盖推理、可用性和安全性等多种场景,确保模型在“有用且无害”的同时还具备高水平推理。

成果亮点:

  • 最终版本 DeepSeek-R1 在数学和编程上可与 OpenAI-o1-1217 媲美。
  • 在知识类基准如 MMLU、GPQA Diamond 上表现优异,特别擅长 STEM 领域,超越之前的 DeepSeek-V3。

要点: 仅用少量人工优选数据加上大规模的 RL,就能替代不少此前需要的繁重监督微调工作——这或许会成为未来大模型训练的一种关键模式。


5. 蒸馏:把高阶推理能力传递给小模型

为什么要做蒸馏 (Distillation)? 对像 DeepSeek-R1 这样规模的模型做大规模强化学习,所需资源可不小,大多数实验室难以承担。好在完成训练后的 DeepSeek-R1 可以生成海量准确答案,为了让更多小模型也能拥有类似的推理“头脑”,研究团队采用了一个简单而高效的方法:把 DeepSeek-R1 生成的优质数据用于微调更小的稠密模型(1.5B、7B、8B、14B、32B、70B 等)。

实战结果:

  • 用 Qwen 系列做蒸馏后,7B 大小的模型竟能击败一些更大的开源模型,特别是在数学和代码推理上颇为亮眼。
  • 14B 蒸馏模型更是一举打破多项推理基准的记录,印证了“师父”够厉害,“徒弟”也能青出于蓝。

结论: 让小模型从零开始做大规模强化学习,往往难以企及大模型蒸馏而来的推理水平,并且成本更高。蒸馏因此成了一个高性价比的秘密武器,能快速把大型模型的思维精华移植到小模型上。


6. 踩过的坑与失败的尝试

研究团队在开发 DeepSeek-R1 的早期也做过多种努力,但并非都成功:

  • 过程奖励模型 (PRM):让模型在每个细小步骤都获得奖励,理论可行但在大规模训练中难以准确界定“一步”的正确性,也容易出现奖励欺骗(reward hacking)。
  • 蒙特卡洛树搜索 (MCTS):借鉴 AlphaGo / AlphaZero 的思路,试图在解题时分步搜索。可惜的是,生成空间在语言模型里基本无限大,很快就遭遇了指数级的复杂度和不稳定的价值评估。

这些方法并非一无是处,但在涉及超大规模 RL 训练时,实施细节远比预想复杂得多,也容易卡在训练效率的瓶颈上。


7. 更广泛的影响与未来方向

7.1. 通用能力 vs. 专项推理

DeepSeek-R1 在一些多回合对话、角色扮演和 JSON 输出等任务上,仍稍逊于 DeepSeek-V3。如何把强大的推理“脑”拓展到更复杂的交互场景,是下一步值得探索的议题。官方暗示或许可以尝试把多轮任务和高级 API 直接并入思维链当中。

7.2. 语言混杂与多语种支持

当前 DeepSeek-R1 的训练主要面向中英文,难免会在其他语言场景下出现“语言碰撞”。未来应该考虑更细粒度的语言检测与多语推理融合,使其在多语言环境下依旧保持高水平表现。

7.3. 软件工程场景

DeepSeek-R1 的编程推理表现已相当不错,但对更工程化的长代码理解和复杂代码管理还缺乏足够的大规模强化学习数据。要想在软件工程领域获得真正的“大脑级”工具,还需要更大规模的异步评测与自适应搜索机制,减少评测开销。

7.4. Prompt 工程敏感度

和许多旧式大模型不同,DeepSeek-R1 对 few-shot 提示 (few-shot prompt) 的反应往往不如零样本来得好。对那些有 Chain-of-Thought 需求的用户而言,这提示我们需要更谨慎地设计提示语,或者直接采用零样本思维链的方式。


8. 结语

DeepSeek-R1 系列(尤其是 DeepSeek-R1-Zero)向我们证明了:只要运用大规模的强化学习,模型就能自然而然地进化出强大的推理能力——甚至不需要任何人类标注。在此基础上,再用少量人工优选数据与多阶段 RL 相结合,便诞生了既能“深度思考”又能清晰表达的 DeepSeek-R1。

对于那些专注于大模型训练管线研究的专家来说,用一个经过强化学习打磨的强大教师模型来做蒸馏,是推广高阶推理能力到小模型的最快捷方式。而在这个过程中,团队也深刻体会到大型 RL 策略中容易出现的奖励欺骗、搜索爆炸等风险。

总而言之,DeepSeek-R1 显示出强化学习在塑造“真正智慧”大模型方面的巨大潜能,也凸显了真正 Open 的 AI 比 OpenAI 为 AGI 做出的更大的贡献。