DeepSeek's origin story is unlike any other major AI lab's. The company was founded in 2023 as a subsidiary of High-Flyer Capital Management, a quantitative hedge fund based in Hangzhou, China, co-founded by Liang Wenfeng. High-Flyer had been building its own AI infrastructure for trading since 2016 and had accumulated a substantial GPU cluster, reportedly around 10,000 NVIDIA A100 chips, before U.S. export controls in October 2022 cut off China's access to the most advanced AI hardware. Liang, who holds degrees in electronic information engineering from Zhejiang University, decided to pivot that infrastructure toward general-purpose AI research. Unlike the typical startup trajectory of raising venture capital and hiring celebrity researchers, DeepSeek was entirely self-funded by High-Flyer, gave few interviews, and let its papers speak for themselves. The team was young, drawn largely from top Chinese universities, and operated with minimal public profile.
DeepSeek's early releases were solid but didn't make major headlines. The original DeepSeek LLM (often retroactively called V1) and the DeepSeek Coder models showed competence without challenging the frontier. That changed dramatically with DeepSeek-V2 in May 2024, which introduced Multi-Head Latent Attention (MLA), a technique that compresses the key-value cache during inference, dramatically reducing memory requirements and cost. The model used a Mixture of Experts architecture with 236 billion total parameters but only 21 billion active per token, making it both powerful and cheap to run. DeepSeek priced its API at roughly 1/30th the cost of GPT-4, sending a shock through the industry. Then came DeepSeek-V3 in December 2024, which the team claimed was trained for approximately $5.5 million in compute costs, a figure that, if accurate, was an order of magnitude below what Western labs spent on comparable models. V3 used FP8 mixed-precision training, a multi-token prediction objective, and auxiliary-loss-free load balancing for its MoE layers, each a meaningful innovation in training efficiency.
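To make the MLA idea concrete, here is a minimal single-head sketch of the key-value compression in NumPy. The dimensions are illustrative rather than DeepSeek-V2's actual sizes, and the per-head structure and decoupled rotary-embedding path from the paper are omitted; every name here is hypothetical.

```python
import numpy as np

# Sketch of Multi-Head Latent Attention's KV compression (single head,
# illustrative sizes). Instead of caching full keys and values, only a
# shared low-rank latent vector per token is cached.
d_model, d_latent, seq_len = 4096, 512, 8

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_latent, d_model)) * 0.02  # down-projection
W_uk  = rng.standard_normal((d_model, d_latent)) * 0.02  # up-projection to keys
W_uv  = rng.standard_normal((d_model, d_latent)) * 0.02  # up-projection to values

hidden = rng.standard_normal((seq_len, d_model))  # token hidden states

# Standard attention caches K and V: 2 * seq_len * d_model floats per layer.
# MLA caches only the latent: seq_len * d_latent floats per layer.
latent_cache = hidden @ W_dkv.T                    # (seq_len, d_latent)

# Keys and values are reconstructed from the latent when attention runs
# (in the paper, the up-projections can be absorbed into adjacent weight
# matrices, so this step adds little inference cost).
K = latent_cache @ W_uk.T                          # (seq_len, d_model)
V = latent_cache @ W_uv.T                          # (seq_len, d_model)

full_cache = 2 * seq_len * d_model
mla_cache = seq_len * d_latent
print(f"KV-cache reduction: {full_cache / mla_cache:.0f}x")  # 16x at these sizes
```

At these toy sizes the cache shrinks 16x; the V2 paper reports cutting the KV cache by over 90% relative to standard multi-head attention, which is a large part of what made serving the model so cheap.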
DeepSeek-R1, released on January 20, 2025, was the moment the wider world took notice. R1 was a reasoning model in the mold of OpenAI's o1 — it could "think" through complex problems step by step before answering — and it matched or exceeded o1's performance on math, coding, and science benchmarks. The model was released as open weights under an MIT license. The impact was immediate and dramatic. On January 27, the day markets fully processed the implications, NVIDIA's stock dropped nearly 17% in a single session — the largest single-day market cap loss in U.S. history at the time — as investors recalculated whether the assumption that AI progress required ever-increasing GPU spending still held. The "DeepSeek shock" became a geopolitical event: if a Chinese lab could match frontier U.S. models despite being cut off from the latest hardware, what did that say about the effectiveness of export controls? And if training costs were plummeting, what happened to the business models of companies selling expensive AI infrastructure?
The technical story behind DeepSeek's efficiency is genuinely interesting and doesn't reduce to a single trick. The team made aggressive use of architectural innovations (MLA, DeepSeekMoE with fine-grained experts), training techniques (FP8 from the start of pre-training rather than just at inference, multi-token prediction, carefully tuned learning rate schedules), and infrastructure engineering (custom kernels, aggressive pipeline parallelism). For R1 specifically, they used a novel reinforcement learning approach: rather than relying on expensive human preference data as in RLHF, they applied Group Relative Policy Optimization (GRPO) to math and coding tasks with verifiable answers, letting the model discover chain-of-thought reasoning patterns largely on its own. A small "cold start" dataset of curated reasoning examples helped stabilize early training, but the core insight was that reasoning could emerge from RL with ground-truth verification rather than requiring massive human annotation. They also demonstrated distillation, training smaller models (1.5B, 7B, 8B, 14B, 32B, and 70B parameters) to mimic R1's reasoning chains, producing a family of efficient models that punched well above their size class.
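To illustrate the mechanism, here is a toy sketch of GRPO's group-relative advantage and a PPO-style clipped update in NumPy. This is a simplification under stated assumptions: binary verifiable rewards, made-up probability ratios, and no KL penalty against a reference model (which the actual method includes); the function names are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Score each completion against its own sampling group. No learned
    value network is needed, which is GRPO's main departure from PPO."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_surrogate(ratios: np.ndarray, adv: np.ndarray, eps: float = 0.2) -> float:
    """PPO-style clipped objective, applied with group-relative advantages."""
    return float(np.minimum(ratios * adv, np.clip(ratios, 1 - eps, 1 + eps) * adv).mean())

# Eight completions sampled for one math prompt; reward is 1 when the final
# answer matches ground truth, 0 otherwise (verifiable, no human labels).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
adv = group_relative_advantages(rewards)

# New-to-old policy probability ratios for the same completions (made up here).
ratios = np.array([1.1, 0.9, 1.0, 1.3, 0.8, 1.0, 1.2, 0.95])

print(adv.round(2))                    # correct completions get positive advantage
print(clipped_surrogate(ratios, adv))  # objective to maximize in the policy update
```

Because the baseline is simply the group's own mean reward, the approach scales to large volumes of automatically checkable math and code problems without a reward model trained on human preferences.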
DeepSeek cannot be understood outside the context of U.S.-China tech competition. The company's models comply with Chinese censorship requirements: ask about Tiananmen Square, Taiwan's independence, or Xi Jinping, and you'll get either a refusal or the Chinese government's official position. This is a legal requirement for any AI company operating in China, not a choice, but it limits the models' utility for users who need uncensored outputs (though the open weights mean others can fine-tune the censorship out). The U.S. export controls that restrict China's access to advanced GPUs are both an obstacle DeepSeek has worked around and, paradoxically, a spur that forced them toward the efficiency innovations that became their advantage. There are also open questions about DeepSeek's actual compute resources: some analysts have speculated that High-Flyer stockpiled more GPUs than publicly acknowledged before the export ban, and the $5.5 million training cost figure for V3, which the technical report itself scopes to the final training run, excludes prior research, ablation experiments, and infrastructure. Regardless, DeepSeek's achievements are real, their papers are detailed enough to support independent replication efforts, and they have fundamentally changed the conversation about what's required to build frontier AI.