Company

StepFun

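Also known as: Step models, multimodal AI
A Chinese AI startup building competitive large language and multimodal models. Its Step series performs strongly on international benchmarks, backed by substantial compute investment.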

Why It Matters

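StepFun shows that China's AI ecosystem can produce serious competitors from scratch, not only from within existing tech giants. Its Step models have consistently punched above their weight on international benchmarks, and its rapid expansion into multimodal and video generation shows that a well-organized startup can cover a broad range of capabilities with relatively modest resources. For the global AI market, StepFun is the kind of company that makes China's independent AI startup scene impossible to ignore: technically strong, internationally minded, and fast enough to keep larger competitors honest.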

Deep Dive

StepFun (officially Jieyue Xingchen, meaning "step toward the stars") was founded in 2023 by Jiang Daxin, a former senior researcher at Microsoft Research Asia. Jiang had spent years working on large-scale language models and multimodal systems before deciding the time was right to build an independent AI company in Shanghai. StepFun raised approximately $100 million in its early rounds from investors including Tencent, Sequoia China, and Zhongguancun Science City, enough to secure significant GPU resources in a market where compute was becoming increasingly scarce. From day one, the company aimed to build general-purpose foundation models that could compete internationally, not just within the Chinese domestic market. That ambition was unusual for a startup only months old, and StepFun backed it up with strong benchmark results.

The Step Model Family

StepFun's model lineup has evolved rapidly. The Step-1 series, released in stages throughout 2024, demonstrated that a well-resourced startup could match or exceed some models from much larger organizations. Step-1V, the company's vision-language model, posted competitive scores on multimodal benchmarks at a time when the field was still dominated by Google, OpenAI, and a handful of Chinese giants. Step-2, released later, pushed further into multi-step reasoning and tool use. What set StepFun apart was not any single breakthrough but consistency: each release showed genuine improvement, and the models performed well on both Chinese and English tasks, suggesting that the training data and methodology were thoughtfully assembled and that the gains did not come simply from throwing more compute at a bigger dataset. The company also released models on Hugging Face and through its own API, making them accessible to the international developer community.

Multimodal and Video Ambitions

While many Chinese AI startups focused initially on text-only language models, StepFun moved aggressively into multimodal territory. Its Step-1.5V and subsequent vision models could process images, charts, and documents alongside text, targeting the increasingly important area of visual reasoning. More recently, StepFun entered the video generation space with Step Video, joining a crowded but high-profile race alongside Kling, Vidu, and the various Hunyuan video models. The video work is notable because it demands a fundamentally different kind of infrastructure and expertise: temporal consistency, physics-aware generation, and the ability to handle long-form output. StepFun's willingness to tackle this alongside its core language model work suggests either extraordinary confidence or extraordinary ambition, possibly both.

Positioning in a Crowded Market

China's AI startup scene in 2023-2025 has been described as a "hundred model war," with dozens of companies burning billions of yuan chasing the same prize. StepFun's strategy has been to stay technically competitive while remaining lean relative to peers like Moonshot AI or Zhipu AI. The company has been less aggressive about consumer-facing products than some competitors, focusing instead on API access and developer tools. This is a bet that the real money in AI will flow through enterprise integration rather than chatbot subscriptions. It mirrors the approach of companies like Mistral in Europe, and it gives StepFun flexibility: the company can partner with larger firms for distribution while maintaining control over its core technology. The open question is whether a relatively young startup can sustain the compute investment needed to stay at the frontier as the cost of training runs escalates into the hundreds of millions of dollars.
