Companies

StepFun

Also known as: Step models, multimodal AI
Chinese AI startup building competitive large language and multimodal models. Their Step series has shown strong performance on international benchmarks, backed by significant compute investment.

Why it matters

StepFun shows that China's AI ecosystem can produce serious competitors from scratch, not just from existing tech giants. Their Step models consistently punch above their weight on international benchmarks, and their rapid expansion into multimodal and video generation shows that a well-organized startup can cover broad capability ground with far fewer resources than the largest labs. For the global AI market, StepFun represents the kind of company that makes it impossible to ignore China's independent AI startup scene: technically strong, internationally oriented, and moving fast enough to keep much larger competitors honest.

Deep Dive

StepFun (officially Jieyue Xingchen, meaning "step toward the stars") was founded in 2023 by Jiang Daxin, a former senior researcher at Microsoft Research Asia. Jiang had spent years working on large-scale language models and multimodal systems before deciding the time was right to build an independent AI company in Shanghai. StepFun raised approximately $100 million in its early rounds from investors including Tencent, Sequoia China, and Zhongguancun Science City — enough to secure significant GPU resources in a market where compute was becoming increasingly scarce. From day one, the company aimed to build general-purpose foundation models that could compete internationally, not just within the Chinese domestic market. That ambition was unusual for a startup barely months old, but StepFun backed it up with surprisingly strong benchmark results.

The Step Model Family

StepFun's model lineup has evolved rapidly. The Step-1 series, released in stages throughout 2024, demonstrated that a well-resourced startup could match or exceed some models from much larger organizations. Step-1V, their vision-language model, posted competitive scores on multimodal benchmarks at a time when the field was still dominated by Google, OpenAI, and a handful of Chinese giants. Step-2, released later, pushed further into multi-step reasoning and tool use. What set StepFun apart was not any single breakthrough but consistency: each release showed genuine improvement, and the models performed well on both Chinese and English tasks, which suggests the training data and methodology were thoughtfully assembled and that the gains did not come simply from throwing more compute at a bigger dataset. The company also released models on Hugging Face and through their own API, making them accessible to the international developer community.

Multimodal and Video Ambitions

While many Chinese AI startups focused initially on text-only language models, StepFun moved aggressively into multimodal territory. Their Step-1.5V and subsequent vision models could process images, charts, and documents alongside text, targeting the increasingly important niche of visual reasoning. More recently, StepFun entered the video generation space with Step Video, joining a crowded but high-profile race alongside Kling, Vidu, and the various Hunyuan video models. The video work is notable because it requires a fundamentally different kind of infrastructure and expertise — temporal consistency, physics-aware generation, and the ability to handle long-form output. StepFun's willingness to tackle this alongside their core language model work suggests either extraordinary confidence or extraordinary ambition, possibly both.

Positioning in a Crowded Market

China's AI startup scene from 2023 to 2025 has been described as a "hundred model war," with dozens of companies burning billions of yuan chasing the same prize. StepFun's strategy has been to stay technically competitive while remaining lean relative to peers like Moonshot AI or Zhipu AI. The company has been less aggressive about consumer-facing products than some competitors, focusing instead on API access and developer tools. That focus is a bet that the real money in AI will flow through enterprise integration rather than chatbot subscriptions. This mirrors the approach of companies like Mistral in Europe, and it gives StepFun flexibility: they can partner with larger companies for distribution while maintaining control over their core technology. The question is whether a relatively young startup can sustain the compute investment needed to stay at the frontier as the cost of training runs escalates into the hundreds of millions of dollars.
