
AI Video Generation: What Works and What Doesn’t

AI video is the hottest category in generative AI right now. It’s also the most overhyped. Here’s the honest truth about what these models can actually do, what they can’t, and how to get usable results without burning your budget.
Pierre-Marcel De Mussac & Sarah Chen · March 19, 2026 · 12 min read

Let’s get one thing out of the way: AI video generation is genuinely impressive in 2026. The demos are jaw-dropping. The Twitter clips look like magic. And then you actually try to use these models for real work and discover the gap between “cherry-picked demo” and “reliable production tool.”

We’ve integrated every major video model into Zubnet and generated thousands of clips across all of them. This guide is what we wish someone had told us before we started.

The Uncomfortable Truth First

Expect 3–5 generations to get one good result.

AI video is not deterministic. The same prompt, same model, same settings will produce different results every time. Some will be stunning. Some will have a character with six fingers walking through a wall. This is normal. Budget for multiple attempts — not because the models are bad, but because video generation is inherently probabilistic and the quality variance is high.
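In practice, that means treating generation as batch-and-pick rather than a single call. Here is a minimal Python sketch of that rhythm; generate_clip is a hypothetical stand-in for whatever video API you actually use, not a real client:

```python
# Batch-and-pick: generate several candidates, keep the best one.
# generate_clip() is a hypothetical placeholder; swap in a real API
# call. It returns a fake path here so the sketch runs end to end.

def generate_clip(model: str, prompt: str, seed: int) -> str:
    # Vary the seed per attempt so each run samples a different
    # point in the model's output space.
    return f"{model}-seed{seed}.mp4"

def generate_candidates(model: str, prompt: str, attempts: int = 4) -> list[str]:
    # Budget 3-5 attempts per usable clip, then review side by side.
    return [generate_clip(model, prompt, seed=i) for i in range(attempts)]

candidates = generate_candidates(
    model="kling-2.6-pro",
    prompt="A golden retriever running through a field of sunflowers at sunset",
)
# Watch all of them, keep the best, discard the rest.
```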

With that said, the models available today are genuinely useful if you understand their strengths, limitations, and when to use each one.

The Six Models That Matter

Veo 3.1 — Benchmark Quality, Native Audio

Google’s Veo 3.1 produces the highest-quality output of any video model available today. The motion is natural, the physics are mostly correct, and the visual fidelity is stunning. It also generates synchronized audio natively — footsteps on gravel actually sound like footsteps on gravel, which is a first.

The catch: It’s slow. Expect 2–4 minutes per generation. And at premium pricing, iterating gets expensive fast. Veo 3.1 is the model you use for the final output, not for experimentation.

Best for: Final-quality clips, presentations, social media content where quality matters more than speed or budget.

Kling 2.6 Pro — The Daily Driver

If Veo 3.1 is the sports car you take out on weekends, Kling 2.6 Pro is the daily driver. It has the best motion quality in the industry — camera movement feels intentional, objects move with realistic weight and momentum, and character motion is fluid. It’s also faster and cheaper than Veo.

Kling is where we send most of our users, and it’s the model with the highest satisfaction rate. The results are consistently good — not always perfect, but the variance is lower than most competitors.

Best for: Regular video generation, social media content, prototyping, image-to-video. The best balance of quality, speed, and cost.

Runway Gen-4 — Consistent and Professional

Runway has been in the AI video space longer than anyone, and Gen-4 reflects that maturity. It’s the most consistent model — you’re less likely to get a bizarre artifact or a physics-defying glitch. The output feels professional, even if it doesn’t always reach Veo’s peak quality.

Runway also has the best understanding of cinematic language. Ask for a “slow dolly push into a subject with shallow depth of field” and it actually knows what that means. Other models interpret camera instructions loosely; Runway takes them seriously.

Best for: Professional content, corporate video, anything where consistency matters more than peak quality. Great for clients who can’t afford to see a weird result.

Luma Ray 3 — The Artist

Every model has a personality, and Luma Ray 3’s is artistic. It produces clips with a unique aesthetic — slightly dreamlike lighting, painterly motion, a visual quality that feels more like cinema than video. It’s not trying to be photorealistic; it’s trying to be beautiful.

Best for: Creative projects, music videos, artistic content, mood pieces. When you want the video to have a distinctive look rather than documentary realism.

Hailuo 2.3 — The Value Pick

Hailuo (from MiniMax in China) is the model nobody talks about but everyone should try. The quality is surprisingly good for the price — it’s one of the cheapest options available, and the results consistently land in “good enough for social media” territory. It handles text-to-video well and generates quickly.

Best for: High-volume content creation, social media, testing concepts before committing to a premium model. The budget-friendly workhorse.

Sora 2 — Long-Form Narrative

OpenAI’s Sora 2 differentiates itself with duration. While most models cap at 5–10 seconds, Sora can generate longer clips with narrative coherence — a character walks into a room, sits down, picks up a cup. The story holds together across the full duration.

Best for: Longer narrative clips, storytelling, scenes that require sustained action across multiple seconds without cuts.

Pricing Reality

Model           Cost/Second   Cost per 5s Clip   Generation Time
Veo 3.1         $0.35         $1.75              2–4 min
Kling 2.6 Pro   $0.14         $0.70              30–90 sec
Runway Gen-4    $0.20         $1.00              45–120 sec
Luma Ray 3      $0.16         $0.80              30–60 sec
Hailuo 2.3      $0.08         $0.40              30–60 sec
Sora 2          $0.25         $1.25              1–3 min

Remember the 3–5 generations rule. A single “good” 5-second Veo clip realistically costs $5–$9 when you factor in the attempts that don’t work out. A good Hailuo clip costs $1–$2. This is why model choice matters — not just for quality, but for your budget.
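The arithmetic is worth automating when you plan a project. A quick sketch using the prices from the table above:

```python
# Effective cost per usable clip = price/sec x clip length x attempts.
# Prices from the table above; assume 3-5 attempts per keeper.

PRICE_PER_SECOND = {
    "veo-3.1": 0.35,
    "kling-2.6-pro": 0.14,
    "runway-gen-4": 0.20,
    "luma-ray-3": 0.16,
    "hailuo-2.3": 0.08,
    "sora-2": 0.25,
}

def cost_per_usable_clip(model: str, seconds: int = 5,
                         attempts: tuple[int, int] = (3, 5)) -> tuple[float, float]:
    base = PRICE_PER_SECOND[model] * seconds
    return base * attempts[0], base * attempts[1]

low, high = cost_per_usable_clip("veo-3.1")
print(f"Veo 3.1: ${low:.2f}-${high:.2f} per usable 5s clip")    # $5.25-$8.75
low, high = cost_per_usable_clip("hailuo-2.3")
print(f"Hailuo 2.3: ${low:.2f}-${high:.2f} per usable 5s clip") # $1.20-$2.00
```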

Text-to-Video vs. Image-to-Video

This is the single most important decision you’ll make, and most beginners get it wrong.

Text-to-Video (T2V)

You describe what you want in words: “A golden retriever running through a field of sunflowers at sunset.” The model generates everything from scratch — the dog, the sunflowers, the lighting, the camera angle.

Pros: Maximum creative freedom. Quick to start. No source material needed.

Cons: Less control over the exact look. The dog might not look how you imagined. The sunflowers might be the wrong shade of yellow. You’re at the mercy of the model’s interpretation.

Image-to-Video (I2V)

You provide a starting image (either one you generated with an AI image model or a real photo) and the model animates it. The golden retriever looks exactly like the image you provided and then starts running.

Pros: Much more control. The visual style, subject, and composition are locked in by your source image. Fewer surprising results.

Cons: Requires a good starting image. Extra step in the workflow.

Our recommendation: Start with image-to-video.

Generate your starting frame with an image model (FLUX 2 Pro or Imagen 4), get it exactly right, then animate it. This two-step workflow gives you dramatically more control over the final result and wastes fewer video generations on results that “looked different than what I imagined.”
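Here is that two-step pipeline as a sketch. generate_image and animate_image are hypothetical stand-ins for whatever image and video endpoints you use; the model names are just the ones recommended in this guide:

```python
# Two-step workflow: lock the look in a still first, then spend
# video generations only on motion.

def generate_image(model: str, prompt: str) -> str:
    # Placeholder for an image API call (e.g. a FLUX or Imagen endpoint).
    return f"{model}-frame.png"

def animate_image(model: str, image_path: str, motion_prompt: str) -> str:
    # Placeholder for an image-to-video API call.
    return f"{model}-clip.mp4"

# Step 1: iterate on the still until it is exactly right. Image
# generations cost a fraction of video generations.
frame = generate_image(
    model="flux-2-pro",
    prompt="A golden retriever standing in a field of sunflowers at sunset, "
           "warm backlight, shallow depth of field",
)

# Step 2: animate the approved frame. Subject, style, and composition
# are locked in, so the prompt only needs to describe motion.
clip = animate_image(
    model="kling-2.6-pro",
    image_path=frame,
    motion_prompt="The dog starts running toward the camera, slow tracking shot",
)
```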

What AI Video Still Can’t Do Well

Honesty matters more than hype. Here’s what these models still struggle with in 2026:

Hands and fingers. Better than a year ago, but still the most common artifact. Characters may gain or lose fingers mid-clip. Watch for it.

Text and signage. Just like image models, video models can’t render readable text reliably. A storefront sign will be gibberish. Plan for it.

Physics consistency. Water pours upward. Objects pass through each other. Gravity works differently in different parts of the frame. Every model has physics glitches — some just hide them better.

Long duration. Most models cap at 5–10 seconds. Extending beyond that requires stitching clips together, which introduces consistency problems between segments. Sora 2 handles longer clips better than most, but even it has limits.

Precise control. You can’t say “move the camera exactly 30 degrees to the right over 3 seconds.” You can say “slow pan right” and hope the model interprets it reasonably. This is a description medium, not a control medium.

Practical Tips That Save Money and Frustration

1. Use Hailuo for drafts, premium models for finals. Generate your first few attempts with Hailuo at $0.08/sec. Once you’ve nailed the prompt and know what works, switch to Kling or Veo for the polished version (see the sketch at the end of this section).

2. Keep prompts focused. “A woman walks into a coffee shop, orders a latte, sits down, and opens her laptop” is four actions. That’s too many for a 5-second clip. Pick one: “A woman walks into a warmly lit coffee shop, the camera tracking her from behind.”

3. Specify camera movement. “Static shot,” “slow push in,” “orbit around subject,” “tracking shot following subject.” Without camera instructions, the model will choose randomly, and you might get jarring or inappropriate movement.

4. Describe the mood, not just the content. “Cinematic, moody, low-key lighting” produces dramatically different results than the same scene described as “bright, cheerful, natural daylight.”

The workflow that works: Generate a still image first (FLUX or Imagen). Perfect the look. Then feed that image to Kling or Veo for animation. This image-to-video approach cuts your iteration cycles in half and gives you far more control over the final result.
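Here is tip 1 as a sketch, reusing the hypothetical generate_clip placeholder from earlier; the prompt also bakes in tips 2 through 4:

```python
# Draft cheap, finalize premium. generate_clip() is the same
# hypothetical placeholder as in the earlier sketch.

def generate_clip(model: str, prompt: str, seed: int) -> str:
    return f"{model}-seed{seed}.mp4"  # placeholder for a real API call

prompt = (
    "A woman walks into a warmly lit coffee shop, "  # one action, not four
    "slow tracking shot from behind, "               # explicit camera move
    "cinematic, moody, low-key lighting"             # mood, not just content
)

# Iterate on the prompt with the cheapest model first.
drafts = [generate_clip("hailuo-2.3", prompt, seed=i) for i in range(4)]
# Roughly 4 x $0.40 = $1.60 (at 5 seconds each) to learn whether the
# prompt works at all.

# Once the prompt is proven, spend premium money once.
final = generate_clip("kling-2.6-pro", prompt, seed=0)
```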

Where This Is Going

AI video is moving faster than any other category in generative AI. A year ago, 3-second clips with wobbly motion were state of the art. Today we have native audio, 10-second clips with coherent physics, and models that understand cinematic language. A year from now, the limitations we listed above will likely be halved.

But it’s not a replacement for traditional video production — not yet. It’s a complement. A way to prototype scenes before shooting them. A way to create B-roll that would cost thousands to film. A way to visualize ideas that exist only in your head.

The creators who thrive with AI video are the ones who understand it as a probabilistic creative tool, not a deterministic production pipeline. Generate, evaluate, iterate. That’s the rhythm.


Every model in this guide was tested, and every price verified, on Zubnet, where you can access all of them through one platform with per-second pricing and no subscriptions. No lock-in, no credits that expire; just pay for what you generate.

Pierre-Marcel De Mussac & Sarah Chen
Zubnet · March 19, 2026