Founder Notes

The Consistency Problem Is Solved. Now What Is the Real Bottleneck?

By Usama Hassan · 6 min read

I have been directing AI film and brand work full-time since the moment it was technically possible to do so, and I want to make a precise observation about where the craft actually is in the spring of 2026. The character consistency problem is functionally solved. Runway Gen-4 holds character identity, wardrobe, and environment across cuts. Grok's persistent digital DNA feature keeps character height, eye colour, even cloth folds consistent. Veo 4's persistent world-state memory carries spatial geometry across long sequences. The thing the entire industry was complaining about a year ago is no longer the thing.

Most of the discourse has not caught up. People are still writing think pieces about the consistency problem. They are fighting the last war. The actual bottlenecks have moved, and they have moved to places the public conversation is barely touching.

Here are the three I deal with every working day, ranked by how much they currently cost me in real production.

Bottleneck one: spatial blocking across multi-character scenes

If you have two characters in a scene, current models can render them with consistent identity. They cannot reliably stage them in space relative to each other across a cut. Character A standing left of character B in shot one will, with disturbing frequency, appear right of character B in shot two, even with explicit positional language in the prompt. For a single-character beauty shot this does not matter. For a dialogue exchange, it destroys the scene.

We hit this so hard at Komodo X that we built a dedicated skill for it. Our spatial-blocking system flags any shot where two or more characters are in relative motion, or where camera and character motion happen simultaneously. For those shots, we author a top-down blocking diagram, sometimes a quick Blender low-poly render, sometimes a hand-drawn overhead sketch, and we feed that spatial logic back into the prompt. It is the difference between a scene that holds and a scene that confuses the audience without them quite knowing why.
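
To make that concrete, here is a minimal sketch of the flagging logic in Python. The Shot fields, the trigger conditions, and the prompt wording are all illustrative, not the production XON code.

```python
# Illustrative sketch only; field names and prompt phrasing are invented.
from dataclasses import dataclass, field

@dataclass
class Shot:
    shot_id: str
    characters: list[str]
    moving_characters: list[str] = field(default_factory=list)
    camera_moves: bool = False
    # Spatial logic lifted from the top-down blocking diagram,
    # e.g. {"AMARA": "frame left", "DEV": "frame right"}.
    blocking: dict[str, str] = field(default_factory=dict)

def needs_blocking_pass(shot: Shot) -> bool:
    """Flag shots where spatial drift is likely: two or more characters
    in relative motion, or camera and character motion at once."""
    relative_motion = len(shot.moving_characters) >= 2
    camera_plus_character = shot.camera_moves and bool(shot.moving_characters)
    return relative_motion or camera_plus_character

def blocking_clause(shot: Shot) -> str:
    """Render the diagram's spatial logic back into explicit prompt text."""
    positions = ", ".join(f"{name} stays {pos}" for name, pos in shot.blocking.items())
    return f"Maintain screen positions across the cut: {positions}."
```

The point is not the code. The point is that the spatial logic is authored once, upstream, and flows into every prompt mechanically rather than being retyped by hand.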

If your studio is producing AI work without a dedicated spatial blocking layer in the pipeline, this is the bottleneck eating your output quality, and you may not have realised it yet.

Bottleneck two: editorial pacing inside the 15-second clip window

Most current engines max out at 15 seconds of stable generation per clip. Some go longer with quality compromises. The reality is that working production happens in 5- to 15-second beats, stitched together in post.
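
That stitching is mechanical enough to automate. A toy version of the packing step, assuming the 15-second ceiling above and that a beat is never split across clips:

```python
# Hypothetical sketch: group consecutive scene beats into generation
# windows that fit the stable-clip ceiling described above.
MAX_CLIP_SECONDS = 15.0

def pack_beats(beat_durations: list[float]) -> list[list[float]]:
    """Greedily pack beats into clips no longer than the window.
    A beat longer than the window gets a clip to itself."""
    clips: list[list[float]] = [[]]
    used = 0.0
    for beat in beat_durations:
        if clips[-1] and used + beat > MAX_CLIP_SECONDS:
            clips.append([])
            used = 0.0
        clips[-1].append(beat)
        used += beat
    return clips

# pack_beats([4.0, 6.0, 7.0, 3.0]) -> [[4.0, 6.0], [7.0, 3.0]]
```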

Within that window, none of the engines understand temporal pacing. A held silence. A delayed reaction. A cut-on-a-look. These are the granular textures of cinematic dialogue, and the model has no idea you want them unless you specify them, beat by beat, in the prompt.

The fix for this is not a better engine. It is a pacing layer between the screenplay and the shotlist. We tag every dialogue line and action beat with explicit pacing tokens. Beat. Pause. Cut-on-silence. Reaction-hold. Each shot in the shotlist gets coverage-grammar notes that tell the model who is holding the scene, who is reacting, when to cut. That metadata flows into the prompt, and the resulting clip respects the temporal logic of the scene.
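
A stripped-down version of how one tagged beat might compile into that metadata, with invented field names and phrasing rather than the production schema:

```python
# Sketch only: the token names mirror the ones in the text; the
# ScriptBeat type and the note wording are illustrative.
from dataclasses import dataclass
from enum import Enum

class Pacing(Enum):
    BEAT = "beat"
    PAUSE = "pause"
    CUT_ON_SILENCE = "cut-on-silence"
    REACTION_HOLD = "reaction-hold"

@dataclass
class ScriptBeat:
    line: str                  # dialogue line or action beat
    holder: str                # who is holding the scene
    reactor: str | None        # who is reacting, if anyone
    pacing: Pacing
    hold_seconds: float = 0.0  # how long to sit before the cut

def coverage_notes(beat: ScriptBeat) -> str:
    """Compile one beat into the coverage-grammar note that rides
    along in the prompt."""
    note = f"{beat.holder} holds the frame"
    if beat.reactor:
        note += f"; cut to {beat.reactor}'s reaction"
    if beat.pacing is Pacing.CUT_ON_SILENCE:
        note += "; cut on the silence after the line"
    elif beat.hold_seconds:
        note += f"; hold {beat.hold_seconds:.1f}s before the cut"
    return note
```

The engine still does the rendering. The pacing layer just stops it from guessing at the rhythm.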

"Same engine. Same shot. Different brief architecture. Completely different feel."
Usama Hassan

Bottleneck three: audio-visual sync at scene scale

Veo 4 ships with native Foley generation. Sora 2 handles sound effects. ElevenLabs sound effects integrate cleanly into post. The individual components are mature.

What is not mature is the sync between visual rhythm and audio rhythm at the scale of a full scene. A footstep that lands a frame late. A door close that anticipates the visual by twenty milliseconds. A dialogue line whose breath pattern does not match the character's chest movement. Audiences cannot articulate any of this consciously, but they feel it as wrongness, and they leave.

Right now there is no engine that handles this well. Our pipeline solution is to treat audio as a separate authoring layer, not as a generation output. We brief our audio team independently, with a beat-accurate brief that mirrors the pacing manifest from the visual layer. The cost is real: we are doing what a feature film does at a fraction of the budget, and it is a labour-intensive layer that the discourse around AI film barely acknowledges exists.
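
Conceptually, the audio brief is the visual pacing manifest mirrored onto a shared timecode. A minimal sketch, with invented names:

```python
# Hypothetical sketch: every visual beat becomes an audio cue on the
# same timeline, so Foley and dialogue land on the frame, not near it.
from dataclasses import dataclass

@dataclass
class VisualBeat:
    label: str      # e.g. "door closes", "character exhales"
    start: float    # seconds from scene start
    duration: float

@dataclass
class AudioCue:
    label: str
    start: float    # same clock as the visual beat
    duration: float
    note: str

def audio_brief(manifest: list[VisualBeat]) -> list[AudioCue]:
    """Mirror the visual pacing manifest as a beat-accurate audio brief."""
    return [
        AudioCue(
            label=b.label,
            start=b.start,
            duration=b.duration,
            note=f"Sync to visual beat '{b.label}' at {b.start:.2f}s; no drift.",
        )
        for b in manifest
    ]
```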

What this means in practice

If you are running an AI studio in 2026, your pipeline maturity is not measured by which engine you are using. It is measured by how much editorial intelligence sits between the screenplay and the prompt. Spatial blocking. Pacing tokens. Audio briefs that mirror visual pacing. These are the unglamorous, expensive, defensible layers that separate a studio that ships work from a studio that ships demos.

The consistency problem was solved by the engines. The next set of problems will not be. They will be solved by the studios that take craft infrastructure seriously, and lost by the studios that keep waiting for the next model to fix them instead.

It will not.

Usama Hassan is the AI Creative Director at Komodo X, where he leads visual direction across the studio's brand and original IP work, and architects the production craft layers of the XON pipeline.