In short:
Slightly adjusting user-provided boxes to align with internal attention maps significantly improves the quality of generations.
Abstract
With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular form of control is through bounding boxes or layouts. However, enforcing adherence to these control inputs remains an open problem. In this work, we show that slightly adjusting user-provided bounding boxes improves both the quality of generations and the adherence to the control inputs. We achieve this by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model, while carefully balancing the focus on foreground and background. In a sense, we move the bounding boxes to places the model is familiar with. Surprisingly, we find that even small modifications can change the quality of generations significantly. To enable this optimization, we propose a smooth mask that makes the bounding box position differentiable, along with an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method.
Qualitative video generations
T2V-Turbo Variants
The jellyfish is swimming
The turtle is swimming
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
The surgeonfish is swimming
The tiger is walking
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
Trailblazer Variants
The firebrat insect is moving
Peekaboo
Trailblazer
Our method
The horse is running
Peekaboo
Trailblazer
Our method
The whooper swan is flying
Peekaboo
Trailblazer
Our method
The jellyfish is swimming
Peekaboo
Trailblazer
Our method
User box in blue | Optimized box in orange
Visualizing attention map evolution
The interactive slider for the first frame shows how the internal attention maps evolve as we apply our differentiable editing and box optimization. Our method performs a smooth, differentiable edit.
More qualitative video generations -- T2V-Turbo
The hyena is walking
Trailblazer on T2V-Turbo
Our method w/o Box Opt.
Our boxes + Trailblazer on T2V-Turbo
Our method on T2V-Turbo
The otter is walking
Trailblazer on T2V-Turbo
Our method w/o Box Opt.
Our boxes + Trailblazer on T2V-Turbo
Our method on T2V-Turbo
User box in blue | Optimized box in orange
Morphing task
Trailblazer
Our boxes + Trailblazer
Our optimized boxes provide better control in the morphing task.
More qualitative video generations -- Trailblazer
The big headed ant is moving
Trailblazer
Our method w/o Box Opt.
Our boxes + Trailblazer
Our method
The marine iguana is walking
Trailblazer
Our method w/o Box Opt.
Our boxes + Trailblazer
Our method
The leopard is walking
Trailblazer
Our method w/o Box Opt.
Our boxes + Trailblazer
Our method
User box in blue | Optimized box in orange
Complex patterns
Stationary to move
U-turn
Zigzag
Our method preserves complex patterns and difficult motions.
Failure cases
The orca is swimming
Trailblazer
Our Method
Both methods fail to generate the desired output for "orca", likely because the base video model has no knowledge of the target concept.
The eagle is walking
Trailblazer
Our Method
We hypothesize a strong token relationship between "eagle" and the US national flag; our method emphasizes the latter.