Making Video Models Adhere to User Intent with Minor Adjustments

TMLR 2026

Daniel Ajisafe1, Eric Hedlin1, Helge Rhodin2,1, Kwang Moo Yi1

1The University of British Columbia   2Bielefeld University




In short: Slightly adjusting user-provided boxes to align with internal attention maps significantly improves the quality of generations.


Teaser figure.



Abstract

With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular means of control is through bounding boxes or layouts. However, enforcing adherence to these control inputs remains an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model, while carefully balancing the focus on foreground and background. In a sense, we move the bounding boxes to places the model is familiar with. Surprisingly, we find that even small modifications can change the quality of generations significantly. To enable this optimization, we propose a smooth mask that makes the bounding box position differentiable, together with an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method.
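The core idea of the abstract can be sketched in a few lines: a sigmoid-based soft box mask makes the box corners differentiable, and a simple objective maximizes attention mass inside the box while penalizing attention left outside it. This is an illustrative sketch only, not the authors' implementation; all names (`smooth_box_mask`, `attention_objective`, the sharpness `tau`, and the toy attention map) are our own assumptions.

```python
# Hypothetical sketch of differentiable box adjustment via a smooth mask
# and an attention-maximization objective. Not the authors' code.
import torch

def smooth_box_mask(box, H, W, tau=10.0):
    """box = (x0, y0, x1, y1) in [0, 1]; returns an HxW soft mask."""
    ys = torch.linspace(0, 1, H).view(H, 1)
    xs = torch.linspace(0, 1, W).view(1, W)
    x0, y0, x1, y1 = box
    # Product of sigmoids: ~1 inside the box, ~0 outside; tau sets edge sharpness.
    mx = torch.sigmoid(tau * (xs - x0)) * torch.sigmoid(tau * (x1 - xs))
    my = torch.sigmoid(tau * (ys - y0)) * torch.sigmoid(tau * (y1 - ys))
    return my * mx

def attention_objective(attn, box, lam=1.0):
    """Reward attention mass inside the soft box, penalize mass outside."""
    m = smooth_box_mask(box, *attn.shape)
    fg = (attn * m).sum()          # foreground: attention covered by the box
    bg = (attn * (1 - m)).sum()    # background: attention the box misses
    return fg - lam * bg

# Gradient-based adjustment of a user box toward a toy "attention" blob.
attn = torch.zeros(32, 32)
attn[8:20, 10:24] = 1.0
box = torch.tensor([0.1, 0.1, 0.5, 0.5], requires_grad=True)  # user box
opt = torch.optim.Adam([box], lr=0.02)
for _ in range(300):
    opt.zero_grad()
    loss = -attention_objective(attn, box)  # ascend the objective
    loss.backward()
    opt.step()
```

Because the mask is a smooth function of the corner coordinates, gradients of the attention objective flow directly into the box parameters, so the box can be nudged toward regions the model already attends to.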




Qualitative video generations

T2V-Turbo Variants
The jellyfish is swimming The turtle is swimming
Trailblazer on T2V-Turbo Our method on T2V-Turbo Trailblazer on T2V-Turbo Our method on T2V-Turbo
The surgeonfish is swimming The tiger is walking
Trailblazer on T2V-Turbo Our method on T2V-Turbo Trailblazer on T2V-Turbo Our method on T2V-Turbo
Trailblazer Variants
The firebrat insect is moving
Peekaboo Trailblazer Our method
The horse is running
Peekaboo Trailblazer Our method
The whooper swan is flying
Peekaboo Trailblazer Our method
The jellyfish is swimming
Peekaboo Trailblazer Our method
User box in blue | Optimized box in orange





Visualizing attention map evolution

The interactive slider (first frame shown) illustrates how the internal attention maps evolve as we apply our differentiable editing and box optimization. Our method performs a smooth, differentiable edit.





More qualitative video generations -- T2V-Turbo

The hyena is walking
Trailblazer on T2V-Turbo Our method w/o Box Opt. Our boxes + Trailblazer on T2V-Turbo Our method on T2V-Turbo
The otter is walking
Trailblazer on T2V-Turbo Our method w/o Box Opt. Our boxes + Trailblazer on T2V-Turbo Our method on T2V-Turbo
User box in blue | Optimized box in orange





Morphing task

Trailblazer Our boxes + Trailblazer
Our optimized boxes provide better control in the morphing task.





More qualitative video generations -- Trailblazer

The big headed ant is moving
Trailblazer Our method w/o Box Opt. Our boxes + Trailblazer Our method
The marine iguana is walking
Trailblazer Our method w/o Box Opt. Our boxes + Trailblazer Our method
The leopard is walking
Trailblazer Our method w/o Box Opt. Our boxes + Trailblazer Our method
User box in blue | Optimized box in orange





Complex patterns

Stationary to move
U-turn
Zigzag
Our method preserves complex patterns and difficult motions.



Failure cases

The orca is swimming
Trailblazer Our Method
Both methods fail to generate the desired output for "orca", as the base video model likely lacks knowledge of the target prompt.
The eagle is walking
Trailblazer Our Method
We hypothesize a strong token-level association between "eagle" and the US national flag; our method emphasizes the latter.