In short:
Slightly adjusting user-provided boxes to align with internal attention maps significantly improves the quality of generations.
Abstract
With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular form of control is through bounding boxes or layouts. However, enforcing adherence to these control inputs remains an open problem. In this work, we show that slightly adjusting user-provided bounding boxes improves both the quality of generations and the adherence to the control inputs. We achieve this by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model, while carefully balancing the focus on foreground and background. In a sense, we move the bounding boxes to places the model is familiar with. Surprisingly, we find that even small modifications can change the quality of generations significantly. To enable this optimization, we propose a smooth mask that makes the bounding box position differentiable, along with an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method.
Qualitative video generations
T2V-Turbo Variants
The jellyfish is swimming
The turtle is swimming
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
The surgeonfish is swimming
The tiger is walking
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
Trailblazer on T2V-Turbo
Our method on T2V-Turbo
Trailblazer Variants
The firebrat insect is moving
Peekaboo
Trailblazer
Our method
The horse is running
Peekaboo
Trailblazer
Our method
The whooper swan is flying
Peekaboo
Trailblazer
Our method
The jellyfish is swimming
Peekaboo
Trailblazer
Our method
User box in blue | Optimized box in orange
Visualizing attention map evolution
The interactive slider for the first frame shows how the internal attention maps evolve as we apply our differentiable editing and box optimization. Our method performs a smooth, differentiable edit.
More qualitative video generations -- T2V-Turbo
The hyena is walking
Trailblazer on T2V-Turbo
Our method w/o Box Opt.
Our boxes + Trailblazer on T2V-Turbo
Our method on T2V-Turbo
The otter is walking
Trailblazer on T2V-Turbo
Our method w/o Box Opt.
Our boxes + Trailblazer on T2V-Turbo
Our method on T2V-Turbo
User box in blue | Optimized box in orange
Morphing task
Trailblazer
Our boxes + Trailblazer
Our optimized boxes provide better control in the morphing task.
More qualitative video generations -- Trailblazer
The big headed ant is moving
Trailblazer
Our method w/o Box Opt.
Our boxes + Trailblazer
Our method
The marine iguana is walking
Trailblazer
Our method w/o Box Opt.
Our boxes + Trailblazer
Our method
The leopard is walking
Trailblazer
Our method w/o Box Opt.
Our boxes + Trailblazer
Our method
User box in blue | Optimized box in orange
Complex patterns
Stationary to move
U-turn
Zigzag
Our method preserves complex patterns and difficult motions.
Failure cases
The orca is swimming
Trailblazer
Our Method
Both methods fail to generate the desired output for "orca", likely because the base video model has no knowledge of the target concept.
The eagle is walking
Trailblazer
Our Method
We hypothesize a strong token relationship between "eagle" and the US national flag; our method emphasizes the latter.