Unsupervised learning of keypoints and landmarks has seen significant progress with the help of modern neural network architectures, but performance has yet to match that of supervised counterparts, leaving their practicability in question. We leverage the emergent knowledge within text-to-image diffusion models to obtain more robust unsupervised keypoints. Our core idea is to find text embeddings that cause the generative model to consistently attend to compact regions in images (i.e., keypoints). To do so, we simply optimize the text embedding such that the cross-attention maps within the denoising network are localized as Gaussians with small standard deviations. We validate our performance on multiple datasets: CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m. We achieve significantly improved accuracy, sometimes even outperforming supervised methods, particularly for data that is non-aligned and less curated.
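To make the objective concrete, below is a minimal sketch of a localization loss of the kind described above: each token's cross-attention map is pushed toward a compact Gaussian centered at the map's own peak. The function names, the sigma value, and the commented training step are illustrative assumptions, not the authors' implementation; in particular, how the attention maps are extracted from the frozen denoising UNet is abstracted away.

```python
import torch
import torch.nn.functional as F

def gaussian_target(attn, sigma=0.1):
    """Build a 2D Gaussian centered at the argmax of each attention map.

    attn:  (K, H, W) cross-attention maps, one per learned token.
    sigma: standard deviation in normalized [0, 1] coordinates
           (hypothetical value; the paper only states it is small).
    """
    K, H, W = attn.shape
    flat_idx = attn.flatten(1).argmax(dim=1)        # (K,) peak per map
    cy = (flat_idx // W).float() / (H - 1)          # normalized row of peak
    cx = (flat_idx % W).float() / (W - 1)           # normalized col of peak
    ys = torch.linspace(0, 1, H, device=attn.device)
    xs = torch.linspace(0, 1, W, device=attn.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")  # (H, W) coordinate grids
    d2 = (yy[None] - cy[:, None, None]) ** 2 + (xx[None] - cx[:, None, None]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))        # (K, H, W) Gaussian targets

def localization_loss(attn):
    """Encourage each attention map to match a compact Gaussian at its peak."""
    target = gaussian_target(attn).detach()  # target depends on attn's peak only
    return F.mse_loss(attn, target)

# Hypothetical training step: `cross_attention_maps` stands in for reading
# the cross-attention of a frozen pretrained text-to-image UNet, evaluated
# with the learnable token embeddings `text_emb` (the only optimized weights).
#
#   text_emb = torch.nn.Parameter(torch.randn(K, D))
#   opt = torch.optim.Adam([text_emb], lr=1e-3)
#   attn = cross_attention_maps(unet, image, text_emb)   # (K, H, W)
#   loss = localization_loss(attn)
#   loss.backward(); opt.step(); opt.zero_grad()
```

Note that only the text embeddings receive gradients; the diffusion model itself stays frozen, which is what lets the keypoints inherit its emergent knowledge.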
This interactive slider shows the progression of our training process for a given token on randomly chosen images. Move the slider to see how the attention map evolves during training.
We show additional qualitative examples of our unsupervised keypoints for each dataset. Each keypoint should consistently correspond to the same semantic location.
In this study, we take tokens learned on one dataset and apply them to a different dataset. To illustrate how well our approach generalizes, we compare it with Autolink, the existing benchmark in unsupervised keypoint estimation. Whereas our method transfers its learned tokens directly, Autolink must apply a learned model outside of its training distribution. The top row shows the results of our method, while the bottom row shows the outcomes achieved with Autolink.
@article{hedlin2023keypoints,
title={Unsupervised Keypoints from Pretrained Diffusion Models},
author={Hedlin, Eric and Sharma, Gopal and Mahajan, Shweta and He, Xingzhe and Isack, Hossam and Kar, Abhishek and Rhodin, Helge and Tagliasacchi, Andrea and Yi, Kwang Moo},
journal={arXiv preprint arXiv:2312.00065},
year={2023}
}