Unsupervised learning of keypoints and landmarks has seen significant progress with the help of modern neural network architectures, but performance has yet to match that of supervised counterparts, leaving their practicability in question. We leverage the emergent knowledge within text-to-image diffusion models to obtain more robust unsupervised keypoints. Our core idea is to find text embeddings that cause the generative model to consistently attend to compact regions in images (i.e., keypoints). To do so, we simply optimize the text embedding such that the cross-attention maps within the denoising network are localized as Gaussians with small standard deviations. We validate our performance on multiple datasets: CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m. We achieve significantly improved accuracy, sometimes even outperforming supervised methods, particularly for data that is non-aligned and less curated.
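To make the objective concrete, below is a minimal sketch of a localization loss of the kind described above: each token's cross-attention map is pushed toward a compact Gaussian centered at the map's own peak. The function names, the sigma value, and the commented training step are illustrative assumptions, not the authors' implementation; in particular, how the attention maps are extracted from the frozen denoising UNet is abstracted away.

```python
import torch
import torch.nn.functional as F

def gaussian_target(attn, sigma=0.1):
    """Build a 2D Gaussian centered at the argmax of each attention map.

    attn:  (K, H, W) cross-attention maps, one per learned token.
    sigma: standard deviation in normalized [0, 1] coordinates
           (hypothetical value; the paper only states it is small).
    """
    K, H, W = attn.shape
    flat_idx = attn.flatten(1).argmax(dim=1)        # (K,) peak per map
    cy = (flat_idx // W).float() / (H - 1)          # normalized row of peak
    cx = (flat_idx % W).float() / (W - 1)           # normalized col of peak
    ys = torch.linspace(0, 1, H, device=attn.device)
    xs = torch.linspace(0, 1, W, device=attn.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")  # (H, W) coordinate grids
    d2 = (yy[None] - cy[:, None, None]) ** 2 + (xx[None] - cx[:, None, None]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))        # (K, H, W) Gaussian targets

def localization_loss(attn):
    """Encourage each attention map to match a compact Gaussian at its peak."""
    target = gaussian_target(attn).detach()  # target depends on attn's peak only
    return F.mse_loss(attn, target)

# Hypothetical training step: `cross_attention_maps` stands in for reading
# the cross-attention of a frozen pretrained text-to-image UNet, evaluated
# with the learnable token embeddings `text_emb` (the only optimized weights).
#
#   text_emb = torch.nn.Parameter(torch.randn(K, D))
#   opt = torch.optim.Adam([text_emb], lr=1e-3)
#   attn = cross_attention_maps(unet, image, text_emb)   # (K, H, W)
#   loss = localization_loss(attn)
#   loss.backward(); opt.step(); opt.zero_grad()
```

Note that only the text embeddings receive gradients; the diffusion model itself stays frozen, which is what lets the keypoints inherit its emergent knowledge.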
This interactive slider shows the progression of our training process for a given token on randomly chosen images. Move the slider to see how the attention map evolves during training.
We show additional qualitative examples of our unsupervised keypoints for each dataset. Each keypoint should consistently correspond to the same semantic location.
In this study, we take tokens learned on one dataset and apply them to a different dataset. To illustrate how well our approach generalizes, we compare it with Autolink, the existing benchmark in unsupervised keypoint estimation. Whereas our method transfers its learned tokens directly, Autolink must apply a learned model outside of its training distribution. The top row shows the results of our method, while the bottom row shows the outcomes achieved with Autolink.
@article{hedlin2023keypoints,
title={Unsupervised Keypoints from Pretrained Diffusion Models},
author={Hedlin, Eric and Sharma, Gopal and Mahajan, Shweta and He, Xingzhe and Isack, Hossam and Kar, Abhishek and Rhodin, Helge and Tagliasacchi, Andrea and Yi, Kwang Moo},
journal={arXiv preprint arXiv:2312.00065},
year={2023}
}