Unsupervised Semantic Correspondence Using Stable Diffusion

1University of British Columbia, 2Vector Institute for AI, 3Google, 4Simon Fraser University, 5University of Toronto

NeurIPS 2023

Teaser – An image and a text embedding are input into Stable Diffusion. The text embedding is optimized so that its attention map highlights a specific part of the source image; the optimized embedding is then applied to a target image. The argmax of these attention maps identifies semantic correspondences between the two images.

Abstract

Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences -- locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so, we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative on SPair-71k) any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k datasets.

Method – Given a source image and a query point, we optimize the embeddings so that the attention map for the denoising step at time t highlights the query location in the source image. During inference, we input the target image and reuse the embeddings for the same denoising step t, determining the corresponding point in the target image as the argmax of the attention map. The architecture mapping images to attention maps is a pre-trained Stable Diffusion model, which is kept frozen throughout the entire process.
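To make the two phases concrete, here is a minimal PyTorch sketch. The helper get_attention_map, the Gaussian target, the embedding shape, and the hyperparameters are illustrative assumptions rather than the paper's exact implementation; the one thing the sketch does reflect is that only the prompt embedding is trained while the Stable Diffusion UNet stays frozen.

import torch
import torch.nn.functional as F

# Hypothetical helper (an assumption, not an actual diffusers API): runs
# denoising step t of a frozen Stable Diffusion UNet on the image latents,
# conditioned on `emb`, and returns the cross-attention map between image
# and embedding, upsampled to (H, W) and normalized.
# def get_attention_map(unet, latents, emb, t) -> torch.Tensor

def gaussian_target(h, w, x, y, sigma=4.0):
    """Soft target map peaked at the query location (x, y)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def optimize_embedding(unet, src_latents, query_xy, t, steps=100, lr=1e-3):
    """Phase 1: fit an embedding whose attention map peaks at query_xy."""
    emb = torch.randn(1, 77, 768, requires_grad=True)  # CLIP-shaped tokens
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        attn = get_attention_map(unet, src_latents, emb, t)  # (H, W)
        target = gaussian_target(*attn.shape, *query_xy)
        loss = F.mse_loss(attn, target)
        opt.zero_grad()
        loss.backward()  # gradients flow through the frozen UNet into emb
        opt.step()
    return emb.detach()

def find_correspondence(unet, tgt_latents, emb, t):
    """Phase 2: reuse the embedding on the target image; take the argmax."""
    with torch.no_grad():
        attn = get_attention_map(unet, tgt_latents, emb, t)
    idx = int(attn.flatten().argmax())
    y, x = divmod(idx, attn.shape[1])
    return x, y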

Example successful, mixed, and failure cases are shown across the three datasets. Interestingly, even the failure cases tend to map to reasonable regions of the image.

Attention maps for both correct and incorrect correspondences across SPair-71k, PF-Willow and CUB-200 are visualized for the indicated layers. Source and target locations are displayed as yellow stars on the source and target images, respectively.

Concurrent Work

There's a lot of excellent work that was introduced around the same time as ours.

Diffusion Hyperfeatures consolidates multi-scale and multi-timestep feature maps from Stable Diffusion into per-pixel feature descriptors with a lightweight aggregation network.

A Tale of Two Features introduces a fusion approach that capitalizes on the distinct properties of Stable Diffusion (SD) features and DINOv2 by extracting per-pixel features from each.
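As a rough sketch of that fusion idea (the general recipe, not that paper's exact one): L2-normalize the per-pixel features from each model, concatenate them, and match by cosine similarity. The shapes, the weighting alpha, and the assumption that both feature maps are already extracted at a common resolution are all illustrative.

import torch
import torch.nn.functional as F

def fuse_features(sd_feats, dino_feats, alpha=0.5):
    """Concatenate L2-normalized SD and DINOv2 feature maps, each (C, H, W).
    alpha weighs the two feature spaces; 0.5 treats them equally."""
    sd = F.normalize(sd_feats, dim=0) * alpha
    dino = F.normalize(dino_feats, dim=0) * (1 - alpha)
    return torch.cat([sd, dino], dim=0)

def match(src_fused, tgt_fused, x, y):
    """Nearest-neighbor match of source pixel (x, y) in the target map."""
    _, _, w = tgt_fused.shape
    query = F.normalize(src_fused[:, y, x], dim=0)  # (C,)
    sims = torch.einsum("c,chw->hw", query, F.normalize(tgt_fused, dim=0))
    idx = int(sims.flatten().argmax())
    ty, tx = divmod(idx, w)
    return tx, ty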

Emergent Correspondence from Image Diffusion extracts per-pixel features from Stable Diffusion and establishes correspondences by matching them directly across images.

BibTeX

@article{hedlin2023unsupervised,
  title={Unsupervised Semantic Correspondence Using Stable Diffusion},
  author={Eric Hedlin and Gopal Sharma and Shweta Mahajan and Hossam Isack and Abhishek Kar and Andrea Tagliasacchi and Kwang Moo Yi},
  year={2023},
  eprint={2305.15581},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}