Semantic Information Encoded in Diffusion Models

Event details
Date | 17.07.2023
Hour | 09:30 – 11:30
Speaker | Shuangqi Li
Location |
Category | Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Alexandre Alahi
Thesis advisor: Prof. Mathieu Salzmann
Co-advisor: Prof. Sabine Süsstrunk
Co-examiner: Prof. Volkan Cevher
Abstract
Diffusion models have achieved phenomenal success thanks to their ability to generate high-quality images conditioned on text prompts. However, to enable more control and flexibility in the generation process, we delve into the network architecture of diffusion models and investigate how the semantic information of generated images is stored and how it can be modified. In this paper, we begin by introducing Denoising Diffusion Probabilistic Models (DDPM; Ho et al., 2020), which form the backbone of state-of-the-art diffusion models. We then present the idea of using the h-space as a semantic latent space in diffusion models, as proposed by Kwon et al. (2022). Next, we present Prompt-to-Prompt (Hertz et al., 2022), which edits the interpretable cross-attention maps throughout the generation process of text-conditional diffusion models. Furthermore, we identify several challenges related to semantic control in diffusion models such as Stable Diffusion. Finally, we propose a method for editing self-attention maps and a method for guiding attention via large language models, and showcase preliminary results.
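To make the first building block of the talk concrete, below is a minimal sketch of one DDPM reverse (denoising) step, following Ho et al. (2020). Here `eps_model` and `betas` are placeholder names for a trained noise-prediction network and its variance schedule; this is an illustration, not code from any of the papers.

```python
import torch

def ddpm_reverse_step(x_t, t, eps_model, betas):
    """Sample x_{t-1} from x_t using the learned noise predictor."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    eps = eps_model(x_t, t)                            # predicted noise eps_theta(x_t, t)
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps) / torch.sqrt(alphas[t])  # mean of p(x_{t-1} | x_t)

    if t > 0:
        z = torch.randn_like(x_t)
        return mean + torch.sqrt(betas[t]) * z         # sigma_t^2 = beta_t variant
    return mean                                        # final step is deterministic
```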
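The h-space idea of Kwon et al. (2022) can be caricatured in a few lines: semantic edits are applied by shifting the U-Net's bottleneck activation along a fixed direction at every denoising step. This is a loose sketch of the idea only (the full Asyrp method modifies the reverse process asymmetrically); `encoder`, `decoder`, and `delta_h` are hypothetical placeholders.

```python
def denoise_with_h_edit(x_t, t, encoder, decoder, delta_h, scale=1.0):
    """One denoising step with the bottleneck feature shifted in h-space."""
    h_t, skips = encoder(x_t, t)     # U-Net down path to the bottleneck
    h_t = h_t + scale * delta_h      # move along a semantic direction
    return decoder(h_t, skips, t)    # U-Net up path -> predicted noise
```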
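Prompt-to-Prompt editing hinges on reusing cross-attention maps: the maps recorded while generating with the source prompt are injected when denoising with the edited prompt, so the layout is preserved while the content changes. A hedged sketch of such an attention layer is shown below; it is illustrative and not the authors' implementation.

```python
import torch

def cross_attention(q, k, v, stored_maps=None, inject=False):
    """Scaled dot-product cross-attention with optional map injection."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    if inject and stored_maps is not None:
        attn = stored_maps           # reuse maps saved from the source prompt
    return attn @ v, attn            # return maps so the caller can store them
```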
Background papers
Denoising Diffusion Probabilistic Models (Ho et al., 2020)
Diffusion Models Already Have a Semantic Latent Space (Kwon et al., 2022)
Prompt-to-Prompt Image Editing with Cross-Attention Control (Hertz et al., 2022)
Practical information
- General public
- Free
Contact
- edic@epfl.ch