Abstract
Text-to-image diffusion models have demonstrated a remarkable ability to generate photorealistic images from natural language prompts. These high-resolution, language-guided synthetic images are valuable for disease explainability and for exploring causal relationships. However, their potential for disentangling and controlling latent factors of variation in specialized domains such as medical imaging remains under-explored. In this work, we present the first investigation of the power of pre-trained vision-language foundation models, once fine-tuned on medical image datasets, to perform latent disentanglement for factorized medical image generation and interpolation. Through extensive experiments on chest X-ray and skin datasets, we illustrate that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as a patient's anatomical structures or disease diagnostic features. We devise a framework to identify, isolate, and manipulate key attributes through latent space trajectory traversal of generative models, facilitating precise control over medical image synthesis.
Introduction
Deep learning models for medical imaging have shown state-of-the-art performance across several tasks, such as disease classification, image segmentation, drug discovery, and high-resolution image synthesis. However, these models often struggle to generalize to new, unseen data due to domain shifts or entanglement of image features. For example, in medical imaging, multiple factors, such as disease pathology and imaging modality, are often intertwined, making it difficult for the model to isolate the features relevant to the specified task. This entanglement can lead the model to rely on spurious correlations that may not be present in new data, hindering its ability to generalize. Disentanglement therefore aims to separate the various factors of variation in the data, allowing the model to learn more robust and interpretable representations. In computer vision, disentanglement models have been shown to improve generalization and explainability by isolating task-relevant features from confounding factors, enabling the model to adapt to new data distributions. In addition, disentanglement improves model explainability by separating the factors of variation and providing a clearer understanding of how different features contribute to the final prediction.
Vision-Language Foundation Models (VLFMs) have emerged as a powerful approach for learning disentangled representations, offering several advantages over traditional generative architectures due in large part to the enormous datasets on which they are trained. Earlier approaches designed to disentangle latent factors relied on specialized generative architectures such as Variational Autoencoders (VAEs), which do not produce the high-resolution images required for medical imaging, or Generative Adversarial Networks (GANs), which are notoriously hard to train. Normalizing Flows have been adapted to address disentanglement in simpler contexts but are computationally prohibitive.
Other challenges with traditional models include the difficulty of training them end-to-end and their reliance on specialized architectural components for conditioning, such as AdaIN, FiLM, or SPADE. Furthermore, these models often require specific heuristics to permit traversals in latent space. In contrast, vision-language foundation models offer more efficient and scalable solutions. The power of these foundation models to disentangle latent representations, enabling targeted image modifications while preserving semantic content and other attributes, has been explored in the natural imaging domain. However, this capability remains under-explored for medical images, where complex entanglement is common and challenging to address.
In this paper, we investigate the disentanglement capabilities of Stable Diffusion fine-tuned on medical images and propose the first method to traverse its latent space under language guidance, observing the factorizing effect on the resulting images. Language guidance proves effective at identifying and isolating attributes of interest in the latent space. Additionally, interpolations between samples yield continuous trajectories that can be sampled. Experiments illustrate that the sampled images exhibit the same disentangled properties: the attribute of interest is preserved throughout the trajectory and becomes more prevalent in proportion to the distance from the starting point. We propose a new metric, Classifier Flip Rate along a Trajectory (CFRT), to validate the presence of the desired disentanglement.
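To make the trajectory sampling and the flip-rate idea concrete, the sketch below shows one possible reading in PyTorch: latents are interpolated spherically between two endpoints, each interpolated latent is decoded into an image, and an attribute classifier is queried at every step. The `decode_fn` and `classifier` callables, the choice of spherical interpolation, and the exact flip-rate formula are illustrative assumptions, not the paper's implementation.

import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two latent codes."""
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos_omega = torch.clamp(
        torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm()),
        -1.0, 1.0,
    )
    omega = torch.acos(cos_omega)
    if omega.abs() < 1e-6:                      # nearly parallel: fall back to linear interpolation
        return (1.0 - t) * z0 + t * z1
    sin_omega = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / sin_omega) * z0 + \
           (torch.sin(t * omega) / sin_omega) * z1

@torch.no_grad()
def classifier_flip_rate(z_start, z_end, decode_fn, classifier, num_steps=10):
    """Fraction of trajectory points whose predicted attribute differs from the
    prediction at the starting latent (one possible reading of CFRT).
    `decode_fn`: latent -> image tensor (e.g., a fine-tuned diffusion decoder).
    `classifier`: image tensor -> class logits for the attribute of interest."""
    start_label = classifier(decode_fn(z_start)).argmax(dim=-1)
    flips = 0
    ts = torch.linspace(0.0, 1.0, num_steps)
    for t in ts[1:]:                            # skip the starting point itself
        image = decode_fn(slerp(z_start, z_end, float(t)))
        if classifier(image).argmax(dim=-1) != start_label:
            flips += 1
    return flips / (num_steps - 1)

Under this interpretation, a high flip rate along a trajectory that manipulates a target attribute, together with a low flip rate for classifiers of attributes that should remain fixed, would indicate the desired disentanglement.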
Conclusion
Vision-language foundation models have rich latent representations that can be leveraged in medical imaging, where data are limited. In this paper, we presented the first text-guided traversal along a non-linear trajectory in the disentangled latent space of vision-language foundation models fine-tuned on medical images. The qualitative and quantitative results demonstrate that our method enables precise control over the synthesized images (including the interpolated images), allowing for targeted manipulation of visual attributes while preserving content. Future work will investigate ways to impose more structure and compositionality on the latent spaces.
BibTeX
@inproceedings{tehraninasab2025language,
  title     = {Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation},
  author    = {TehraniNasab, Zahra and Kumar, Amar and Arbel, Tal},
  booktitle = {Proceedings of the Mechanistic Interpretability Workshop (MIV) at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  address   = {Nashville, USA},
  month     = {June},
  note      = {Proceedings Track}
}