Abstract
Medical image synthesis presents unique challenges due to the inherent complexity and high-resolution detail required in clinical contexts. Traditional generative architectures such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) have shown great promise for high-resolution image generation but struggle to preserve the fine-grained details that are key for accurate diagnosis. To address this issue, we introduce Pixel Perfect MegaMed, the first vision-language foundation model to synthesize medical images at a resolution of 1024 × 1024. Our method deploys a multi-scale transformer architecture designed specifically for ultra-high-resolution medical image generation, preserving both global anatomical context and local image-level detail. By leveraging vision-language alignment techniques tailored to medical terminology and imaging modalities, Pixel Perfect MegaMed bridges the gap between textual descriptions and visual representations at unprecedented resolution levels. We apply our model to the CheXpert dataset and demonstrate its ability to generate clinically faithful chest X-rays from text prompts. Beyond visual quality, these high-resolution synthetic images prove valuable for downstream tasks such as classification, showing measurable performance gains when used for data augmentation, particularly in low-data regimes.
Method
Key Contributions
- Ultra-high-resolution synthesis: First VLM to generate medical images at 1024×1024 (and up to 2048×2048) with fine clinical detail.
- Efficient fine-tuning: Adapts SDXL with LoRA on medical prompts for scalable, domain-specific generation (see the sketch after this list).
- Improves downstream tasks: Boosts classification performance in low-data settings through high-res synthetic augmentation.
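
To make the fine-tuning setup concrete, the sketch below freezes all pretrained SDXL components and injects LoRA adapters into the UNet's attention projections. It is a minimal illustration assuming the Hugging Face diffusers and peft libraries; the checkpoint name, LoRA rank, and target modules are illustrative choices, not the paper's exact configuration.

# Minimal sketch: LoRA adaptation of SDXL for medical text-to-image fine-tuning.
# Assumes `diffusers` and `peft`; hyperparameters are illustrative.
import torch
from diffusers import StableDiffusionXLPipeline
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)

# Freeze every pretrained component; only the LoRA matrices will be trained.
pipe.unet.requires_grad_(False)
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.text_encoder_2.requires_grad_(False)

# Inject low-rank adapters into the UNet's attention projections.
lora_config = LoraConfig(
    r=8,                      # rank: illustrative, not the paper's value
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)

trainable = [p for p in pipe.unet.parameters() if p.requires_grad]
print(f"trainable LoRA parameters: {sum(p.numel() for p in trainable):,}")

# The usual denoising objective (MSE between predicted and added noise) over
# medical image-caption pairs would then update only these adapter weights.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

Because only the low-rank adapter matrices receive gradients, the memory and compute cost is a small fraction of full fine-tuning while the pretrained SDXL weights remain untouched.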

Results



Conclusion
In this work, we presented a framework for synthesizing ultra-high-resolution medical images by fine-tuning SDXL using low-rank adaptation (LoRA) and incorporating a progressive upscaling module. By optimizing only a lightweight set of parameters, our approach efficiently learns clinically meaningful concepts from textual prompts while preserving the expressive power of large-scale pretrained models. The integration of progressive upscaling, implemented as an iterative 'upsample–diffuse–denoise' process with skip residuals and dilated sampling, enables the generation of anatomically coherent images at high resolutions. Through both quantitative metrics and downstream classification tasks, we demonstrate that the synthesized images not only exhibit high perceptual quality but also serve as valuable assets for data augmentation, improving generalization to clinical datasets. A primary limitation of our model is the tendency to hallucinate fine-grained structures when scaling to extreme resolutions (e.g., beyond 2048 × 2048), a known issue in progressive upscaling approaches where artificial detail may be introduced during denoising. Our method offers a scalable and accessible pathway for generating high-resolution medical images that can be leveraged to improve model explainability and support robustness testing.
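
To illustrate the progressive upscaling idea, the sketch below approximates one 'upsample–diffuse–denoise' pass per scale with an SDXL image-to-image pipeline: the current image is upsampled, partially re-noised, and denoised again under the same prompt. This is a simplified stand-in assuming the Hugging Face diffusers library; the skip residuals and dilated sampling described above are omitted, and the resolutions, noise strengths, and prompt are illustrative only.

# Minimal sketch of an iterative upsample-diffuse-denoise loop (simplified;
# no skip residuals or dilated sampling). Assumes `diffusers` and a GPU.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# Reuse the same weights for the image-to-image refinement passes.
img2img = StableDiffusionXLImg2ImgPipeline(**base.components)

prompt = "frontal chest X-ray, mild cardiomegaly, no pleural effusion"  # illustrative prompt

# Stage 0: synthesize at the model's native 1024 x 1024 resolution.
image = base(prompt=prompt, height=1024, width=1024).images[0]

# Progressive stages: upsample the current image, partially re-noise it
# ("diffuse"), then denoise again under the same prompt.
for size, strength in [(1536, 0.45), (2048, 0.35)]:
    image = image.resize((size, size))                                  # upsample
    image = img2img(prompt=prompt, image=image, strength=strength).images[0]  # diffuse + denoise

image.save("synthetic_cxr_2048.png")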
BibTeX
@article{tehraninasab2025pixelperfect,
  title   = {Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images},
  author  = {TehraniNasab, Zahra and Kumar, Amar and Arbel, Tal},
  journal = {arXiv preprint arXiv:2507.12698},
  year    = {2025}
}