Local Customized Image Editing Algorithm Based on Diffusion Model Fine-Tuning
Introduction
Recent advancements in diffusion models and large-scale multimodal models have significantly propelled the field of AI-generated content (AIGC), particularly in applications such as image generation, AI painting, and artistic creation. In computer vision, image editing remains a core research challenge, requiring algorithms to accurately segment editing regions, generate coherent content, and maintain consistency between edited and non-edited areas. Diffusion models, with their powerful generative capabilities, have become increasingly popular for image editing tasks. However, existing methods often struggle with flexible control over editing regions and generating personalized content.
This paper introduces a Local Customized Diffusion (LCDiffusion) algorithm that addresses these limitations by leveraging fine-tuned diffusion models for precise, user-guided image editing. The proposed method combines concept embedding learning, parameter-efficient fine-tuning, and localized region selection to enable high-quality, customizable image editing while preserving non-edited regions.
Background and Related Work
Text-to-Image Diffusion Models
Diffusion models have emerged as a leading approach for high-quality image synthesis. Unlike generative adversarial networks (GANs), diffusion models progressively denoise random noise to generate images, offering better stability and detail preservation. Stable Diffusion (SD), a latent diffusion model (LDM), operates in a compressed latent space to reduce computational costs while maintaining high-resolution outputs.
Text-to-image diffusion models, such as Imagen and Stable Diffusion, use text prompts to guide image generation. These models rely on cross-attention mechanisms to align text embeddings with visual features, enabling semantic control over generated content. However, standard diffusion models lack the ability to incorporate custom concepts—specific objects or styles defined by users—without extensive retraining.
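The sketch below illustrates the cross-attention computation at the heart of this conditioning: image tokens form the queries, while text embeddings supply the keys and values. It is a single-head, simplified module with illustrative dimensions, not the exact block used in Stable Diffusion.

```python
import torch
from torch import nn

class CrossAttention(nn.Module):
    """Minimal cross-attention block: image latents attend to text embeddings."""
    def __init__(self, latent_dim=320, text_dim=768, head_dim=64):
        super().__init__()
        self.scale = head_dim ** -0.5
        self.to_q = nn.Linear(latent_dim, head_dim, bias=False)  # queries from image features
        self.to_k = nn.Linear(text_dim, head_dim, bias=False)    # keys from text embeddings
        self.to_v = nn.Linear(text_dim, head_dim, bias=False)    # values from text embeddings
        self.to_out = nn.Linear(head_dim, latent_dim)

    def forward(self, image_tokens, text_tokens):
        q = self.to_q(image_tokens)                  # (B, N_img, d)
        k = self.to_k(text_tokens)                   # (B, N_txt, d)
        v = self.to_v(text_tokens)                   # (B, N_txt, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)                  # each image token weighs all text tokens
        return self.to_out(attn @ v)                 # text-conditioned image features

# toy shapes: 4096 latent tokens (64x64 grid), 77 CLIP text tokens
x = torch.randn(1, 4096, 320)
ctx = torch.randn(1, 77, 768)
out = CrossAttention()(x, ctx)   # (1, 4096, 320)
```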
Personalized Image Generation
Several approaches attempt to personalize diffusion models:
• DreamBooth fine-tunes the entire model on a small set of images to bind a custom concept to a pseudo-word.
• Textual Inversion learns a new word embedding for a custom concept while keeping the model frozen.
• Custom Diffusion reduces overfitting by using regularization images and fine-tuning only key layers.
Despite these advances, existing methods either require excessive computational resources or fail to provide localized editing control, often altering unintended regions.
Methodology
The proposed LCDiffusion method consists of two main phases:
- Fine-tuning the diffusion model to learn custom concepts efficiently.
- Localized image editing using segmentation masks for precise control.
Fine-Tuning the Diffusion Model
Stable Diffusion as Base Framework
The method builds on Stable Diffusion v1.5, which uses a variational autoencoder (VAE) to compress images into latent representations and a U-Net-based diffusion model for denoising.
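As a concrete reference point, the sketch below loads SD v1.5 with the Hugging Face `diffusers` library; the checkpoint identifier and generation settings are illustrative assumptions, not part of the original method.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion v1.5: the pipeline bundles the VAE (latent encoder/decoder),
# the U-Net denoiser, the CLIP text encoder, and the noise scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

vae, unet, text_encoder = pipe.vae, pipe.unet, pipe.text_encoder

# The VAE compresses a 512x512x3 image into a 64x64x4 latent; the U-Net denoises
# in that latent space conditioned on the text embedding.
image = pipe("A photo of a dog on a beach", num_inference_steps=30).images[0]
image.save("sample.png")
```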
Parameter-Efficient Fine-Tuning
Full fine-tuning of large diffusion models is computationally expensive and prone to overfitting. Instead, LCDiffusion employs selective fine-tuning:
• Analysis of Layer-Wise Parameter Changes: Experiments reveal that cross-attention layers undergo the most significant updates during fine-tuning, as they mediate text-image alignment.
• Frozen Base Model with Key-Value (K-V) Updates: Only the key and value matrices in cross-attention layers are fine-tuned, reducing trainable parameters by over 80% compared to full fine-tuning.
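A minimal sketch of this selective fine-tuning is shown below. It assumes the `diffusers` U-Net naming convention, where cross-attention modules are named `attn2` with `to_k`/`to_v` projections; the checkpoint name and learning rate are illustrative.

```python
import torch
from diffusers import UNet2DConditionModel

# Load only the U-Net from the SD v1.5 checkpoint (checkpoint name is an assumption).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

trainable = []
for name, param in unet.named_parameters():
    # In diffusers, "attn1" is self-attention and "attn2" is cross-attention;
    # only the key/value projections of cross-attention stay trainable.
    if "attn2.to_k" in name or "attn2.to_v" in name:
        param.requires_grad_(True)
        trainable.append(param)
    else:
        param.requires_grad_(False)

n_train = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in unet.parameters())
print(f"trainable: {n_train / 1e6:.1f}M of {n_total / 1e6:.1f}M parameters")

optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # learning rate is illustrative
```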
Custom Concept Learning with LoRA
To bind a custom concept (e.g., a user’s pet) to a pseudo-word [P], LCDiffusion:
- Uses a small set of reference images (3–5) and a text template (“A photo of [P] [class]”).
- Optimizes the pseudo-word embedding and the K-V matrices via a combined loss (see the sketch below):
• Reconstruction loss ensures the generated image matches the input.
• Prior preservation loss prevents catastrophic forgetting of the base model’s knowledge.
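The training step below is a schematic of this combined objective, in the spirit of DreamBooth-style prior preservation. Tensor preparation (VAE encoding, noise sampling, prompt encoding) is assumed to follow the standard diffusion training recipe, and the `prior_weight` value is an assumption.

```python
import torch
import torch.nn.functional as F

def training_step(unet, noisy_latents, timesteps, text_emb,
                  prior_noisy_latents, prior_timesteps, prior_text_emb,
                  target_noise, prior_target_noise, prior_weight=1.0):
    """One schematic step of the combined objective.

    Reconstruction term: the U-Net must predict the noise added to latents of the
    concept images, conditioned on the prompt "A photo of [P] <class>".
    Prior-preservation term: the same objective on regularization images with the
    plain class prompt, so the base model's knowledge of the class is retained.
    """
    # reconstruction loss on the 3-5 concept images
    noise_pred = unet(noisy_latents, timesteps,
                      encoder_hidden_states=text_emb).sample
    rec_loss = F.mse_loss(noise_pred, target_noise)

    # prior-preservation loss on regularization images
    prior_pred = unet(prior_noisy_latents, prior_timesteps,
                      encoder_hidden_states=prior_text_emb).sample
    prior_loss = F.mse_loss(prior_pred, prior_target_noise)

    return rec_loss + prior_weight * prior_loss
```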
Regularization for Overfitting Mitigation
To prevent overfitting on limited training data, LCDiffusion introduces regularization images from the LAION-400M dataset. These images, selected via CLIP similarity scoring, help maintain generalization.
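A possible selection step is sketched below using the `transformers` CLIP model, ranking candidate LAION images against a class-level text query; the query text, file names, and threshold are illustrative assumptions, since the exact scoring query is not specified here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image_paths, query="a photo of a dog"):
    """Rank candidate regularization images by CLIP similarity to a text query."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).squeeze(-1)   # cosine similarity per image

# keep the top-scoring candidates as regularization images (paths are placeholders)
candidates = ["cand_0.jpg", "cand_1.jpg", "cand_2.jpg"]
scores = clip_scores(candidates)
keep = [p for p, s in zip(candidates, scores) if s > 0.25]  # threshold is illustrative
```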
Localized Image Editing
Segmentation-Guided Masking
Precise editing requires accurate region selection. LCDiffusion uses Segment Anything Model (SAM) to:
- Segment the input image into editable regions.
- Generate a binary mask distinguishing editable (target) and non-editable (preserved) areas.
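The following sketch shows this mask-generation step with the official `segment_anything` API, prompted by a single user click; the checkpoint path, click coordinates, and file names are placeholders.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (checkpoint path is a placeholder) and segment the editing target
# from a user-provided click, producing a binary mask for the editable region.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

point = np.array([[256, 256]])   # user click on the object to edit
label = np.array([1])            # 1 = foreground
masks, scores, _ = predictor.predict(point_coords=point, point_labels=label,
                                     multimask_output=True)

mask = masks[np.argmax(scores)].astype(np.uint8)  # 1 = editable, 0 = preserved
cv2.imwrite("mask.png", mask * 255)
```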
Conditional Generation with Masks
During inference, the model takes:
- The input image (for structural reference).
- The SAM-generated mask (to localize edits).
- A text prompt (to guide content generation).
The diffusion process is confined to the masked region, ensuring non-edited areas remain unchanged.
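One common way to realize this confinement is latent blending during denoising: at each step, the region outside the mask is overwritten with an appropriately noised copy of the original latents. The loop below is a schematic sketch of that general technique with simplified noise-level matching, not necessarily LCDiffusion's exact inference procedure.

```python
import torch

@torch.no_grad()
def masked_denoise(unet, scheduler, init_latents, mask, text_emb, num_steps=50):
    """Schematic masked denoising: generate inside the mask, preserve outside.

    init_latents: VAE latents of the input image, e.g. shape (1, 4, 64, 64)
    mask:         binary mask resized to latent resolution, broadcastable to latents
    """
    scheduler.set_timesteps(num_steps)
    latents = torch.randn_like(init_latents)  # the edit starts from pure noise

    for t in scheduler.timesteps:
        # denoise the full latent, guided by the editing prompt
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

        # re-inject the original content outside the mask at a matching noise level
        # (noise-level matching is simplified here for clarity)
        noised_orig = scheduler.add_noise(
            init_latents, torch.randn_like(init_latents), t.unsqueeze(0))
        latents = mask * latents + (1 - mask) * noised_orig

    # final clean blend: keep the untouched original outside the mask
    return mask * latents + (1 - mask) * init_latents  # decode with the VAE afterwards
```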
Experiments and Results
Experimental Setup
• Dataset: DreamBench (30 categories, 5 images per category).
• Metrics:
  • CLIP-T: text-image alignment (higher is better).
  • CLIP-I / DINO-I: image similarity to ground truth.
  • MS-SSIM: structural consistency.
• User study: realism and edit accuracy rated on a 1–5 scale.
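For reproducibility, MS-SSIM can be computed as sketched below with `torchmetrics` (CLIP-T follows the same pattern as the CLIP scoring snippet above, comparing the edited image against the editing prompt); the tensors here are placeholders.

```python
import torch
from torchmetrics.image import MultiScaleStructuralSimilarityIndexMeasure

# MS-SSIM between the edited image and the original, measuring how well the
# overall structure of the scene is preserved after editing.
ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)
edited = torch.rand(1, 3, 512, 512)     # placeholder tensors in [0, 1]
original = torch.rand(1, 3, 512, 512)
print(f"MS-SSIM: {ms_ssim(edited, original):.4f}")
```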
Quantitative Comparison
LCDiffusion outperforms baselines in key metrics:
| Method | CLIP-T ↑ | MS-SSIM ↑ | Training Time (h) ↓ |
|---|---|---|---|
| DiffEdit | 0.1539 | 0.6659 | 80 |
| Custom Diffusion | 0.2892 | 0.7129 | 3 |
| DreamBooth | 0.3011 | 0.7373 | 4 |
| LCDiffusion | 0.3378 | 0.8401 | 2 |
Key findings:
• 12.2% higher CLIP-T than the strongest baseline (DreamBooth), indicating superior text alignment.
• 13.9% higher MS-SSIM than DreamBooth, showing better structural preservation.
• 2-hour training (vs. 3–80 hours for baselines) due to parameter-efficient fine-tuning.
Qualitative Results
Visual comparisons demonstrate LCDiffusion’s advantages:
• Precise Local Edits: Only the masked region (e.g., a dog) is altered, while the background remains intact.
• Concept Consistency: Custom concepts (e.g., a specific dog breed) are faithfully reproduced.
• Minimal Artifacts: Unlike DiffEdit, LCDiffusion avoids blurring or distortion in non-edited regions.
Ablation Studies
Training Parameter Optimization
• Full fine-tuning: 291.95M parameters, 80 hours.
• LCDiffusion (K-V only): 43.25M parameters, 2 hours.
Local Region Selection
• Without masks: LPIPS = 0.6435 (high distortion).
• With SAM masks: LPIPS = 0.3429 (47% improvement).
Custom Concept Generation
• Without concept binding: CLIP-T = 0.2485.
• With LCDiffusion: CLIP-T = 0.6725 (170% gain).
Conclusion
LCDiffusion advances diffusion-based image editing by:
- Efficient Fine-Tuning: Selective updates to cross-attention layers reduce training time and resources.
- Localized Control: SAM-guided masks enable precise edits without altering non-target regions.
- Custom Concept Integration: Pseudo-word binding allows personalized generation with minimal data.
Future work will focus on real-time editing and facial customization, expanding the method’s applicability.
DOI: 10.19734/j.issn.1001-3695.2024.04.0175