Image Inpainting with Bidirectional-Aware Transformer and Frequency Analysis Strategies

Introduction

Image inpainting, the task of reconstructing missing or damaged regions of an image, has evolved beyond simple pixel restoration to encompass semantic understanding and high-fidelity content generation. Despite these advances, existing methods often struggle to produce visually coherent results, particularly for large missing regions or complex textures. Two primary challenges hinder progress: (1) the degradation of high-frequency content, which causes reconstructions to deviate from the spectral structure of the original image, and (2) the limited receptive field of convolutional neural networks (CNNs), which restricts their ability to model non-local relationships.

To address these issues, this paper introduces BAT-Freq (Bidirectional-Aware Transformer and Frequency Analysis), a novel image inpainting network that combines a bidirectional-aware Transformer with a frequency-guided reconstruction strategy. The proposed method enhances both global structural coherence and fine-grained texture details by leveraging multi-scale frequency decomposition and long-range contextual modeling.

Background and Motivation

Challenges in Image Inpainting

Traditional image inpainting methods, such as diffusion-based and patch-based techniques, fill missing regions by propagating or copying surrounding background content. While effective for small gaps, these approaches fail to generate semantically meaningful content for large missing areas. Deep learning-based methods, particularly those built on CNNs and generative adversarial networks (GANs), have improved performance by learning from large datasets. However, CNNs suffer from limited receptive fields, often resulting in blurry or artifact-laden outputs.

Transformers, widely successful in natural language processing, have shown promise in vision tasks due to their ability to capture long-range dependencies via self-attention. However, their computational complexity grows quadratically with input size, making them difficult to train for high-resolution images. Additionally, existing methods often neglect the importance of frequency-domain information, leading to imbalanced reconstructions where high-frequency details are suppressed in favor of low-frequency structures.

Key Innovations

BAT-Freq addresses these limitations through two main innovations:

  1. Bidirectional-Aware Transformer (BAT): A novel Transformer architecture that expands the receptive field using n-gram-based bidirectional context modeling and dense uniform attention (DUA). This enables the network to capture both local and global structural relationships efficiently.
  2. Frequency Analysis Guidance: A multi-band frequency decomposition strategy using wavelet transforms, where low-frequency components are reconstructed with L1 loss for structural consistency, and high-frequency components are refined using adversarial training for detail preservation. A Hybrid Feature Adaptive Normalization (HFAN) module aligns spatial and frequency-domain features to minimize artifacts.

Methodology

Overview

The BAT-Freq architecture consists of three core components:

  1. Bidirectional-Aware Transformer Repair Bottleneck: Processes encoded features at multiple resolutions to model long-range dependencies.
  2. Frequency Analysis Guidance Network: Decomposes features into low- and high-frequency sub-bands for separate reconstruction.
  3. Dual Discriminators: A spatial-domain discriminator and a high-frequency discriminator ensure realism in both domains.

Bidirectional-Aware Transformer (BAT)

The BAT module enhances the standard Transformer by introducing two key mechanisms:

  1. N-Gram Bidirectional Context Modeling: Inspired by language models, n-gram units are defined as overlapping local windows where pixels interact via self-attention. By aggregating forward and backward context, the network captures broader structural patterns.
  2. Dense Uniform Attention (DUA): A lightweight attention mechanism that stabilizes training by uniformly distributing attention gradients. DUA supplements self-attention with global average pooling, ensuring dense interactions without excessive computational overhead.

These modifications allow BAT to efficiently model large-scale image structures while maintaining computational feasibility.
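
The paper does not include reference code, so the following PyTorch sketch shows only one plausible reading of the DUA idea: tokens from a local window attend to each other as usual, and a globally pooled context vector is added back uniformly to every position. The module name, head count, and linear projection are illustrative assumptions, not the authors' exact design.

    import torch
    import torch.nn as nn

    class DenseUniformAttention(nn.Module):
        # Illustrative sketch: window self-attention supplemented with a
        # globally pooled context vector that every token receives uniformly.
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.global_proj = nn.Linear(dim, dim)

        def forward(self, x):
            # x: (B, N, C) tokens gathered from overlapping forward/backward
            # windows (the "n-gram" neighborhoods described above)
            local, _ = self.attn(x, x, x)  # standard window self-attention
            g = self.global_proj(x.mean(dim=1, keepdim=True))  # global average pooling
            return local + g  # dense, uniform global contribution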

Frequency Analysis Guidance

Multi-Band Decomposition

Input features are decomposed using Haar wavelet transforms into the following sub-bands (a minimal sketch follows the list):

• Low-Frequency (LL): Encodes global structure and smooth regions.

• High-Frequency (LH, HL, HH): Captures edges, textures, and fine details.
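
As a concrete illustration, a single-level Haar decomposition can be computed directly on 2x2 pixel blocks. This is a minimal sketch; sign and normalization conventions for the sub-bands vary between implementations.

    import torch

    def haar_dwt(x: torch.Tensor):
        # x: (B, C, H, W) with even H and W; returns the four Haar
        # sub-bands, each of shape (B, C, H/2, W/2).
        a = x[:, :, 0::2, 0::2]  # top-left of each 2x2 block
        b = x[:, :, 0::2, 1::2]  # top-right
        c = x[:, :, 1::2, 0::2]  # bottom-left
        d = x[:, :, 1::2, 1::2]  # bottom-right
        ll = (a + b + c + d) / 2  # local average: structure, smooth regions
        lh = (a + b - c - d) / 2  # vertical detail (horizontal edges)
        hl = (a - b + c - d) / 2  # horizontal detail (vertical edges)
        hh = (a - b - c + d) / 2  # diagonal detail
        return ll, lh, hl, hh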

Reconstruction Strategies

  1. Low-Frequency Reconstruction (Re-LowFreq): Utilizes residual blocks with dilated convolutions to restore structural coherence; reconstructed LL components are skip-connected to preserve contour information (a sketch of such a block follows this list).
  2. High-Frequency Reconstruction (Re-HighFreq): Adversarial training refines high-frequency details, and HFAN aligns these components with spatial features to ensure consistency.
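
For Re-LowFreq, a plausible building block is a residual unit whose first convolution is dilated to enlarge the receptive field. The sketch below is our own minimal version; the channel count and dilation rate are assumptions, not the paper's reported configuration.

    import torch.nn as nn

    class DilatedResBlock(nn.Module):
        # Residual block with a dilated convolution, as might be stacked
        # to restore the low-frequency (LL) sub-band.
        def __init__(self, channels=64, dilation=2):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
            )

        def forward(self, x):
            return x + self.body(x)  # residual path preserves contour detail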

Hybrid Feature Adaptive Normalization (HFAN)

HFAN addresses feature misalignment between domains via:

  1. Intra-Frequency Alignment: Self-attention aggregates low-frequency features to guide high-frequency reconstruction.
  2. Cross-Domain Modulation: Spatial features dynamically adjust frequency-domain statistics using adaptive instance normalization (AdaIN), harmonizing the outputs (see the sketch after this list).
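
The cross-domain modulation step can be pictured as a standard AdaIN: the frequency-domain feature map is normalized, then re-scaled and re-shifted with statistics taken from the spatial feature map. This is a minimal sketch; the exact statistics and where they are predicted may differ in the paper.

    import torch

    def cross_domain_adain(freq_feat, spatial_feat, eps=1e-5):
        # Normalize the frequency-domain feature map, then modulate it
        # with the per-channel statistics of the spatial feature map.
        mu_f = freq_feat.mean(dim=(2, 3), keepdim=True)
        std_f = freq_feat.std(dim=(2, 3), keepdim=True) + eps
        mu_s = spatial_feat.mean(dim=(2, 3), keepdim=True)
        std_s = spatial_feat.std(dim=(2, 3), keepdim=True)
        return std_s * (freq_feat - mu_f) / std_f + mu_s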

Loss Functions

The training objective combines:

  1. Adversarial Loss: Applied to both spatial and high-frequency outputs for realism.
  2. L1 Reconstruction Loss: Constrains low-frequency components for structural fidelity.
  3. Perceptual Loss: Uses VGG-19 features to enhance semantic consistency.

A regularization term (R1) stabilizes discriminator training.
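
Putting the terms together, the generator objective might be assembled as below. The loss weights, the non-saturating adversarial form, and the R1 coefficient are illustrative assumptions, not values reported in the paper.

    import torch
    import torch.nn.functional as F

    def generator_loss(fake_logits_spatial, fake_logits_freq,
                       out_ll, gt_ll, vgg_out, vgg_gt,
                       w_adv=0.01, w_l1=1.0, w_perc=0.1):
        # Non-saturating adversarial terms from both discriminators
        adv = (F.softplus(-fake_logits_spatial).mean()
               + F.softplus(-fake_logits_freq).mean())
        rec = F.l1_loss(out_ll, gt_ll)  # L1 on the low-frequency reconstruction
        perc = sum(F.l1_loss(fo, fg) for fo, fg in zip(vgg_out, vgg_gt))  # VGG-19 features
        return w_adv * adv + w_l1 * rec + w_perc * perc

    def r1_penalty(real_logits, real_inputs, gamma=10.0):
        # R1 gradient penalty stabilizing discriminator training;
        # real_inputs must have requires_grad=True.
        grad, = torch.autograd.grad(real_logits.sum(), real_inputs, create_graph=True)
        return (gamma / 2) * grad.pow(2).sum(dim=(1, 2, 3)).mean()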

Experiments

Datasets and Baselines

Evaluations were conducted on three datasets:

• CelebA-HQ: 30,000 aligned face images.

• Places2: 1.8 million natural scene images.

• Paris StreetView: Urban landscape images.

Comparisons were made against state-of-the-art methods: EC, ICT, WaveFill, MAT, CoordFill, and ORFNet.

Quantitative Results

BAT-Freq achieved consistent improvements across metrics:

• PSNR: +2.804 dB average gain.

• SSIM: +8.13% higher structural similarity.

• MAE/LPIPS: Reduced by 0.0158 and 0.0962, respectively.

Notably, the method excelled in large-mask scenarios (40–60% occlusion), demonstrating robust structural recovery.

Qualitative Analysis

  1. CelebA-HQ: BAT-Freq restored facial contours and features with sharper detail than competing methods, which often produced blurry or semantically inconsistent results.
  2. Paris StreetView: Architectural elements (e.g., windows, doors) were reconstructed with accurate geometry and texture, while other methods introduced distortions.
  3. Places2: Complex natural scenes exhibited coherent textures and fewer artifacts compared to the patchy or oversmoothed outputs of baseline models.

Ablation Studies

Key findings:

• Removing frequency guidance led to color artifacts and blurring.

• Disabling HFAN caused misaligned features and visible seams.

• BAT outperformed vanilla Swin Transformers in SSIM by 2.1%, validating its context-aware design.

Conclusion

BAT-Freq advances image inpainting by unifying frequency-aware reconstruction with bidirectional global modeling. Its multi-band decomposition alleviates frequency conflicts, while the BAT module ensures semantically plausible content generation. Future work may explore extensions to multi-modal inpainting (e.g., text- or pose-guided generation) and higher-resolution applications.

For further details, refer to the full paper: https://doi.org/10.19734/j.issn.1001-3695.2024.05.0214
