Bilateral Parallel Local Attention Vision Transformer for Small Datasets

Introduction

Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) for computer vision tasks, demonstrating superior performance on large-scale datasets. However, when trained from scratch on small datasets, ViTs struggle to match the performance of CNNs of comparable scale. This limitation arises primarily from the lack of strong inductive biases in ViTs, which CNNs inherently possess through their local receptive fields and weight-sharing mechanisms.

To address this challenge, researchers have explored various approaches to enhance the data efficiency of ViTs, such as incorporating convolutional operations or hierarchical structures. While these methods have improved performance on medium-sized datasets like ImageNet-1K, a significant performance gap remains when training on smaller datasets. This paper introduces a novel architecture called the Bilateral Parallel Local Attention Vision Transformer (BPLAT), which combines image-based and semantics-based local attention mechanisms to improve ViT performance on small datasets.

Background and Motivation

Traditional ViTs rely on global self-attention mechanisms that compute relationships between all image patches, leading to high computational complexity and a strong dependence on large-scale training data. When trained on small datasets, ViTs often fail to learn meaningful local representations, resulting in suboptimal performance. Recent studies have shown that introducing local inductive biases can significantly improve ViT data efficiency.

Image-based local attention methods, such as those used in Swin Transformers, partition the image into local windows and compute self-attention within these windows. While this approach reduces computational complexity and enhances local feature extraction, it may lose important long-range dependencies between semantically related but spatially distant patches. Conversely, semantics-based local attention focuses on grouping patches based on their feature similarities rather than their spatial proximity, potentially capturing more meaningful relationships.

The key insight behind BPLAT is that combining these two complementary attention mechanisms can enhance the model’s ability to learn from limited data while maintaining computational efficiency.

Methodology

Overview of BPLAT

BPLAT replaces the standard multi-head self-attention (MHSA) module in ViTs with two parallel branches: an Image-Based Local Attention (IBLA) module and a Semantics-Based Local Attention (SBLA) module. The IBLA module operates similarly to Swin Transformer’s windowed attention, computing self-attention within fixed spatial windows. The SBLA module, on the other hand, groups patches based on their feature similarities using K-means clustering and computes attention within these semantic clusters.

The outputs of both modules are concatenated and processed through a feed-forward network with depthwise separable convolutions to further enhance local feature extraction. This parallel architecture allows the model to simultaneously capture both spatially local and semantically related features, improving its ability to learn from small datasets.
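The parallel design above can be sketched in a few lines. The sketch below is a minimal shape-level illustration, not the paper's implementation: the two branches are passed in as stand-in callables, and `w_out` is a hypothetical projection that maps the concatenated branch outputs back to the token dimension.

```python
import numpy as np

def bplat_block(x, ibla, sbla, w_out):
    """Parallel-branch sketch: run both local-attention branches on the
    same tokens, concatenate along channels, then project back.
    `ibla`/`sbla` stand in for the two attention modules."""
    y = np.concatenate([ibla(x), sbla(x)], axis=-1)  # (N, 2C)
    return x + y @ w_out                              # residual + output projection

x = np.random.randn(16, 8)                  # 16 tokens, 8 channels
w_out = np.random.randn(16, 8) * 0.1        # hypothetical (2C, C) projection
# placeholder branches (identity) just to check shapes flow through
y = bplat_block(x, lambda t: t, lambda t: t, w_out)
print(y.shape)  # (16, 8)
```

In the real model each branch would receive its own share of the attention heads; the feed-forward network with depthwise separable convolutions would follow this block.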

Image-Based Local Attention (IBLA)

The IBLA module follows the design principles of Swin Transformer, where the input image is divided into non-overlapping local windows. Self-attention is computed independently within each window, significantly reducing computational complexity compared to global attention. This approach introduces a strong spatial inductive bias, forcing the model to focus on local patterns first before potentially learning longer-range dependencies in deeper layers.
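Windowed attention of this kind can be sketched as follows. This is a simplified illustration (identity Q/K/V projections, single head), not the Swin or BPLAT implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window=2):
    """Self-attention computed independently inside non-overlapping
    spatial windows. x: (H, W, C) grid of patch embeddings."""
    H, W, C = x.shape
    out = np.zeros_like(x)
    for i in range(0, H, window):
        for j in range(0, W, window):
            w = x[i:i+window, j:j+window].reshape(-1, C)  # tokens in one window
            attn = softmax(w @ w.T / np.sqrt(C))          # window-local attention
            out[i:i+window, j:j+window] = (attn @ w).reshape(window, window, C)
    return out

x = np.random.randn(4, 4, 8)
y = window_attention(x, window=2)
print(y.shape)  # (4, 4, 8)
```

Because each token attends only to the tokens in its own window, the cost grows linearly with the number of windows rather than quadratically with the total number of patches.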

Semantics-Based Local Attention (SBLA)

The SBLA module represents the key innovation of BPLAT. Instead of grouping patches based on their spatial locations, SBLA clusters patches based on their feature similarities in the embedding space. This is achieved through spherical K-means clustering, where:

  1. Learnable cluster centroids are maintained as model parameters
  2. Input patches are assigned to clusters based on their similarity to these centroids
  3. Self-attention is computed only within each cluster

This approach allows the model to establish connections between semantically similar patches regardless of their spatial distance in the image. For example, different parts of an object or similar objects in different image regions can be grouped together, enabling more efficient learning of meaningful visual concepts.
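The three steps above can be sketched as follows. This is a minimal single-head illustration of cluster-then-attend, with illustrative names; the paper's actual module additionally balances cluster sizes and learns the centroids:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_cluster_attention(x, centroids):
    """Assign patches to their nearest centroid by cosine similarity,
    then compute self-attention only within each cluster."""
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)            # unit-length patches
    cn = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    assign = (xn @ cn.T).argmax(axis=-1)                          # step 2: nearest centroid
    out = np.zeros_like(x)
    for k in range(centroids.shape[0]):
        idx = np.where(assign == k)[0]
        if idx.size == 0:
            continue
        xc = x[idx]
        attn = softmax(xc @ xc.T / np.sqrt(x.shape[-1]))          # step 3: cluster-local attention
        out[idx] = attn @ xc
    return out, assign

x = np.random.randn(16, 8)          # 16 patch embeddings
centroids = np.random.randn(4, 8)   # step 1: learnable centroids (random here)
y, assign = semantic_cluster_attention(x, centroids)
print(y.shape, assign.shape)
```

Note that cluster membership depends only on feature similarity, so two patches at opposite corners of the image can attend to each other if they land in the same cluster.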

The clustering process is implemented efficiently on GPUs by:
• Normalizing queries and keys to unit length for spherical clustering
• Using top-k selection to ensure balanced cluster sizes
• Updating cluster centroids via an exponential moving average during training
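The balanced assignment and centroid update can be sketched as below. This is an illustrative approximation under stated assumptions: top-k per centroid may assign a patch to more than one cluster, which real implementations typically resolve, and the momentum value is hypothetical.

```python
import numpy as np

def balanced_assign(x, centroids, cluster_size):
    """Each centroid takes its top-k most similar patches (by cosine
    similarity), so every cluster has exactly `cluster_size` members."""
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    sim = cn @ xn.T                                   # (K, N) centroid-patch similarities
    return np.argsort(-sim, axis=1)[:, :cluster_size] # top-k patch indices per centroid

def ema_update(centroids, x, groups, momentum=0.99):
    """Exponential-moving-average centroid update during training."""
    new = centroids.copy()
    for k, idx in enumerate(groups):
        new[k] = momentum * centroids[k] + (1 - momentum) * x[idx].mean(axis=0)
    return new

x = np.random.randn(16, 8)
centroids = np.random.randn(4, 8)
groups = balanced_assign(x, centroids, cluster_size=4)
centroids = ema_update(centroids, x, groups)
print(groups.shape, centroids.shape)
```

Fixed-size clusters keep the per-cluster attention matrices uniformly shaped, which is what makes the whole operation batchable on a GPU.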

Training and Optimization

BPLAT is trained with a combination of standard classification loss (cross-entropy) and a clustering loss that encourages compact, well-separated clusters. The model uses standard ViT training techniques including AdamW optimization, cosine learning rate scheduling, and extensive data augmentation.
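A plausible form of this combined objective is sketched below. The compactness term (one minus cosine similarity to the assigned centroid) and the weighting `lam` are assumptions for illustration; the paper's exact clustering loss may differ.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy over class logits."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def clustering_loss(x, centroids, assign):
    """Compactness term: pull each patch toward its assigned centroid
    (hypothetical form: 1 - cosine similarity)."""
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    return (1 - (xn * cn[assign]).sum(axis=-1)).mean()

logits = np.random.randn(8, 10)              # batch of 8, 10 classes
labels = np.random.randint(0, 10, size=8)
x = np.random.randn(8, 16)                   # patch features
centroids = np.random.randn(4, 16)
assign = np.random.randint(0, 4, size=8)
lam = 0.1                                    # hypothetical loss weight
total = cross_entropy(logits, labels) + lam * clustering_loss(x, centroids, assign)
print(total)
```

Both terms are differentiable with respect to the patch features, so the clustering objective shapes the embedding space jointly with the classification task.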

Experimental Results

Performance on Small Datasets

BPLAT was evaluated on several small-scale datasets including CIFAR-10, CIFAR-100, and subsets of DomainNet. The results demonstrate significant improvements over both conventional ViTs and strong CNN baselines:

On CIFAR-10:
• BPLAT-T (tiny variant) achieves 96.95% accuracy with only 5.8M parameters
• BPLAT-S (small variant) reaches 97.51% accuracy
• BPLAT-B (base variant) achieves a state-of-the-art 97.93% accuracy

On CIFAR-100:
• BPLAT models outperform all competing ViT variants
• The base model achieves 85.80% accuracy, surpassing CNN baselines such as DenseNet and ResNeXt

These results are particularly notable given the small parameter counts and computational requirements of BPLAT compared to traditional ViTs.

Analysis and Ablation Studies

Several ablation studies were conducted to understand the contributions of different components:

  1. Effectiveness of SBLA: When used alone, SBLA achieves comparable performance to IBLA (83.06% vs 83.01% on CIFAR-100), but their combination yields the best results (83.63%), demonstrating complementary benefits.

  2. Attention Head Configuration: Optimal performance is achieved when splitting attention heads equally between IBLA and SBLA branches.

  3. Cluster Size: Experiments show that setting each cluster to contain 64 patches (out of 256 total) works best, balancing between information richness and noise reduction.

Visualization and Interpretation

Attention map visualizations reveal that SBLA successfully groups semantically related patches across the image. For example:
• In motorcycle images, SBLA clusters different motorcycle parts together
• In nature scenes, similar branches from different trees are grouped despite spatial separation

This demonstrates SBLA’s ability to capture meaningful non-local relationships that IBLA might miss.

Discussion

The success of BPLAT can be attributed to several key factors:

  1. Enhanced Inductive Biases: By combining spatial and semantic locality constraints, BPLAT provides stronger guidance for learning from limited data.

  2. Computational Efficiency: Both attention mechanisms maintain linear complexity with respect to input size, making BPLAT practical for various applications.

  3. Complementary Representations: IBLA and SBLA capture different aspects of visual structure, with IBLA excelling at local texture patterns and SBLA at object-level semantics.

While BPLAT shows excellent performance on small datasets, there remains room for improvement in scaling to larger datasets where hierarchical ViT architectures still hold an advantage. This suggests potential future directions in combining BPLAT’s approach with hierarchical designs.

Conclusion

The Bilateral Parallel Local Attention Vision Transformer presents an effective solution to the challenge of training ViTs on small datasets. By combining image-based and semantics-based local attention mechanisms in parallel branches, BPLAT achieves superior performance while maintaining computational efficiency. The architecture’s ability to capture both spatial and semantic relationships makes it particularly suitable for applications where training data is limited but model efficiency is crucial.

Future research directions include extending this approach to other vision tasks like object detection and segmentation, as well as investigating more sophisticated clustering algorithms for the semantics-based attention branch. The success of BPLAT opens new possibilities for deploying transformer-based models in resource-constrained scenarios where large-scale pretraining is impractical.

doi.org/10.19734/j.issn.1001-3695.2023.11.0643
