Unified Efficient Fine-Tuning Framework Based on Efficient Tuning Methods and Its Applications

Introduction

The rapid development of large-scale pre-trained models has revolutionized both natural language processing (NLP) and computer vision (CV). However, fine-tuning these massive models for downstream tasks remains computationally expensive and resource-intensive. To address this challenge, parameter-efficient fine-tuning (PEFT) methods have emerged, allowing models to adapt to new tasks with minimal parameter updates. While individual PEFT techniques such as adapters, prefix tuning, and low-rank adaptation (LoRA) have shown promising results, their combined potential remains largely unexplored. Existing approaches often integrate different methods independently without considering their underlying similarities, leading to suboptimal efficiency and performance.

This paper introduces the Efficient Transformer Tuning Architecture (ETTA), a unified framework that systematically combines parallel adapters and a novel scaled prefix tuning variant. By analyzing the fundamental similarities between adapters and prefix tuning, we establish a theoretical foundation for their integration. The proposed architecture is optimized for computer vision tasks, including image classification and object detection, demonstrating that minimal parameter updates can achieve performance comparable to full fine-tuning.

Background and Related Work

Parameter-Efficient Fine-Tuning Methods

Traditional fine-tuning involves updating all parameters of a pre-trained model, which becomes impractical as model sizes grow. PEFT methods address this by freezing most parameters and introducing lightweight trainable components.

Adapters are small neural modules inserted between layers of a pre-trained model. The original serial adapters process features sequentially, but parallel adapters have since been developed to operate concurrently with the main model, improving efficiency and performance.
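A parallel adapter can be sketched as a bottleneck module whose output is added to the frozen sublayer's output rather than chained after it. The following is a minimal illustration (module and dimension names are our own, not from the paper):

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Bottleneck adapter applied in parallel with a frozen sublayer (sketch)."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))

# Parallel composition: the adapter branch is added to the frozen sublayer's output.
def ffn_with_parallel_adapter(ffn: nn.Module, adapter: ParallelAdapter, x: torch.Tensor):
    return ffn(x) + adapter(x)
```

The serial variant would instead feed the sublayer output through the adapter; the parallel form leaves the frozen pathway untouched and only learns an additive correction.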

Prefix tuning prepends trainable vectors to the key and value matrices in the attention mechanism, allowing the model to learn task-specific contextual cues. While effective, standard prefix tuning requires a large number of parameters to achieve strong performance.
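Concretely, prefix tuning concatenates `l` trainable vectors onto the key and value sequences before attention is computed, while the query path is untouched. A single-head sketch (all names and the initialization scale are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PrefixSelfAttention(nn.Module):
    """Single-head self-attention with l trainable prefix vectors on K and V (sketch)."""
    def __init__(self, d_model: int, num_prefix: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.prefix_k = nn.Parameter(torch.randn(num_prefix, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(num_prefix, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        b = x.size(0)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Prepend the learned prefixes so every query also attends to them.
        k = torch.cat([self.prefix_k.unsqueeze(0).expand(b, -1, -1), k], dim=1)
        v = torch.cat([self.prefix_v.unsqueeze(0).expand(b, -1, -1), v], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return attn @ v
```

Only `prefix_k` and `prefix_v` need to be trained when the projection weights are frozen, so the per-layer cost is `2 * l * d_model` parameters.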

Low-Rank Adaptation (LoRA) decomposes weight updates into low-rank matrices, reducing trainable parameters while preserving model capacity.
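LoRA's update can be sketched as a frozen linear layer plus a trainable rank-`r` correction scaled by `alpha / r`; with the up-projection initialized to zero, training starts from the pre-trained behavior. A minimal sketch (hyperparameter defaults are our assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) B A x (sketch)."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

The trainable parameter count is `r * (d_in + d_out)` per layer instead of `d_in * d_out`, which is the source of the parameter savings.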

Recent studies have explored combining these methods, but most simply insert multiple PEFT modules without optimizing their interactions. A unified approach that leverages their shared principles could significantly improve efficiency.

Transformer Models in Computer Vision

Originally developed for NLP, Transformer architectures have been successfully adapted to vision tasks through models like Vision Transformer (ViT) and Detection Transformer (DETR). These models rely on self-attention mechanisms to process image patches or object queries, making them suitable for PEFT techniques developed in NLP.

Previous work has applied adapters and LoRA to vision Transformers, but no unified framework has been proposed to integrate multiple PEFT methods systematically.

Unified Efficient Fine-Tuning Architecture

Similarities Between Adapters and Prefix Tuning

A key insight of this work is that adapters and prefix tuning share fundamental operational principles. Both methods apply a transformation to the input features, followed by a residual connection to preserve the original model’s behavior.

For prefix tuning, the attention mechanism can be decomposed into contributions from the original input and the prefix vectors. This decomposition reveals that prefix tuning effectively applies a parallel transformation similar to adapters, where the prefix vectors act as an auxiliary pathway.
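One way to write this decomposition explicitly, following the unified view of PEFT methods (here \(C\) denotes the input context, \(P_k, P_v\) the prefix matrices, and \(\lambda(x)\) the attention mass assigned to prefix positions; the notation is ours):

```latex
\mathrm{Attn}\bigl(xW_q,\;[P_k;\,CW_k],\;[P_v;\,CW_v]\bigr)
  = \bigl(1-\lambda(x)\bigr)\,
      \underbrace{\mathrm{Attn}\bigl(xW_q,\,CW_k,\,CW_v\bigr)}_{\text{original attention}}
  \;+\; \lambda(x)\,
      \underbrace{\mathrm{Attn}\bigl(xW_q,\,P_k,\,P_v\bigr)}_{\text{prefix pathway}}
```

The second term is an input-gated parallel branch added onto the unmodified attention output, mirroring the residual, additive structure of a parallel adapter.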

This similarity suggests that both methods can be unified under a shared parameter budget, with adapters handling feed-forward transformations and prefix tuning specializing in attention-based adaptations.

Scaled Prefix Tuning Variant

Standard prefix tuning requires a large number of prefix vectors to achieve strong performance, increasing parameter overhead. To address this, we introduce a scaled prefix tuning variant inspired by LoRA’s learnable scaling factor.

By incorporating a tunable scalar multiplier, the scaled prefix variant achieves higher precision with fewer parameters. Experiments show that this modification allows the model to match or exceed the performance of standard prefix tuning while using significantly fewer trainable parameters.
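A minimal sketch of this idea, assuming the scalar multiplies the prefix pathway of the decomposed attention (class and parameter names are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

def attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    w = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    return w @ v

class ScaledPrefixAttention(nn.Module):
    """Attention whose prefix pathway is weighted by a learnable scalar (sketch)."""
    def __init__(self, d_model: int, num_prefix: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.prefix_k = nn.Parameter(torch.randn(num_prefix, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(num_prefix, d_model) * 0.02)
        self.scale = nn.Parameter(torch.tensor(1.0))  # LoRA-inspired learnable scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        base = attn(q, k, v)                               # frozen pathway
        b = x.size(0)
        pk = self.prefix_k.unsqueeze(0).expand(b, -1, -1)
        pv = self.prefix_v.unsqueeze(0).expand(b, -1, -1)
        prefix = attn(q, pk, pv)                           # prefix pathway
        return base + self.scale * prefix
```

Because the scalar can amplify or attenuate the whole prefix branch, fewer prefix vectors are needed to reach a given effect size than in standard prefix tuning.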

Architecture Configuration

ETTA integrates parallel adapters and scaled prefix tuning into a cohesive framework. The optimal configuration places:

• Parallel adapters in the feed-forward network (FFN) layers with a high bottleneck dimension (~400). The FFN’s global processing benefits from larger adapter modules.

• Scaled prefix tuning in the multi-head attention layers with a small number of prefix vectors (~50). Attention mechanisms are more sensitive to local adjustments, requiring fewer parameters.

This allocation strategy maximizes performance while minimizing redundant parameters. The total parameter budget is distributed as:

\[ U = \max\bigl(U_{\text{parallel adapter}}\bigr) + \min\bigl(U_{\text{scaled prefix tuning}}\bigr) \]
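Under this allocation, the adapter modules dominate the per-layer budget. As a rough illustration (assuming a hidden size of d = 1024 for ViT-L, with the b = 400 bottleneck and l = 50 prefix vectors from the text; the exact hidden size is our assumption):

```python
# Illustrative per-layer trainable parameter budget under the ETTA allocation.
d, b, l = 1024, 400, 50

adapter_params = 2 * d * b + b + d   # down/up projections with biases
prefix_params = 2 * l * d            # prefix K and V vectors
total = adapter_params + prefix_params

print(adapter_params, prefix_params, total)  # 820624 102400 923024
```

The FFN-side adapters consume roughly eight times the parameters of the attention-side prefixes, consistent with the claim that attention needs only small, local adjustments.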

Application to Vision Tasks

ETTA is designed for compatibility with standard vision Transformer models:

  1. ET-ViT: Integrates ETTA into ViT by inserting parallel adapters after FFN layers and scaled prefix tuning in the attention modules.
  2. ET-DETR: Adapts DETR for efficient fine-tuning, reducing training time and resource requirements.
  3. ET-Deformable DETR: Extends the approach to Deformable DETR, maintaining performance with minimal parameter updates.

These adaptations demonstrate ETTA’s versatility across classification and detection tasks.
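In all three variants, the common recipe is to freeze the backbone and mark only the inserted PEFT parameters as trainable. A hedged sketch of that freezing step (the keyword convention is our assumption, not the paper's code):

```python
import torch.nn as nn

def mark_peft_trainable(model: nn.Module,
                        keywords: tuple = ("adapter", "prefix", "scale")) -> float:
    """Freeze the backbone; train only parameters whose names match PEFT keywords (sketch).

    Returns the fraction of parameters left trainable.
    """
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```

Applied to a ViT or DETR backbone with adapter and prefix modules inserted, this yields the small trainable fractions (4–7%) reported in the experiments.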

Experimental Results

Validation of Scaled Prefix Tuning

Experiments on ViT-L/16 with CIFAR-100 compare scaled prefix tuning against standard prefix tuning:

• With just 1.1% of parameters tuned, scaled prefix tuning achieves 83.2% accuracy, outperforming standard prefix tuning at 2% parameter tuning (80.2%).

• At 7% parameter tuning, scaled prefix tuning reaches 94.8% accuracy, matching standard prefix tuning’s performance at 10% tuning.

These results confirm that the scaled variant improves parameter efficiency without sacrificing accuracy.

Optimal Configuration Analysis

Testing different adapter and prefix tuning placements on ImageNet-1k reveals:

• Parallel adapters in FFN layers outperform those in attention layers, as FFNs benefit more from high-dimensional transformations.

• Combining high-bottleneck adapters (b = 400) with a small number of prefix vectors (l = 50) yields the best accuracy (87.5%) with just 4.2% of parameters tuned.

This configuration forms the basis of ETTA, balancing parameter efficiency and task performance.

Performance on Vision Tasks

Image Classification

On CIFAR-100:
• Full fine-tuning achieves 84.4% accuracy.

• ET-ViT reaches 84.5% accuracy with only 5.6% of parameters tuned.

On ImageNet-1k:
• Full fine-tuning achieves 87.76% accuracy.

• ET-ViT attains 87.5% accuracy with 4% parameter tuning.

Object Detection

On COCO2017:
• Full fine-tuning of DETR-R101 yields 43.5 AP.

• ET-DETR achieves 42.9 AP with 6.2% parameter tuning.

On BigDetection:
• Full fine-tuning reaches 31.3 AP.

• ET-DETR surpasses this with 31.5 AP at 5.4% tuning.

Similar results hold for Deformable DETR, demonstrating ETTA’s broad applicability.

Comparison with Other PEFT Methods

Benchmarking against adapters, LoRA-adapters, and prefix tuning shows:

  1. Image Classification (CIFAR-100):
    • ETTA achieves 84.5% accuracy (5.6% parameters).

    • Adapters: 77.4% (5.2%), LoRA-adapters: 81.3% (6.1%), prefix tuning: 75.4% (4.7%).

  2. Object Detection (COCO2017):
    • ETTA reaches 42.9 AP (6.2% parameters).

    • Adapters: 36.7 AP (6.1%), LoRA-adapters: 39.8 AP (7.4%), prefix tuning: 31.4 AP (5.5%).

ETTA consistently outperforms other methods in accuracy while maintaining competitive parameter efficiency. Additionally, it reduces training time and GPU memory usage compared to standalone PEFT modules.

Conclusion

The ETTA framework presents a systematic approach to unifying adapter and prefix tuning methods for efficient fine-tuning of Transformer models. By analyzing their underlying similarities and optimizing their integration, ETTA achieves near-full fine-tuning performance with minimal parameter updates.

Key contributions include:

  1. Theoretical justification for combining adapters and prefix tuning based on shared operational principles.
  2. Introduction of scaled prefix tuning, which enhances parameter efficiency.
  3. A unified configuration strategy that maximizes performance for vision tasks.

Experiments on image classification and object detection validate ETTA’s effectiveness, demonstrating superior accuracy and efficiency compared to existing PEFT methods.

Future work will explore ETTA’s applicability to additional vision models and tasks, further refining its parameter allocation strategies. The framework’s modular design also opens possibilities for incorporating other PEFT techniques, paving the way for next-generation efficient fine-tuning solutions.

doi.org/10.19734/j.issn.1001-3695.2024.07.0264
