Knowledge Distillation Based on Dual-Path Projection Layer and Attention Mechanism

Introduction

Knowledge distillation (KD) has emerged as a powerful technique for model compression, enabling the transfer of knowledge from a large, well-trained teacher model to a smaller, more efficient student model. Traditional KD methods primarily focus on knowledge representation, loss function design, and distillation position selection, often overlooking the critical aspects of feature alignment and fusion. This limitation restricts the student model’s learning capacity, leading to suboptimal performance. To address these challenges, this paper introduces a novel knowledge distillation method that leverages dual-path projection layers and an attention mechanism to enhance feature alignment and fusion, significantly improving the student model’s accuracy.

The proposed method, termed Dual-Path Projection Layer and Attention-based Knowledge Distillation (DPA-KD), aligns student and teacher features both spatially and channel-wise. It employs an adapter module with an integrated attention mechanism to ensure balanced multi-scale feature fusion. Additionally, a lightweight parallel attention mechanism (LPA) facilitates deep feature fusion while utilizing the teacher model’s discriminative classifier for inference. Experimental results on CIFAR-100 and Tiny-ImageNet datasets demonstrate substantial improvements in student model performance, validating the effectiveness of feature alignment and fusion in knowledge distillation.

Background and Motivation

Convolutional Neural Networks (CNNs) have achieved remarkable success in computer vision tasks such as image classification, object detection, and semantic segmentation. However, as model performance improves, their computational and memory demands also increase, making deployment on edge devices challenging. Knowledge distillation offers a solution by transferring knowledge from a large teacher model to a compact student model, reducing computational overhead while maintaining performance.

Existing KD methods can be broadly categorized into response-based, relation-based, and feature-based distillation. Response-based methods align the output layers of teacher and student models but fail to utilize intermediate feature representations. Feature-based distillation, on the other hand, mimics intermediate layer outputs, directly enhancing student model performance. However, these methods often involve complex parameter tuning and increased computational complexity.

Recent approaches, such as SimKD, reuse the teacher model’s classifier for student inference and employ L2 loss for feature alignment. While effective, these methods rely on a single projection layer, which may not fully capture diverse feature representations. Additionally, most distillation techniques lack mechanisms to distinguish feature importance, leading to inefficient knowledge transfer.
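The single-projection baseline can be sketched as follows. This is a minimal illustration, not the SimKD implementation: the feature shapes and the single 1×1 projection are assumptions chosen only to show the idea of aligning student features to the teacher's channel width and penalizing the mismatch with an L2 loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature shapes for illustration only.
s_feat = torch.randn(4, 64, 8, 8)    # student penultimate features
t_feat = torch.randn(4, 256, 8, 8)   # teacher penultimate features

# A single 1x1 projection maps student channels to the teacher's width,
# so the teacher's classifier can be reused on the projected features.
proj = nn.Conv2d(64, 256, kernel_size=1)
aligned = proj(s_feat)

# L2 (MSE) feature-alignment loss between projected student and teacher features.
loss = F.mse_loss(aligned, t_feat)
```

Because all knowledge passes through this one projection, diverse feature views are collapsed into a single mapping, which motivates the dual-path design below.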

To overcome these limitations, DPA-KD introduces a dual-path projection strategy, an attention-based adapter module, and a lightweight parallel attention mechanism. These innovations collectively improve feature alignment, reduce computational costs, and enhance the student model’s ability to focus on critical features.

Methodology

Dual-Path Projection Layer Strategy

The dual-path projection layer strategy is designed to enhance feature alignment between teacher and student models. Unlike traditional single-path approaches, this method employs two distinct projection layers, each serving a unique purpose.

The first projection layer focuses on basic feature alignment and dimension matching. It consists of a series of convolutional operations, including 1×1 convolutions for channel reduction, batch normalization for stability, and ReLU activation for non-linearity. A 3×3 depthwise convolution further refines local feature representations before expanding the channels to match the teacher model’s dimensions.
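A minimal sketch of this first path, assuming hypothetical channel counts (64 student channels, 256 teacher channels) and a reduction width of half the student channels; the paper's exact widths are not specified here:

```python
import torch
import torch.nn as nn

def make_path1(c_s, c_t, c_mid=None):
    """First projection path: 1x1 channel reduction -> BN/ReLU ->
    3x3 depthwise refinement -> 1x1 expansion to the teacher's width."""
    c_mid = c_mid or max(c_s // 2, 8)
    return nn.Sequential(
        nn.Conv2d(c_s, c_mid, 1, bias=False),   # channel reduction
        nn.BatchNorm2d(c_mid),                  # stability
        nn.ReLU(inplace=True),                  # non-linearity
        nn.Conv2d(c_mid, c_mid, 3, padding=1,
                  groups=c_mid, bias=False),    # depthwise local refinement
        nn.BatchNorm2d(c_mid),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_t, 1, bias=False),   # expand to teacher channels
        nn.BatchNorm2d(c_t),
    )

path1 = make_path1(64, 256)
out = path1(torch.randn(2, 64, 8, 8))
```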

The second projection layer adopts a more sophisticated design, utilizing depthwise separable convolutions and feature concatenation. This layer enhances feature diversity by processing inputs through multiple depthwise separable convolutions, each followed by batch normalization and ReLU activation. The outputs are concatenated and refined using a 1×1 convolution, ensuring seamless integration with subsequent layers.
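The second path can be sketched like this; the number of parallel branches (two here) and the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DWSeparable(nn.Sequential):
    """Depthwise separable block: 3x3 depthwise conv + 1x1 pointwise conv,
    each followed by batch normalization and ReLU."""
    def __init__(self, c_in, c_out):
        super().__init__(
            nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

class Path2(nn.Module):
    """Second projection path: parallel depthwise-separable branches,
    concatenation, then a 1x1 conv to the teacher's channel count."""
    def __init__(self, c_s, c_t, branches=2):
        super().__init__()
        self.branches = nn.ModuleList(DWSeparable(c_s, c_s) for _ in range(branches))
        self.fuse = nn.Conv2d(c_s * branches, c_t, 1, bias=False)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

out = Path2(64, 256)(torch.randn(2, 64, 8, 8))
```

The concatenation lets each branch learn a different view of the student features before the 1×1 convolution merges them.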

Spatial alignment is achieved through nearest-neighbor interpolation, adjusting student feature dimensions to match those of the teacher model. Channel alignment is then performed using the dual-path projection layers, ensuring compatibility between student and teacher features.
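The spatial step reduces to a single interpolation call; the feature shapes below are assumed for illustration:

```python
import torch
import torch.nn.functional as F

s_feat = torch.randn(2, 64, 8, 8)     # student features (smaller spatial size)
t_feat = torch.randn(2, 256, 16, 16)  # teacher features

# Spatial alignment first: nearest-neighbor interpolation to the teacher's H x W.
# Channel alignment (64 -> 256) is then handled by the dual-path projection layers.
s_resized = F.interpolate(s_feat, size=t_feat.shape[-2:], mode="nearest")
```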

Adapter Module with Attention Mechanism

The adapter module plays a crucial role in feature matching and fusion. It integrates global information extraction, attention mechanisms, residual connections, and scaling factors to enhance feature representation.

The global information extraction module captures spatial context using adaptive average pooling, flattening, and fully connected layers. This module summarizes feature statistics and reconstructs them for further processing.

The attention mechanism, based on SimAM, dynamically adjusts feature weights to emphasize important regions while suppressing irrelevant details. Residual connections preserve original feature information, preventing information loss during transformation. A scaling factor fine-tunes the influence of enhanced features, optimizing knowledge transfer.
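SimAM is parameter-free, so the attention step can be sketched directly from its published energy function; the regularizer `lam` is a standard small constant, and the shapes are illustrative:

```python
import torch

def simam(x, lam=1e-4):
    """SimAM attention: weight each activation by an energy term computed
    from its deviation from the channel-wise spatial mean and variance."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation
    v = d.sum(dim=(2, 3), keepdim=True) / n             # spatial variance
    e_inv = d / (4 * (v + lam)) + 0.5                   # inverse energy
    return x * torch.sigmoid(e_inv)                     # reweighted features

y = simam(torch.randn(2, 64, 8, 8))
```

Activations that stand out from their channel's statistics receive larger weights, which is how the adapter emphasizes important regions without adding learned parameters.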

The adapter module combines global features, attention-weighted features, and residual features, producing a refined output that improves student model learning.

Lightweight Parallel Attention Mechanism (LPA)

The LPA module enhances feature fusion through a multi-branch architecture, including local, global, and sequential convolutional branches.

The local branch processes features using non-overlapping patches, computing attention weights to highlight fine-grained details. The global branch follows a similar approach but focuses on broader contextual information. The sequential convolutional branch employs a series of 3×3 convolutions to extract hierarchical features.
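The local branch might be sketched as below. This is an assumption-laden illustration of patch-wise attention, not the paper's exact branch: the patch size `p`, the pooled scoring, and the softmax normalization are all hypothetical choices.

```python
import torch
import torch.nn.functional as F

def patch_attention(x, p=4):
    """Local-branch sketch: split features into non-overlapping p x p patches,
    score each patch by its pooled response, and reweight accordingly."""
    pooled = F.avg_pool2d(x, p)                        # one score per patch
    weights = torch.softmax(pooled.flatten(2), dim=-1).view_as(pooled)
    # Broadcast each patch's weight back over its p x p region.
    return x * F.interpolate(weights, scale_factor=p, mode="nearest")

y = patch_attention(torch.randn(2, 64, 8, 8), p=4)
```

The global branch would apply the same pattern with larger patches (or the whole map), and the sequential branch stacks plain 3×3 convolutions.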

Feature fusion is performed using an attention module combining Efficient Channel Attention (ECA) and SimAM. This module refines features by computing channel and spatial attention weights, ensuring optimal feature representation. A feature enhancement module further adjusts student features based on their similarity to teacher features, improving distillation efficiency.
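The ECA half of this fusion module follows the published ECA design and can be sketched compactly; the kernel size `k=3` and feature shapes are illustrative:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling, a 1D convolution
    across the channel dimension, and a sigmoid gate per channel."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                      # (B, C) channel descriptor
        w = self.conv(w.unsqueeze(1)).squeeze(1)    # local cross-channel interaction
        return x * torch.sigmoid(w)[:, :, None, None]

y = ECA()(torch.randn(2, 64, 8, 8))
```

Pairing ECA's channel weights with SimAM's spatial weights gives the fusion module both channel-wise and position-wise selectivity at negligible parameter cost.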

Experimental Results

Performance on CIFAR-100

Experiments on CIFAR-100 demonstrate the superiority of DPA-KD over baseline and state-of-the-art distillation methods. In teacher-student pairs with identical architectures (e.g., WRN-40-2 and WRN-40-1), DPA-KD achieves a 5.85% improvement in top-1 accuracy compared to the baseline student model. Even when compared to advanced methods like SimKD and CAT-KD, DPA-KD maintains a performance lead.

For dissimilar architectures (e.g., ResNet-32×4 and WRN-16-2), DPA-KD outperforms KD by 4.99% and SimKD by 3.39%. Notably, in several cases, the student model’s accuracy surpasses that of the teacher model, highlighting the method’s effectiveness.

Performance on Tiny-ImageNet

DPA-KD’s robustness is further validated on the more complex Tiny-ImageNet dataset. In the VGG-19 to VGG-8 distillation task, DPA-KD improves accuracy by 6.72% over traditional KD and 2.63% over SimKD. For ResNet-34 to ResNet-10 distillation, the gains are even more pronounced, with a 7.18% improvement over KD and 2.26% over SimKD.

Ablation Study

Ablation studies confirm the contributions of each DPA-KD component. The dual-path projection layer alone provides a noticeable improvement over single-path approaches. Adding the adapter module further enhances performance, while the lightweight parallel attention mechanism delivers the most significant gains. The full DPA-KD configuration, combining all components, achieves the highest accuracy, validating the synergistic effect of feature alignment and fusion.

Conclusion

The proposed DPA-KD method addresses key limitations in existing knowledge distillation techniques by introducing dual-path projection layers, an attention-based adapter module, and a lightweight parallel attention mechanism. These innovations enable precise feature alignment, efficient multi-scale fusion, and enhanced student model performance.

Experimental results on CIFAR-100 and Tiny-ImageNet demonstrate consistent improvements over baseline and state-of-the-art methods. The student models not only achieve higher accuracy but, in some cases, surpass their teacher models. While the additional projection layers introduce computational overhead, the performance gains justify their inclusion.

Future research could explore adaptive mechanisms to dynamically adjust projection layer complexity, further optimizing computational efficiency. Additionally, extending DPA-KD to other domains, such as natural language processing or reinforcement learning, could broaden its applicability.

doi.org/10.19734/j.issn.1001-3695.2024.07.0229
