A Comprehensive Overview of Semi-Supervised Under-Sampling Method for Anomaly Detection in Industrial Control Systems with Class Imbalance and Overlap

Introduction

Industrial Control Systems (ICS) play a critical role in monitoring and controlling industrial processes across vital infrastructure sectors such as energy, transportation, and power distribution. The security of these systems has become a national priority due to increasing cyber threats. Anomaly detection is a crucial component of ICS security, aiming to identify potential faults or malicious activities that could compromise system reliability. However, ICS anomaly detection faces significant challenges, including limited labeled data, class imbalance, and class overlap, which collectively degrade the performance of conventional classifiers.

Class imbalance occurs when one class (typically the minority class representing anomalies) is significantly underrepresented compared to the majority class (normal operations). Class overlap refers to regions in the feature space where samples from different classes exhibit similar characteristics, making them difficult to distinguish. These two problems often coexist, creating a complex scenario that traditional supervised learning methods struggle to address. While data-level and algorithm-level approaches have been proposed to mitigate these issues, existing methods suffer from inaccurate pseudo-labeling, unstable sampling performance, and poor overlap detection rates.

To overcome these limitations, this paper introduces a Semi-Supervised Learning Under-Sampling method derived from Label Propagation (SSLU-LP). This approach integrates heterogeneous ensemble learning with label propagation and one-class classification to generate accurate pseudo-labels for unlabeled data. Additionally, it employs a Minimum Spanning Tree (MST) strategy to detect overlapping regions and a nearest-neighbor-based under-sampling technique to selectively remove majority class samples. The proposed method is evaluated on nine ICS datasets and compared against multiple baseline algorithms, demonstrating superior performance in anomaly detection.

Background and Challenges

Class Imbalance and Overlap in ICS Data

Class imbalance is particularly problematic in ICS anomaly detection because anomalies are rare events compared to normal operations. Traditional classifiers tend to be biased toward the majority class, leading to poor detection rates for minority class instances. Meanwhile, class overlap exacerbates the problem by introducing ambiguity in decision boundaries. When samples from different classes share similar feature representations, classifiers struggle to distinguish between them, resulting in increased misclassification rates.

Limitations of Existing Methods

Current approaches to handling class imbalance and overlap can be broadly categorized into data-level and algorithm-level methods. Data-level techniques, such as oversampling and under-sampling, aim to rebalance datasets by either generating synthetic minority samples or removing redundant majority samples. However, oversampling can introduce noise and worsen overlap, while random under-sampling may discard informative majority samples.

Algorithm-level methods modify classifier training to prioritize minority or overlapping samples. While effective in some cases, these approaches often require extensive labeled data, which is scarce in ICS environments. Semi-supervised learning (SSL) has emerged as a promising alternative, leveraging both labeled and unlabeled data to improve model generalization. However, existing SSL methods rely on low-density assumptions, where decision boundaries are presumed to lie in sparsely populated regions. This assumption does not always hold in ICS data, leading to inaccurate label propagation.

Proposed Method: SSLU-LP

The SSLU-LP framework addresses these challenges through a three-stage process: pseudo-label generation, overlap region detection, and under-sampling.

Pseudo-Label Generation

The first stage employs a heterogeneous ensemble approach combining One-Class Support Vector Machine (OCSVM) and Label Propagation (LP) to assign pseudo-labels to unlabeled data. OCSVM is used to identify an initial decision boundary by separating normal operations from potential anomalies. The LP mechanism then propagates labels to neighboring unlabeled samples based on feature similarity. A dual-validation mechanism ensures label consistency, improving pseudo-label accuracy.
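The agreement idea behind the dual-validation step can be sketched with scikit-learn. This is a minimal illustration, not the paper's implementation: the function name `pseudo_label`, the label encoding (0 = normal, 1 = anomaly, -1 = unlabeled), the parameters `nu` and `n_neighbors`, and the rule "accept a pseudo-label only when both models agree" are all assumptions made for the example.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation
from sklearn.svm import OneClassSVM


def pseudo_label(X, y, nu=0.1, n_neighbors=7):
    """Hypothetical helper: y uses 0 = normal, 1 = anomaly, -1 = unlabeled."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    unlabeled = y == -1

    # Label propagation spreads the known labels along a k-nearest-neighbour graph.
    lp = LabelPropagation(kernel="knn", n_neighbors=n_neighbors)
    lp.fit(X, y)
    lp_labels = lp.transduction_

    # One-class SVM fitted on the labeled normal samples only.
    # predict() returns +1 for inliers and -1 for outliers.
    ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    ocsvm.fit(X[y == 0])
    oc_labels = np.where(ocsvm.predict(X) == 1, 0, 1)

    # Assumed validation rule: keep a pseudo-label only if both views agree;
    # samples on which they disagree stay unlabeled (-1).
    accept = unlabeled & (lp_labels == oc_labels)
    out = y.copy()
    out[accept] = lp_labels[accept]
    return out
```

Samples that remain at -1 are the ones on which the two models disagree; how the paper resolves such cases is not stated in this overview.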

Overlap Region Detection

Once pseudo-labels are assigned, the method identifies overlapping regions using an MST-based strategy. A minimum spanning tree is built over a graph whose nodes are samples and whose edge weights are the distances between them. Regions with high inter-class connectivity, where tree edges link samples of different classes, are flagged as overlapping because the samples they contain are difficult to distinguish.
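One simple way to read "inter-class connectivity" is to flag every sample that sits on an MST edge joining two different classes. The sketch below uses SciPy under that assumption; the function name `mst_overlap_mask` and the flagging rule are illustrative, and the paper's exact criterion may differ.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_overlap_mask(X, y):
    """Hypothetical helper: mark samples lying on an MST edge that joins two classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Dense pairwise Euclidean distances. Note: SciPy treats zero entries as
    # "no edge", so exact duplicate points would need de-duplication first.
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).tocoo()

    # An edge is "cross-class" when its two endpoints carry different labels.
    cross = y[mst.row] != y[mst.col]
    mask = np.zeros(len(X), dtype=bool)
    mask[mst.row[cross]] = True
    mask[mst.col[cross]] = True
    return mask
```

The dense distance matrix costs O(n^2) memory, so for large datasets a sparse k-nearest-neighbour graph would be a more practical input to the same routine.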

Under-Sampling Strategy

In the final stage, a selective under-sampling technique removes majority class samples from overlapping regions. For each minority sample, the algorithm computes two distances: the intra-class distance to its nearest minority neighbor and the inter-class distance to its nearest majority neighbor. If the nearest majority neighbor is closer than the nearest minority neighbor, that majority sample is removed. This preserves critical decision boundaries while reducing class imbalance.
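The removal rule described above can be sketched as a single pass over the minority samples. The function name `nn_undersample` and the `minority` argument are assumptions for illustration; the sketch needs at least two minority samples and one majority sample, and it does not restrict itself to a previously detected overlap region as the full method would.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nn_undersample(X, y, minority=1):
    """Hypothetical helper: drop majority samples that crowd a minority sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)

    # Intra-class distance: ask for 2 neighbours because the closest hit
    # for a training point is the point itself.
    intra = NearestNeighbors(n_neighbors=2).fit(X[min_idx])
    d_intra = intra.kneighbors(X[min_idx])[0][:, 1]

    # Inter-class distance and the index of the nearest majority sample.
    inter = NearestNeighbors(n_neighbors=1).fit(X[maj_idx])
    d_inter, nn_maj = inter.kneighbors(X[min_idx])

    # Remove the majority neighbour when it is closer than the nearest
    # same-class neighbour.
    crowded = d_inter[:, 0] < d_intra
    drop = np.unique(maj_idx[nn_maj[crowded, 0]])
    keep = np.ones(len(X), dtype=bool)
    keep[drop] = False
    return X[keep], y[keep]
```

Because only the single nearest majority neighbour is examined, one pass can leave other close majority samples in place; repeating the pass until nothing is removed is one possible extension.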

Experimental Evaluation

Datasets and Baselines

The proposed method was tested on nine publicly available datasets: BHP, Gas Pipeline, Power, NSLKDD, ISCX, WST, Firewall, SWaT, and WADI. These datasets exhibit varying degrees of class imbalance and overlap, with missing-label ratios ranging from 15% to 52%. SSLU-LP was compared against nine hybrid methods combining different semi-supervised learning and under-sampling techniques, as well as ten baseline under-sampling algorithms.

Performance Metrics

Three key metrics were used to evaluate performance:

  1. Sensitivity (Recall): Measures the ability to correctly identify minority class samples.
  2. Area Under the Curve (AUC): The area under the ROC curve, summarizing the trade-off between true positive rate and false positive rate across decision thresholds.
  3. G-mean: The geometric mean of sensitivity and specificity, providing a balanced assessment of classifier performance.
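With sensitivity TP / (TP + FN) and specificity TN / (TN + FP), the G-mean is the square root of their product. A minimal sketch of computing all three metrics with scikit-learn follows; the function name `imbalance_metrics` and the encoding (1 = anomaly, 0 = normal) are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def imbalance_metrics(y_true, y_pred, y_score):
    """Hypothetical helper: y_pred holds hard labels, y_score holds anomaly scores."""
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # recall on the minority (anomaly) class
    specificity = tn / (tn + fp)  # recall on the majority (normal) class
    return {
        "sensitivity": float(sensitivity),
        "specificity": float(specificity),
        "g_mean": float(np.sqrt(sensitivity * specificity)),
        "auc": float(roc_auc_score(y_true, y_score)),
    }
```

Because the G-mean multiplies the two rates, a classifier that labels everything as normal scores zero regardless of how high its accuracy is, which is why the metric suits imbalanced data.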

Results

SSLU-LP consistently outperformed baseline methods across all datasets. For instance, on the BHP dataset, SSLU-LP achieved a sensitivity of 76.19%, a 49.16% improvement over the next best method. Similarly, on the WADI dataset, it achieved an AUC of 99.65%, higher than competing approaches. The method also remained robust under varying levels of class imbalance and overlap, which suggests it adapts well to different ICS environments.

Computational Efficiency

SSLU-LP exhibited competitive computational efficiency, with runtime advantages over many baseline methods. For example, on the BHP dataset, it completed execution in 0.15 seconds, compared to an average of 1.087 seconds for other methods. This efficiency makes it suitable for real-time ICS applications where computational resources may be limited.

Conclusion

The SSLU-LP method presents a novel solution to the coupled challenges of class imbalance and overlap in ICS anomaly detection. By integrating semi-supervised learning with advanced under-sampling techniques, it effectively addresses the limitations of existing approaches. Experimental results confirm its superiority in generating accurate pseudo-labels, detecting overlapping regions, and improving classifier performance. The method’s efficiency and adaptability make it a practical choice for enhancing ICS security in real-world scenarios.

doi.org/10.19734/j.issn.1001-3695.2024.06.0195
