A Comprehensive Overview of the EMO-GAN-Based Malicious URL Detection Framework

Introduction

The widespread adoption of the World Wide Web has been accompanied by an alarming increase in cyber threats, making the security of Uniform Resource Locators (URLs) a critical research focus in cybersecurity. Malicious URLs serve as primary vectors for various cyberattacks, including phishing, SQL injection, and cross-site scripting (XSS). These attacks exploit carefully crafted URLs to deceive users or bypass security mechanisms, leading to severe consequences such as data breaches, financial losses, and identity theft. Traditional approaches to detecting malicious URLs, such as blacklisting, have proven insufficient due to their reactive nature and inability to adapt to evolving threats. Consequently, machine learning-based methods have gained prominence for their ability to identify complex patterns and generalize across diverse attack vectors.

Despite these advances, existing machine learning models face three major challenges: (1) difficulty acquiring sufficient labeled data, (2) inadequate feature representation, and (3) concept drift caused by the dynamic nature of cyber threats. To address these challenges, this paper introduces the EMO-GAN-based Malicious URL Detection Framework (EMO-GANUDF), which integrates semi-supervised learning, generative adversarial networks (GANs), and online learning to improve detection accuracy and adaptability.

Background and Related Work

Malicious URL Detection Approaches

Malicious URL detection methods can be broadly categorized into blacklisting and machine learning-based techniques. Blacklisting relies on maintaining databases of known malicious URLs and comparing incoming URLs against these lists. While simple to implement, blacklists suffer from significant limitations, including delayed updates and susceptibility to evasion tactics. For instance, Prakash et al. proposed PhishNet, which expands blacklist coverage by generating heuristic-based URL variants and employing approximate matching strategies. However, blacklists remain inherently reactive and fail to detect novel threats.

Machine learning-based approaches, on the other hand, leverage feature extraction and classification algorithms to identify malicious URLs. These methods can be further divided into supervised, semi-supervised, and unsupervised techniques. Supervised learning models, such as those based on random forests or convolutional neural networks (CNNs), require large labeled datasets for training. Semi-supervised learning mitigates the need for extensive labeled data by leveraging both labeled and unlabeled samples. For example, Chen et al. demonstrated that semi-supervised learning could improve detection performance even with limited labeled data.

Generative Adversarial Networks in Cybersecurity

GANs have emerged as powerful tools for generating synthetic data that closely resembles real-world samples. In cybersecurity, GANs are used to augment datasets, particularly in scenarios where labeled malicious samples are scarce. For instance, Zheng et al. proposed a GAN-based model to generate synthetic malicious URLs, enhancing the diversity of training data. Similarly, Pham et al. employed WGAN-GP to generate phishing URLs, demonstrating improved detection accuracy.

However, GANs alone are insufficient for addressing the challenges of imbalanced datasets and concept drift. Recent studies have explored the integration of GANs with semi-supervised learning to enhance model robustness. MarginGAN, proposed by Dong et al., incorporates margin theory to mitigate the impact of incorrect pseudo-labels, improving classification performance in semi-supervised settings.

Online Learning for Adaptive Detection

The dynamic nature of cyber threats necessitates models capable of adapting to new attack patterns. Online learning enables continuous model updates using streaming data, addressing concept drift without requiring full retraining. Zhang et al. applied online learning algorithms, such as passive-aggressive and confidence-weighted methods, to detect malicious webpages in real time. Verma et al. further demonstrated the effectiveness of online learning in URL classification, achieving high accuracy and robustness.

Despite these advancements, existing methods often lack comprehensive solutions to the intertwined challenges of data scarcity, feature representation, and concept drift. The EMO-GANUDF framework bridges this gap by combining feature engineering, semi-supervised GANs, and online learning into a unified system.

The EMO-GANUDF Framework

Overview

The EMO-GANUDF framework consists of two core modules: feature engineering and detection modeling. The feature engineering module extracts multi-dimensional features from URLs, while the detection model integrates semi-supervised learning and GANs to train a robust classifier. The classifier supports online learning, enabling continuous adaptation to new threats.

Data Preprocessing

Raw URL datasets often contain noise, such as ambiguous entries or encoded characters. The preprocessing stage involves:

  1. Filtering Ambiguous URLs: Removing URLs that cannot be definitively classified as malicious or benign.
  2. URL Decoding: Converting encoded characters (e.g., hexadecimal representations) back to their original form.
  3. Non-ASCII Handling: Replacing non-ASCII characters with a standardized placeholder (e.g., “str”).

These steps ensure data consistency and improve feature extraction accuracy.
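
The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, "ambiguous" is interpreted here as "the same URL appears with conflicting labels", and collapsing a run of non-ASCII characters into a single placeholder is an assumption. Only the placeholder string "str" comes from the text.

```python
import re
from urllib.parse import unquote

PLACEHOLDER = "str"  # placeholder named in the text
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")


def filter_ambiguous(samples):
    """Drop URLs that occur with more than one label (assumed meaning of 'ambiguous')."""
    seen = {}
    for url, label in samples:
        seen.setdefault(url, set()).add(label)
    return [(url, label) for url, label in samples if len(seen[url]) == 1]


def preprocess_url(url):
    """Decode percent-encoded characters, then replace non-ASCII runs."""
    decoded = unquote(url.strip())
    return NON_ASCII_RUN.sub(PLACEHOLDER, decoded)
```

For example, `preprocess_url("http://a.com/%3Cscript%3E")` yields a URL whose path contains the literal `<script>` tag, which the later keyword and character features can then see directly.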

Feature Engineering

The framework employs three complementary feature extraction methods to capture diverse aspects of URL characteristics:

  1. Statistical Features
    Statistical features are derived from expert knowledge and empirical analysis, focusing on structural and lexical properties of URLs. Key features include:
    • Domain-Level Features: Domain length, number of subdomains, presence of hyphens, and top-level domain (TLD) length. Malicious URLs often exhibit longer domains and multiple subdomains to mimic legitimate sites.
    • Path-Level Features: Path length and number of slashes. Attackers frequently manipulate paths to hide malicious payloads.
    • Protocol and Query Features: Use of HTTPS (indicating secure connections) and the number of query parameters. Anomalies in query strings may signal SQL injection or XSS attempts.
    • URL-Level Features: Total URL length, presence of redirects, and use of URL shortening services. Long URLs or those with redirects are often associated with phishing.
    • Character and Keyword Frequencies: Analysis of character distributions and specific keywords (e.g., “login,” “admin,” or SQL syntax) to identify patterns indicative of malicious intent.
    A total of 128 statistical features are extracted, providing a robust foundation for classification.

  2. Character Features
    Character features capture the raw ASCII representation of URLs, encoding each character into its corresponding ASCII value (range 33–127). To maintain uniformity, URLs are padded or truncated to a fixed length of 64 characters, and the resulting vectors are normalized. This approach preserves granular character-level information while keeping the input dimension fixed.

  3. Lexical Features
    Lexical features leverage natural language processing techniques to model semantic relationships between URL components. The extraction process involves:
    1. Tokenization: Splitting URLs into meaningful tokens using regular expressions.
    2. Dictionary Construction: Selecting the top 500 most discriminative tokens based on TF-IDF scores.
    3. Word Embedding Training: Using Word2Vec’s skip-gram model to generate 64-dimensional word vectors. The final lexical feature for a URL is the average of its constituent word vectors.
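
The three feature families described above can be illustrated with a short sketch. It makes several assumptions that the text does not confirm: only a handful of the 128 statistical features are shown, the subdomain count is a naive label count, character vectors are zero-padded and normalized by dividing by 127, the tokenizer splits on non-alphanumeric characters, and `embeddings` is a plain dictionary standing in for trained Word2Vec skip-gram vectors. All function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def statistical_features(url):
    """A small, illustrative subset of the statistical features."""
    parts = urlsplit(url)  # assumes the URL includes a scheme such as http://
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        "domain_length": len(host),
        # Naive count: ignores multi-part suffixes such as co.uk.
        "num_subdomains": max(len(labels) - 2, 0),
        "has_hyphen": int("-" in host),
        "tld_length": len(labels[-1]) if len(labels) > 1 else 0,
        "path_length": len(parts.path),
        "num_slashes": parts.path.count("/"),
        "uses_https": int(parts.scheme == "https"),
        "num_query_params": len(parse_qsl(parts.query, keep_blank_values=True)),
    }


def character_features(url, length=64, lo=33, hi=127):
    """ASCII codes in [lo, hi], truncated or zero-padded to `length`, scaled by hi."""
    codes = [ord(c) for c in url if lo <= ord(c) <= hi][:length]
    codes += [0] * (length - len(codes))
    return [c / hi for c in codes]


def tokenize(url):
    """Split on runs of non-alphanumeric characters (assumed tokenization rule)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def lexical_features(url, embeddings, dim=64):
    """Average the vectors of in-dictionary tokens; zeros if none are known."""
    vectors = [embeddings[t] for t in tokenize(url) if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

In this sketch, tokens outside the dictionary are simply skipped when averaging. How the paper handles out-of-dictionary tokens is not stated in the text.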

Detection Model

The detection model is built upon MarginGAN, enhanced with an online learning-capable classifier. The model comprises three components:

  1. Generator
    The generator synthesizes realistic URL samples to challenge the discriminator and improve classifier robustness. By minimizing the cross-entropy between pseudo-labels and classifier predictions, the generator produces samples that enhance the classifier’s decision boundaries.

  2. Discriminator
    The discriminator distinguishes between real (labeled and unlabeled) and generated samples. This adversarial training process refines the discriminator’s ability to identify subtle anomalies in URLs.

  3. Classifier
    The classifier is a multi-layer perceptron (MLP) with five hidden layers. Each hidden layer feeds its own output layer, and the final prediction is a weighted ensemble of these per-layer outputs. The hedge algorithm dynamically adjusts the layer weights based on each layer's performance, so that both shallow and deep representations contribute to the prediction.

Key innovations include:
• Margin-Based Optimization: The classifier maximizes margins for correctly classified samples while minimizing the influence of mislabeled data.

• Online Learning: The classifier updates its parameters incrementally using new data, addressing concept drift without full retraining.
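
The weighted ensemble over per-layer outputs can be sketched with the classic Hedge update, in which each weight is multiplied by a discount factor raised to that output's loss and the weights are then renormalized. The class name, the discount value `beta`, and the assumption that losses lie in [0, 1] are illustrative choices, not details taken from the paper.

```python
class HedgeEnsemble:
    """Weighted combination of per-layer outputs, reweighted with Hedge."""

    def __init__(self, num_heads, beta=0.9):
        if not 0.0 < beta < 1.0:
            raise ValueError("beta must lie strictly between 0 and 1")
        self.beta = beta
        self.weights = [1.0 / num_heads] * num_heads  # start uniform

    def predict(self, head_probs):
        """head_probs[i]: probability of 'malicious' from output layer i."""
        return sum(w * p for w, p in zip(self.weights, head_probs))

    def update(self, head_losses):
        """Hedge step: w_i <- w_i * beta ** loss_i, then renormalize."""
        scaled = [w * self.beta ** loss
                  for w, loss in zip(self.weights, head_losses)]
        total = sum(scaled)
        self.weights = [s / total for s in scaled]
```

Outputs that keep incurring high loss lose weight geometrically, so the ensemble can lean on shallow layers early in training and shift toward deeper ones as they become accurate.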

Experimental Evaluation

Datasets and Setup

The primary dataset, Malicious URLs, contains 651,191 samples (428,103 benign and 223,088 malicious). Subsets were created to simulate imbalanced and label-scarce scenarios, with varying ratios of benign-to-malicious URLs (1:1, 9:1, 99:1) and labeled data proportions (5%, 10%, 100%). Additional datasets (e.g., Phishing Site URLs, Spam URLs) were used to assess generalization.
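
One way to build such subsets is sketched below. The text does not describe the paper's sampling procedure, so this is an assumption: the function name, the fixed seed, and the choice to subsample both classes at random are all hypothetical.

```python
import random


def make_subset(benign, malicious, ratio, labeled_fraction, seed=0):
    """Sample a benign:malicious = ratio:1 subset, then keep labels for a fraction."""
    rng = random.Random(seed)
    n_mal = min(len(malicious), len(benign) // ratio)
    n_ben = n_mal * ratio
    samples = [(url, 0) for url in rng.sample(benign, n_ben)]
    samples += [(url, 1) for url in rng.sample(malicious, n_mal)]
    rng.shuffle(samples)
    n_labeled = round(labeled_fraction * len(samples))
    labeled = samples[:n_labeled]                       # (url, label) pairs
    unlabeled = [url for url, _ in samples[n_labeled:]]  # labels hidden
    return labeled, unlabeled
```

Calling `make_subset(benign, malicious, ratio=99, labeled_fraction=0.05)` would correspond to the most demanding setting listed above: a 99:1 class ratio with only 5% of samples labeled.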

Feature Extraction Performance

Experiments compared statistical, character, lexical, and hybrid features across multiple classifiers (e.g., SVM, ET, CNN). Results demonstrated that hybrid features consistently outperformed individual feature sets, achieving 99% accuracy and 84% F1-score on the Malicious URLs dataset.

Model Comparison

EMO-GANUDF was evaluated against baseline models, including:
• Semi-Supervised Models: Co-training and self-training variants.

• GAN-Based Models: SGAN and MarginGAN.

• Traditional Models: Random forests and CNNs.

EMO-GANUDF achieved superior performance, particularly in imbalanced settings (99:1 ratio), with an F1-score of 0.84, compared to 0.80 for SGAN and 0.75 for self-training.
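
The emphasis on F1-score in imbalanced settings is worth a worked example: at a 99:1 ratio, a model that labels every URL benign reaches 99% accuracy while detecting nothing. The helper below is a generic illustration (not code from the paper) that treats malicious as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 10,000 URLs at 99:1, with every URL predicted benign:
all_benign = binary_metrics(tp=0, fp=0, fn=100, tn=9900)
```

Here `all_benign["accuracy"]` is 0.99 while `all_benign["f1"]` is 0.0, which is why accuracy alone says little about detection quality on skewed data.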

Online Learning Assessment

The classifier’s online learning capability was tested by incrementally introducing new samples. Initial accuracy (77%) improved to 85% after 40,000 updates, demonstrating effective adaptation to concept drift.
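
An evaluation of this kind is commonly run as a test-then-train loop: each incoming sample is first used to score the current model and only then used to update it. The sketch below assumes that protocol, which the text does not confirm, and uses a tiny online logistic regression as a stand-in for the paper's classifier. All names are hypothetical.

```python
import math


class OnlineLogistic:
    """Stand-in online model: logistic regression trained one sample at a time."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        err = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def prequential_accuracy(model, stream, window=100):
    """Test-then-train: score each sample before learning from it."""
    hits, curve = [], []
    for i, (x, y) in enumerate(stream, start=1):
        predicted = model.predict_proba(x) >= 0.5
        hits.append(int(predicted == bool(y)))
        model.update(x, y)
        if i % window == 0:
            curve.append(sum(hits[-window:]) / window)
    return curve
```

The returned curve gives accuracy per window of the stream, which is one way to observe the kind of improvement over successive updates reported above.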

Cross-Dataset Validation

Tests on four external datasets confirmed EMO-GANUDF’s generalization, with F1-scores ranging from 0.65 to 0.97. The model excelled in detecting both generic and specialized threats (e.g., phishing, spam).

Conclusion

The EMO-GANUDF framework addresses critical challenges in malicious URL detection by integrating advanced feature engineering, semi-supervised GANs, and online learning. Its hybrid feature extraction method captures diverse URL characteristics, while the MarginGAN-based detection model ensures robustness against imbalanced and label-scarce data. The online learning classifier enables continuous adaptation, making the framework suitable for real-world deployment.

Future work will explore incorporating contextual features (e.g., webpage content) and enhancing performance in extreme data imbalance scenarios. The framework’s modular design allows for seamless integration of new techniques, paving the way for next-generation URL threat detection systems.

doi.org/10.19734/j.issn.1001-3695.2024.04.0212
