Frontier Video Anomaly Detection Methods Based on Deep Learning: A Comprehensive Review

Introduction

Video anomaly detection (VAD) has emerged as a significant research topic in computer vision, with substantial implications for intelligent surveillance systems. The rapid urbanization and increasing population density worldwide have led to a surge in public safety concerns, necessitating advanced monitoring solutions. Traditional surveillance systems rely heavily on manual video analysis, which becomes inefficient and costly when handling massive volumes of video data. Consequently, intelligent surveillance systems capable of autonomously detecting and interpreting abnormal events are in high demand.

Deep learning, particularly convolutional neural networks (CNNs), has demonstrated remarkable success in various computer vision tasks, including object detection, action recognition, and video captioning. This success has inspired numerous deep learning-based approaches for video anomaly detection. Existing surveys on VAD have limitations, such as focusing only on specific detection strategies or lacking coverage of recent advancements. This article provides a systematic and comprehensive review of deep learning-based VAD methods, categorizing them based on detection strategy, sample setting, and learning/inference mechanisms. Additionally, we discuss benchmark datasets, performance comparisons, and future research directions.

Supervised Video Anomaly Detection Methods

Supervised VAD methods leverage labeled training data to distinguish between normal and abnormal events. Early approaches employed binary classification, where CNNs were trained on both normal and abnormal samples. However, these methods suffer from two major limitations: (1) they operate in a closed-set setting, meaning they can only detect known anomaly types present in the training data, and (2) they require strong supervision with fine-grained annotations, increasing labeling costs.

Open-Set Supervised Methods

To address the closed-set limitation, recent studies have explored open-set supervised methods that can detect unknown anomaly types. One approach involves margin learning, where a triplet loss is used to enforce compact feature distributions for normal samples while increasing the separation between normal and known abnormal samples. Another method integrates variational normal inference to enhance boundary learning. Additionally, specialized datasets like UBnormal have been introduced to facilitate fair comparisons between open-set and closed-set approaches.
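The margin-learning idea can be sketched with a standard triplet hinge loss. This is an illustrative, framework-free sketch (the function names are ours, not from any cited paper): an anchor normal feature is pulled toward another normal sample and pushed at least a margin away from a known abnormal sample.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors (lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: keep the anchor (a normal sample) close
    to another normal sample and at least `margin` farther from a known
    abnormal sample. Zero loss once the margin is satisfied."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Minimizing this loss over many triplets compacts the normal-feature cluster while enlarging its separation from known anomalies, which is what lets the boundary also reject unseen anomaly types.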

Weakly Supervised Methods

Weakly supervised methods reduce annotation costs by using video-level labels instead of frame-level annotations. A prominent framework is multiple instance learning (MIL), where videos are treated as “bags” containing multiple segments (“instances”). The MIL ranking loss ensures that the highest-scoring segment in an abnormal video bag exceeds that in a normal video bag. Subsequent improvements include incorporating temporal context, refining loss functions, and enhancing feature representations.
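The core MIL ranking objective is simple enough to state in a few lines. The following is a minimal sketch of the hinge form described above, assuming each video has already been split into segments with per-segment anomaly scores:

```python
def mil_ranking_loss(abnormal_scores, normal_scores, margin=1.0):
    """MIL ranking loss over two bags of segment scores: the top-scoring
    segment of an abnormal video should outrank the top-scoring segment
    of a normal video by at least `margin` (hinge form)."""
    return max(0.0, margin - max(abnormal_scores) + max(normal_scores))
```

Only the maximum score of each bag enters the loss, which is exactly why video-level labels suffice: the model never needs to know which segment inside the abnormal video contains the anomaly.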

Alternative weakly supervised approaches include alternating optimization frameworks that iteratively refine pseudo-labels using graph convolutional networks (GCNs) and self-training strategies. Some methods combine open-set and weakly supervised learning to detect unknown anomalies with minimal labeling effort.

Semi-Supervised Video Anomaly Detection Methods

Semi-supervised methods rely solely on normal samples for training, making them more practical for real-world applications where labeled anomalies are scarce. These methods typically use autoencoders or generative adversarial networks (GANs) to model normal event distributions and identify deviations as anomalies.

Improved Autoencoder/GAN-Based Methods

Reconstruction-Based Methods

Traditional convolutional autoencoders (CAEs) often generalize too well, reconstructing anomalies with low error. To mitigate this, memory-augmented autoencoders (MemAE) store typical normal patterns in a memory module, ensuring that only normal features are reconstructed. Other enhancements include skip connections, multi-level memory modules, and hybrid architectures that combine reconstruction with prediction.
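The memory-addressing step at the heart of MemAE-style models can be sketched as a soft lookup over stored prototypes. This is a simplified illustration (the shrinkage threshold and helper names are our assumptions, not the exact published formulation): the encoder's query is rebuilt as a sparse weighted sum of memory items, so anomalous features can only be reconstructed from normal prototypes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def memory_read(query, memory, shrink=0.02):
    """Soft-address the memory: softmax over cosine similarities, with
    small weights hard-shrunk to zero so the output is rebuilt from only
    a few prototypical normal patterns stored in `memory`."""
    sims = [cosine(query, m) for m in memory]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    weights = [w if w > shrink else 0.0 for w in weights]  # sparsify
    norm = sum(weights) or 1.0
    weights = [w / norm for w in weights]
    # Reconstruct the query as a weighted sum of memory items.
    return [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(len(query))]
```

Because the decoder only ever sees combinations of stored normal patterns, an anomalous input cannot be reproduced faithfully and its reconstruction error stays high.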

Prediction-Based Methods

Prediction-based approaches forecast future frames using past observations. Abnormal events lead to higher prediction errors due to their deviation from learned normal patterns. Advanced techniques include hierarchical spatio-temporal graph convolutional networks for skeleton-based anomaly detection and transformer-based models for improved temporal modeling. Some methods employ dual discriminators to assess both spatial and temporal consistency.
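A common way to turn prediction error into a frame-level anomaly score is via PSNR, normalized per video. The sketch below is a generic illustration of that scoring step (frames are flattened pixel lists; the normalization scheme is the usual min-max form, not tied to one specific paper):

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and a real frame
    (flattened pixel lists); lower PSNR means a worse prediction."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float('inf') if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

def anomaly_scores(psnrs):
    """Min-max normalize the PSNRs over a video and invert them, so
    frames the model predicts poorly receive scores close to 1."""
    lo, hi = min(psnrs), max(psnrs)
    if hi == lo:
        return [0.0 for _ in psnrs]
    return [1 - (p - lo) / (hi - lo) for p in psnrs]
```

Thresholding these normalized scores then yields the final normal/abnormal decision per frame.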

Hybrid Reconstruction-Prediction Methods

Several studies combine reconstruction and prediction to improve anomaly detection. For instance, one framework uses reconstructed optical flow as a condition for frame prediction, where poor flow reconstruction exacerbates prediction errors for anomalies. Another approach employs separate branches for reconstruction and prediction, fusing their outputs for final anomaly scoring.

Discriminator-Based Methods

Instead of relying solely on reconstruction or prediction errors, some methods leverage GAN discriminators to measure the similarity between generated and real samples. Improved variants include Wasserstein GANs for stable training and self-supervised discriminators that classify rotated frames to enhance discrimination.

Probability and Decision Model-Based Methods

These methods model the latent feature distributions of normal samples. Gaussian mixture models (GMMs) and variational autoencoders (VAEs) estimate normal feature densities, while one-class support vector machines (OC-SVMs) and support vector data description (SVDD) define decision boundaries around normal samples.
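As a minimal instance of the density-estimation idea, a single diagonal Gaussian can be fitted to normal training features and used to score test samples by log-likelihood (a full GMM adds a weighted mixture of such components; this single-component sketch and its helper names are our simplification):

```python
import math

def fit_diag_gaussian(samples):
    """Fit per-dimension mean and variance to normal training features."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    var = [max(sum((s[i] - mean[i]) ** 2 for s in samples) / n, 1e-6)
           for i in range(d)]  # floor the variance for numerical stability
    return mean, var

def gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal Gaussian; low values flag anomalies."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))
```

A test feature whose log-density falls below a validation-chosen threshold is declared anomalous; OC-SVM and SVDD replace the density with an explicit boundary around the same normal samples.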

Few-Shot Learning Methods

Few-shot learning techniques adapt anomaly detection models to new scenarios with limited labeled data. Transfer learning approaches fine-tune pre-trained models on target domains, while meta-learning frameworks learn to quickly adapt to new tasks using episodic training.

Unsupervised Video Anomaly Detection Methods

Unsupervised methods eliminate the need for labeled training data, making them highly practical for real-world deployment. These approaches typically generate pseudo-labels or synthetic samples to train detection models.

Pseudo-Label Generation Methods

Initial pseudo-labels are obtained using unsupervised detectors such as isolation forests or subspace clustering. These labels are then used to train a supervised model, which is refined iteratively through self-training. For example, one method updates the pseudo-labels by selecting high-confidence normal and abnormal samples based on the current model's predictions.
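One self-training round of that kind of high-confidence selection can be sketched as follows (the quantile thresholds and the `None` convention for dropped samples are illustrative choices, not a specific paper's recipe):

```python
def refine_pseudo_labels(scores, low_q=0.2, high_q=0.8):
    """One self-training round over per-segment anomaly scores: segments
    scoring at or below the low quantile become pseudo-normal (0), at or
    above the high quantile pseudo-abnormal (1); everything in between is
    dropped (None) and excluded from the next training round."""
    ranked = sorted(scores)
    lo = ranked[int(low_q * (len(ranked) - 1))]
    hi = ranked[int(high_q * (len(ranked) - 1))]
    return [0 if s <= lo else 1 if s >= hi else None for s in scores]
```

Alternating between retraining the detector on the kept samples and re-running this selection gradually cleans the noisy initial labels.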

Pseudo-Sample Construction Methods

Self-supervised learning defines auxiliary tasks to create pseudo-samples. One approach uses temporal consistency to identify normal and abnormal frames, while another designs spatial-temporal jigsaw puzzles to learn discriminative features. Contrastive learning frameworks further enhance feature separation between normal and anomalous events.
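The pseudo-sample idea can be illustrated with temporal shuffling: an intact clip serves as a pseudo-normal example and a reordered copy as a pseudo-anomaly. This is a toy sketch of that construction (frame identifiers stand in for real frames; published jigsaw methods also permute spatial patches):

```python
import random

def make_jigsaw_sample(clip, shuffle=True, seed=None):
    """Build one self-supervised training pair from a clip (a list of at
    least two distinct frame identifiers): the intact clip is labeled
    pseudo-normal (0), a temporally shuffled copy pseudo-abnormal (1)."""
    if not shuffle:
        return list(clip), 0
    rng = random.Random(seed)
    frames = list(clip)
    while frames == list(clip):  # ensure the order actually changes
        rng.shuffle(frames)
    return frames, 1
```

A classifier trained to spot the shuffled clips learns temporal regularities of normal motion without any manual labels, and its confidence can be reused as an anomaly score at test time.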

Benchmark Datasets for Video Anomaly Detection

Several datasets are widely used to evaluate VAD methods:

• UCSD: Contains pedestrian scenes with anomalies like bikes and skateboards.

• CUHK-Avenue: Features simulated anomalies such as running and throwing objects.

• ShanghaiTech: Includes diverse outdoor scenes with complex anomalies.

• UCF-Crime: A large-scale dataset with real-world criminal activities.

• Street-Scene: Focuses on traffic violations in a community setting.

Performance comparisons show that supervised methods achieve the highest accuracy but require extensive labeling. Semi-supervised approaches offer a balance between performance and practicality, while unsupervised methods are steadily closing the accuracy gap.

Future Research Directions

Multi-Modal Fusion

Integrating video, audio, and infrared data can improve detection under varying lighting and weather conditions. Audio cues help identify events like screams, while infrared data aids in low-light scenarios.

3D Scene Understanding

Incorporating depth information and 3D spatial reasoning can enhance anomaly detection by addressing scale variations and occlusions.

Semantic Interpretability

Advanced video understanding models can provide semantic explanations for detected anomalies, improving transparency and usability.

Adaptive Scene Perception

Online learning techniques can enable models to adapt dynamically to new environments, leveraging scene context for better anomaly discrimination.

Conclusion

This review systematically categorizes and analyzes deep learning-based VAD methods, highlighting their strengths and limitations. While significant progress has been made, challenges remain in handling complex real-world scenarios. Future work should focus on multi-modal fusion, 3D reasoning, interpretability, and adaptive learning to advance the field further.

For more details, refer to the original paper: https://doi.org/10.19734/j.issn.1001-3695.2024.06.0241
