Integrating Contrastive Learning and Dual-Stream Networks for Knowledge Graph Summarization Models
The rapid development of the internet has led to an explosive growth of information, making it increasingly difficult for individuals to identify and filter valuable content. In this era of information overload, summarization has emerged as a crucial tool for quickly assessing the value of documents or articles, helping users decide whether to engage in deeper reading. However, not all texts come with manually written summaries, highlighting the urgent need for research into automatic summarization algorithms.
Text summarization algorithms are broadly categorized into extractive and abstractive approaches. Extractive summarization involves selecting key sentences or text units from the original content and reorganizing them to form a summary. While this method is straightforward, it often produces summaries that lack fluency and coherence. In contrast, abstractive summarization employs natural language generation techniques to create summaries that may include expressions not explicitly present in the source text, resulting in more natural and contextually rich outputs. With advancements in deep learning, pretrained models such as BERT and GPT have achieved remarkable success in natural language processing (NLP). Consequently, leveraging these models for downstream tasks like text summarization has become increasingly popular.
Pretrained models have significantly enhanced the ability to represent text, leading to more precise and concise summaries. However, despite these improvements, challenges remain in fully capturing key information and maintaining factual consistency. For instance, these models may omit critical details or produce summaries with subject-verb disagreements, undermining the reliability of the generated content.
Knowledge graphs offer a structured and efficient way to represent data and have been widely applied in NLP tasks, particularly in text summarization. Unlike purely text-based approaches, knowledge graphs extract entities and their relationships, providing richer semantic information that enhances the accuracy of summaries. For example, analyzing relationships between people, geographical locations, and other key factors helps identify core content, leading to more precise and comprehensive summaries. This approach not only improves the relevance of summaries but also enhances readability and utility.
Knowledge graph encoders convert structured knowledge into feature representations of entities, while text encoders process unstructured textual data. These two components can be viewed as distinct modalities. Models that integrate knowledge graphs with text summarization aim to improve performance by learning multimodal representations of both text and knowledge graphs. Current approaches typically employ separate encoders for text and knowledge graphs, followed by feature fusion.
However, this multimodal framework faces several challenges:
- Feature Space Misalignment: Knowledge graph entity features and text features reside in different spaces, making effective fusion difficult.
- Semantic Deviation: Knowledge graphs are often constructed using tools like OpenIE, which may extract information that deviates from the original text’s meaning. For example, the sentence “He prefers to play football” might be incorrectly parsed as the triple (He, play, football), losing the nuance of preference.
- Limited Entity Features: When knowledge graphs lack sufficient entity information, the performance of fusion-based summarization models suffers.
To address these issues, this study introduces a novel summarization model, KGDR-CLSUM, which integrates contrastive learning with a dual-stream network. The model first encodes unstructured text and structured knowledge separately using dedicated encoders. Then, an alignment strategy based on contrastive learning and dual-stream networks ensures that the encoded entity and text features are properly aligned. Finally, a multimodal encoder fuses these features to generate summaries.
To mitigate semantic deviations introduced by OpenIE and similar tools, the model incorporates a momentum distillation strategy. During training, pseudo-targets generated by the momentum distillation model serve as additional supervision signals, helping the model overcome noise in the extracted knowledge. For instance, in the earlier example, the momentum distillation model might generate the pseudo-target (He, enjoys, football), which better preserves the original meaning.
Related Work
Over the past decade, abstractive text summarization has seen significant advancements. In 2015, Rush et al. introduced the first sequence-to-sequence (seq2seq) model for abstractive summarization, demonstrating promising results on the Gigaword and DUC-2004 datasets. However, the model struggled with long texts and complex sentence structures, highlighting limitations in capturing deep semantic information.
To address these shortcomings, Chopra et al. incorporated attention mechanisms with recurrent neural networks (RNNs), improving the fluency and coherence of generated summaries. Later, See et al. proposed a pointer-generator network to handle out-of-vocabulary words and repetition issues, achieving strong performance on the CNN/Daily Mail dataset. Despite these improvements, challenges remained in maintaining logical consistency and factual accuracy, particularly for long and complex texts.
The introduction of Transformer architectures marked a turning point in text summarization. Models like T5, BART, and PEGASUS leveraged self-attention mechanisms to achieve superior performance in various NLP tasks. PEGASUS, in particular, excelled in summarization by generating coherent and precise summaries. However, these models still faced issues with hallucination and semantic inconsistencies, underscoring the need for more robust solutions.
Knowledge graph-enhanced summarization models emerged as a promising direction. Fernandes et al. applied graph neural networks to structured summarization, while Kryscinski et al. demonstrated the potential of integrating knowledge graphs with Transformer models. Zhu et al. further improved factual consistency by incorporating graph attention mechanisms and an error-correction module. Huang et al. introduced ASGARD, a knowledge-graph-enhanced framework that improved summary accuracy through cloze-style training.
Despite these advances, existing models struggled with aligning knowledge graph features with textual semantics, often resulting in incomplete or inaccurate summaries. Additionally, they did not adequately address scenarios where structured knowledge was sparse. These challenges motivated the development of KGDR-CLSUM, which introduces momentum distillation and feature alignment techniques to enhance summarization quality.
Model Architecture
KGDR-CLSUM consists of several key components:
- Knowledge Graph Encoder: Processes structured knowledge using graph attention networks (GAT).
- Text Encoder: Encodes unstructured text using a Transformer-based architecture.
- Dual-Stream Network: Facilitates interaction between text and knowledge graph features.
- Contrastive Learning Module: Aligns features from different modalities.
- Momentum Distillation Model: Generates pseudo-targets to mitigate noise in knowledge graphs.
- Multimodal Encoder and Decoder: Fuses aligned features and generates summaries.
The model employs a Transformer-based architecture due to its ability to capture long-range dependencies and process information in parallel. The knowledge graph encoder is deeper than the text encoder to ensure robust feature extraction, while the multimodal encoder is designed to facilitate thorough fusion.
Knowledge Graph and Text Feature Representation
The text encoder is based on PEGASUS, a seq2seq pretrained model with a 12-layer Transformer encoder. For efficiency, the first six layers are used to extract text features. Special tokens [CLS] and [SEP] mark the beginning and end of input text, with the [CLS] token’s hidden state serving as the output representation.
The knowledge graph encoder consists of an offline module and a four-layer GAT. The offline module uses OpenIE to extract triples (subject, relation, object) from the text, constructing a knowledge graph where nodes represent entities and edges denote relationships. BERT is used as a frozen embedding layer to initialize node representations, with [CLS] tokens capturing global features. The GAT refines these representations using attention mechanisms to aggregate information from neighboring nodes.
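The GAT refinement step can be illustrated with a minimal single-head NumPy sketch. The parameter names (`W`, `a`) and the single-head, dense-loop form are simplifying assumptions for clarity, not the paper's implementation:

```python
import numpy as np

def gat_layer(H, adj, W, a, alpha=0.2):
    """Single-head GAT-style aggregation (illustrative sketch).

    H:   (N, F) node features;  adj: (N, N) 0/1 adjacency with self-loops
    W:   (F, F') projection;    a:   (2*F',) attention vector
    """
    Z = H @ W                                    # project node features
    N = Z.shape[0]
    e = np.zeros((N, N))
    for i in range(N):                           # e[i, j] = a^T [z_i || z_j]
        for j in range(N):
            e[i, j] = np.concatenate([Z[i], Z[j]]) @ a
    e = np.where(e > 0, e, alpha * e)            # LeakyReLU
    e = np.where(adj > 0, e, -1e9)               # mask non-neighbors
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # softmax over neighbors
    return np.maximum(att @ Z, 0), att           # ReLU of aggregated features
```

Each node's new representation is an attention-weighted sum over its graph neighbors, so entity features absorb relational context layer by layer.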
Feature Alignment via Contrastive Learning
Unlike traditional early or late fusion approaches, KGDR-CLSUM aligns knowledge graph and text features before fusion. The dual-stream network employs cross-attention mechanisms to compute interactions between modalities:
- Entity Cross-Attention: Uses knowledge graph features to guide text feature fusion.
- Text Cross-Attention: Uses text features to guide knowledge graph feature fusion.
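A single-head scaled dot-product sketch of one direction of this cross-attention (entity-guided text fusion); the weight names are illustrative assumptions, and the text-guided direction simply swaps the two inputs:

```python
import numpy as np

def cross_attention(Q_feats, KV_feats, Wq, Wk, Wv):
    """Single-head cross-attention sketch: one stream queries the other.

    Q_feats:  (Lq, d) features of the querying modality (e.g. text tokens)
    KV_feats: (Lk, d) features of the other modality (e.g. KG entities)
    """
    Q = Q_feats @ Wq
    K = KV_feats @ Wk
    V = KV_feats @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    att = np.exp(scores)
    att /= att.sum(axis=-1, keepdims=True)        # softmax over the other stream
    return att @ V                                # e.g. entity-informed text features
```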
Contrastive learning ensures that positive pairs (aligned features) have high similarity while negative pairs (misaligned features) have low similarity. This strategy improves multimodal representations and mitigates limitations in knowledge graph features.
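A symmetric InfoNCE-style objective is one common way to realize this; the sketch below is a NumPy illustration under the assumption that matched text/knowledge-graph pairs sit on the diagonal of the similarity matrix, not the paper's exact loss:

```python
import numpy as np

def contrastive_loss(text_emb, kg_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (sketch). Row i of each matrix is one
    text/knowledge-graph pair; matched rows are positives, the rest negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = kg_emb / np.linalg.norm(kg_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature               # (B, B) cosine similarities

    def xent_diag(l):                            # cross-entropy, targets on diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Minimizing this pulls each text embedding toward its own graph embedding and pushes it away from the other samples in the batch.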
Momentum Distillation
Knowledge graphs constructed via OpenIE often contain noise. The momentum distillation model generates pseudo-targets during training, providing additional supervision to reduce noise. For example, if OpenIE extracts (He, play, football), the momentum model might generate (He, enjoys, football), preserving the original intent. The momentum model updates its parameters using exponential moving averages, ensuring stable training.
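The exponential-moving-average update itself is simple; a minimal sketch (the momentum value 0.995 is an illustrative assumption):

```python
import numpy as np

def momentum_update(student_params, teacher_params, m=0.995):
    """EMA update for the momentum (teacher) model that produces pseudo-targets.
    The teacher drifts slowly toward the student, smoothing out training noise."""
    return [m * t + (1.0 - m) * s for s, t in zip(student_params, teacher_params)]
```

Because each update moves the teacher only a fraction `(1 - m)` toward the student, its pseudo-targets change slowly, which is what makes them a stable extra supervision signal.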
Multimodal Fusion and Summary Generation
The aligned features are fed into a multimodal encoder, which uses cross-attention to fuse text and knowledge graph representations. The decoder, based on PEGASUS, generates summaries by maximizing the probability of correct tokens while minimizing contrastive and distillation losses.
Experimental Results
KGDR-CLSUM was evaluated on the CNN/Daily Mail and XSum datasets, which differ in summary style (extractive vs. abstractive) and text length. The model outperformed baseline models, including PEGASUS, BART, and knowledge-graph-enhanced models like FASum and KSDASum.
On CNN/Daily Mail, KGDR-CLSUM improved ROUGE-1, ROUGE-2, and ROUGE-L scores by 3.03%, 3.42%, and 2.56%, respectively, over PEGASUS-base. On XSum, the improvements were even more pronounced (7.54%, 8.78%, and 8.51%). Human evaluations also rated KGDR-CLSUM higher than ChatGPT, particularly in factual consistency and fluency.
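For intuition about what these metrics measure, here is a deliberately simplified ROUGE-1 F1 in pure Python (whitespace tokenization, no stemming); published scores like those above come from the official ROUGE toolkit and will differ slightly:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between a candidate
    summary and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())     # matching unigrams, count-clipped
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-L scores the longest common subsequence instead of n-gram overlap.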
Ablation studies confirmed the contributions of each module:
- Removing contrastive learning caused the largest performance drop, underscoring its importance in feature alignment.
- Omitting the dual-stream network also significantly reduced performance.
- Momentum distillation had a smaller but still meaningful impact, primarily in reducing noise.
Conclusion
KGDR-CLSUM introduces a novel framework for knowledge-graph-enhanced summarization, addressing key challenges in feature alignment and noise reduction. By integrating contrastive learning, dual-stream networks, and momentum distillation, the model generates more accurate and fluent summaries, particularly for abstractive tasks. Future work will focus on improving performance for extractive-style datasets and exploring additional applications of momentum distillation in NLP.
doi.org/10.19734/j.issn.1001-3695.2024.07.0304