Multi-modal Entity Alignment Model Based on Adaptive Fusion Technology
Introduction
Knowledge graphs (KGs) have become a fundamental tool for structuring and storing knowledge in large databases. They represent entities and their relationships through structured triples, providing a clear and intuitive way to capture attributes and connections between entities. Traditional KGs primarily rely on textual data, but the integration of multi-modal data—such as images—has significantly enhanced their richness and applicability. Multi-modal knowledge graphs (MMKGs), such as MMKG and RichPedia, incorporate visual data alongside structured triples, offering a more comprehensive representation of knowledge.
Entity alignment (EA) is a critical task in knowledge graph integration, aiming to identify equivalent entities across different KGs. This process addresses challenges such as varying naming conventions, multilingualism, and heterogeneous graph structures. Multi-modal entity alignment (MMEA) extends this task by leveraging visual content associated with entities, providing supplementary information beyond textual attributes. However, existing MMEA approaches face two major challenges: (1) modality imbalance, where certain modalities may be missing or noisy, and (2) difficulties in effectively integrating diverse modalities due to representation misalignment.
To address these challenges, this paper introduces the Multi-modal Adaptive Contrastive Learning for Entity Alignment (MACEA) model. MACEA employs a multi-modal variational autoencoder to actively complete missing modality information, a dynamic modality fusion method to integrate and complement different modalities, and an inter-modal contrastive learning technique to model interactions between modalities. These innovations enable MACEA to achieve superior performance in entity alignment tasks, as demonstrated by significant improvements in key evaluation metrics.
Background and Related Work
Traditional Entity Alignment
Entity alignment techniques have evolved from early rule-based methods to modern embedding-based approaches. Traditional methods often rely on structural similarities between entities, such as shared attributes or relationships. However, these methods struggle with heterogeneous KGs where structural differences are pronounced.
Embedding-based methods address this limitation by projecting entities into a low-dimensional vector space where similarity can be measured more effectively. Two primary approaches dominate this space:
- Translation-based methods, such as TransE, model relationships as translations in the embedding space, capturing structural patterns in triples.
- Graph neural network (GNN)-based methods, including Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), aggregate neighborhood information to enrich entity representations.
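To make the translation-based idea concrete, here is a minimal numpy sketch of the TransE scoring function. The vectors below are illustrative placeholders, not learned embeddings; real systems train them so that plausible triples score low.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility: a valid triple (h, r, t) should satisfy
    h + r ≈ t, so the L1 distance between h + r and t is the
    (lower-is-better) score."""
    return float(np.linalg.norm(h + r - t, ord=1))

# Toy example: relation r translates h exactly onto t, so the score is 0.
h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t = np.array([1.0, 1.0])
```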
Advanced techniques like parameter sharing and iterative learning further refine alignment accuracy by leveraging pseudo-labeled entity pairs or shared embeddings across KGs.
Multi-modal Entity Alignment
The integration of visual data into entity alignment has opened new avenues for improving accuracy. Early MMEA models focused on simple fusion strategies, such as concatenating visual and structural embeddings. However, these approaches often fail to account for modality-specific characteristics or imbalances.
Recent advancements have introduced more sophisticated fusion mechanisms:
- EVA employs modality-specific attention weights to dynamically adjust the importance of each modality.
- MSNEA uses visual features to guide the learning of relational and attribute embeddings while applying contrastive learning to enhance intra-modal representations.
- MCLEA minimizes the distribution gap between joint and single-modality embeddings using Kullback-Leibler (KL) divergence.
Despite these innovations, existing models still struggle with modality noise, missing data, and the complexity of inter-modal interactions. MACEA addresses these gaps by introducing adaptive fusion and contrastive learning techniques.
Methodology
Overview
MACEA consists of three core components:
- Multi-modal Variational Autoencoder (MVAE): Completes missing modality information to ensure robustness against incomplete data.
- Dynamic Modality Fusion (DMF): Adaptively weights and combines modalities based on their relevance and quality.
- Inter-modal Contrastive Learning (IAL): Aligns representations across modalities to reduce distribution gaps.
Multi-modal Knowledge Embedding
MACEA encodes entities using multiple modality-specific encoders:
Graph Neighborhood Structure Embedding
The model uses GATv2, an improved version of Graph Attention Networks, to capture structural information. GATv2 computes attention weights between entities, aggregating neighborhood features to generate enriched structural embeddings. This approach is particularly effective for complex relational patterns.
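A simplified single-head sketch of the GATv2 attention computation is shown below, assuming dense inputs for readability. The aggregation step here uses the raw features `H` for brevity; the actual layer also projects the neighbour messages, and production code would use an optimized implementation such as PyTorch Geometric's `GATv2Conv`.

```python
import numpy as np

def gatv2_layer(H, adj, W, a, alpha=0.2):
    """Single-head GATv2 aggregation (simplified numpy sketch).

    H: (n, d) node features; adj: (n, n) 0/1 adjacency with self-loops;
    W: (2*d, d_out) shared projection; a: (d_out,) attention vector.
    GATv2's key change over the original GAT is applying `a` *after*
    the LeakyReLU, which makes the attention function more expressive.
    """
    n = H.shape[0]
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                z = np.concatenate([H[i], H[j]]) @ W   # W [h_i || h_j]
                z = np.where(z > 0, z, alpha * z)      # LeakyReLU
                scores[i, j] = z @ a                   # a^T LeakyReLU(...)
    # Softmax over each node's neighbourhood (-inf entries become 0).
    scores -= scores.max(axis=1, keepdims=True)
    att = np.exp(scores)
    att /= att.sum(axis=1, keepdims=True)
    return att @ H, att          # aggregated features, attention weights
```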
Relation, Attribute, and Name Embedding
Textual information—such as entity names, attributes, and relations—is encoded using pre-trained word embeddings (e.g., GloVe) and processed through feedforward layers. These embeddings are treated as bag-of-words features to maintain consistency across modalities.
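The bag-of-words encoding amounts to averaging the pre-trained word vectors of an entity's tokens. The tiny 4-dimensional lookup table below is purely hypothetical (real GloVe vectors are typically 300-dimensional); in MACEA the resulting average is then passed through a feedforward layer.

```python
import numpy as np

# Hypothetical GloVe-style lookup table, for illustration only.
glove = {
    "paris":  np.array([0.1, 0.7, 0.2, 0.0]),
    "city":   np.array([0.3, 0.4, 0.1, 0.2]),
    "france": np.array([0.2, 0.6, 0.3, 0.1]),
}

def bow_embedding(text, table, dim=4):
    """Average the vectors of all in-vocabulary tokens (bag of words);
    out-of-vocabulary-only inputs fall back to a zero vector."""
    vecs = [table[w] for w in text.lower().split() if w in table]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```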
Visual Information Embedding
Visual data is encoded using pre-trained models like ResNet-152 or CLIP. The final layer outputs of these models are transformed into visual embeddings through a feedforward layer, ensuring compatibility with other modalities.
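The projection step can be sketched as a single feedforward layer over the frozen backbone's output; the ReLU nonlinearity here is an assumption for illustration, since the paper only specifies a feedforward layer.

```python
import numpy as np

def project_visual(feat, W, b):
    """Map a frozen image-encoder feature (e.g. the 2048-d pooled output
    of ResNet-152, or a CLIP image embedding) into the shared entity
    space with one learned feedforward layer; the backbone stays frozen
    and only W and b are trained."""
    return np.maximum(feat @ W + b, 0.0)   # linear layer + ReLU
```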
Dynamic Modality Fusion
To address the challenge of integrating diverse modalities, MACEA introduces Dynamic Modality Fusion (DMF). DMF computes global weights for each modality, reflecting its importance in the alignment task. The fused embedding is a weighted concatenation of individual modality embeddings, allowing the model to emphasize high-quality modalities while suppressing noisy or redundant ones.
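A minimal sketch of this weighted concatenation, assuming the global weights come from a softmax over learnable per-modality logits and that each modality's embeddings are L2-normalised before scaling (both reasonable readings of the description above, not verbatim details from the paper):

```python
import numpy as np

def dynamic_fusion(modal_embs, logits):
    """Weighted concatenation of modality embeddings.

    modal_embs: dict name -> (n, d) embedding matrix;
    logits: dict name -> scalar learnable weight.
    A softmax over the logits yields one global importance weight per
    modality; each (row-normalised) embedding block is scaled by its
    weight and the blocks are concatenated."""
    names = sorted(modal_embs)
    w = np.exp([logits[m] for m in names])
    w /= w.sum()
    parts = []
    for m, wm in zip(names, w):
        E = modal_embs[m]
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        parts.append(wm * E)
    return np.concatenate(parts, axis=1), dict(zip(names, w))
```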
A contrastive learning framework further refines the fusion process by maximizing the similarity between aligned entity pairs and minimizing it for negative samples. This ensures that the joint embedding space preserves both structural and semantic relationships.
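The contrastive objective can be illustrated with an InfoNCE-style loss for one aligned seed pair; the cosine similarity and temperature value here are conventional choices, not details confirmed by the paper.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss for one aligned pair: pull the aligned entity
    embeddings together, push the anchor away from negative samples."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))
```

The loss is small when the anchor is close to its aligned counterpart and far from the negatives, and grows as that ordering is violated.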
Inter-modal Contrastive Learning
Inter-modal Contrastive Learning (IAL) is designed to align representations across modalities. By minimizing the KL divergence between joint and single-modality embeddings, IAL ensures that complementary information is effectively shared. This component is crucial for handling scenarios where one modality may dominate or where modalities exhibit significant distribution gaps.
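The KL term can be sketched as follows, treating the joint and single-modality similarity scores as logits over candidate entities; the softmax-with-temperature form is an assumption for illustration.

```python
import numpy as np

def kl_alignment(joint_logits, modal_logits, tau=1.0):
    """KL(joint || modality): softmax both similarity-score vectors into
    distributions over candidates, then measure how far the
    single-modality prediction drifts from the joint one."""
    def softmax(x):
        x = x / tau
        e = np.exp(x - x.max())
        return e / e.sum()
    p = softmax(joint_logits)
    q = softmax(modal_logits)
    return float(np.sum(p * np.log(p / q)))
```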
Modality Completion via MVAE
The Multi-modal Variational Autoencoder (MVAE) addresses missing modalities by reconstructing absent data from available sources. The model optimizes a combination of reconstruction loss and KL divergence, ensuring that the latent space approximates a Gaussian distribution. This enables robust performance even when certain modalities are unavailable.
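The MVAE objective combines the two terms described above; a minimal sketch, assuming mean-squared-error reconstruction and the standard closed-form KL to an isotropic Gaussian:

```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """VAE-style objective: MSE reconstruction of the (possibly missing)
    modality plus KL(N(mu, sigma^2) || N(0, I)), which keeps the latent
    space close to a standard Gaussian."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl
```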
Experiments
Datasets
MACEA is evaluated on three benchmark datasets:
- DBP15K: A cross-lingual entity alignment dataset with subsets for Chinese-English (ZH-EN), Japanese-English (JA-EN), and French-English (FR-EN) pairs.
- FBDB15K and FBYG15K: Multi-modal datasets pairing Freebase with DBpedia and with YAGO, respectively, featuring varying proportions of pre-aligned entity pairs (20%, 50%, and 80%).
Implementation Details
The model uses a hidden dimension of 300 and is trained for 500 epochs with early stopping. The AdamW optimizer is employed with a batch size of 3500. CLIP serves as the visual encoder, and textual features are processed using GloVe embeddings.
Evaluation Metrics
Performance is measured using:
- Hits@n: The probability that the correct entity appears in the top-n ranked candidates.
- Mean Reciprocal Rank (MRR): The average reciprocal rank of the correct entity.
- Mean Rank (MR): The average rank of the correct entity (lower is better).
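Given the 1-based rank of each gold entity among its candidates, all three metrics reduce to simple averages:

```python
import numpy as np

def ranking_metrics(ranks, ns=(1, 10)):
    """Compute Hits@n, MRR, and MR from the 1-based rank of each
    correct entity in its candidate ranking."""
    ranks = np.asarray(ranks, dtype=float)
    out = {f"Hits@{n}": float(np.mean(ranks <= n)) for n in ns}
    out["MRR"] = float(np.mean(1.0 / ranks))
    out["MR"] = float(np.mean(ranks))
    return out
```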
Results
Main Experiments
MACEA outperforms baseline models (EVA, MSNEA, and MCLEA) across all datasets. On DBP15K, MACEA achieves a Hits@1 of 0.758 (FR-EN), 0.727 (ZH-EN), and 0.731 (JA-EN), representing improvements of 5.71%, 1.67%, and 2.23% over MCLEA, respectively. Similar gains are observed for MRR and Hits@10.
On FBDB15K and FBYG15K, MACEA demonstrates even more significant improvements, particularly in low-resource settings (Seed=0.2). For example, Hits@1 improves by 29.92% and 39.59% over MCLEA on these datasets, highlighting the model’s robustness to sparse alignment seeds.
Iterative Training
When augmented with iterative training, MACEA further enhances its performance. For instance, on DBP15K (FR-EN), the model achieves a Hits@1 of 0.811 without surface forms and 0.995 with surface forms, demonstrating the effectiveness of pseudo-labeling and incremental learning.
Ablation Studies
Ablation experiments confirm the contributions of each MACEA component:
- Removing MVAE (MACEA/del VAE) reduces performance by 1.38%, underscoring its role in handling missing modalities.
- Disabling DMF (MACEA/del DMF) leads to a 1.6% drop in Hits@1, highlighting the importance of adaptive fusion.
- Excluding IAL (MACEA/del IAL) results in marginal degradation, suggesting that while contrastive learning is beneficial, the model can partially compensate through other mechanisms.
Discussion
MACEA’s success can be attributed to its holistic approach to multi-modal challenges:
- Modality Completion: MVAE ensures robustness against missing data, a common issue in real-world KGs.
- Adaptive Fusion: DMF dynamically balances modality contributions, mitigating noise and redundancy.
- Contrastive Alignment: IAL bridges representation gaps, enabling effective knowledge transfer across modalities.
Comparisons with other models reveal key insights:
- ACK-MMEA focuses heavily on attribute consistency but underperforms by 15.11% in Hits@1 due to neglecting relational and visual cues.
- MMEA struggles with modality imbalance, resulting in a 19.39% lower Hits@1 than MACEA.
- MEAformer achieves competitive results but lacks explicit mechanisms for handling missing data, leading to a 1.28% performance gap.
Conclusion
MACEA represents a significant advancement in multi-modal entity alignment by addressing modality imbalance and fusion challenges. Its integration of MVAE, DMF, and IAL provides a robust framework for leveraging diverse data sources, achieving state-of-the-art performance across multiple benchmarks. Future work could explore extending these techniques to other modalities (e.g., audio or video) and refining the fusion process for even greater accuracy.
doi.org/10.19734/j.issn.1001-3695.2024.05.0187