Edge Perturbation-Based Link Prediction Interpretation for Knowledge Graphs
Introduction
Knowledge graphs (KGs) have become a fundamental tool for representing structured information across various domains, including healthcare, finance, and recommendation systems. A KG consists of entities (nodes) and relationships (edges) that connect these entities, typically represented as triples in the form (head entity, relation, tail entity). Link prediction, a critical task in KG completion, aims to infer missing or potential relationships between entities. While deep learning models have significantly improved link prediction accuracy, their “black-box” nature makes it difficult to understand how predictions are derived. This lack of interpretability limits their adoption in high-stakes applications where transparency is essential, such as medical diagnosis or financial decision-making.
To address this challenge, researchers have proposed various interpretability methods for link prediction. However, existing approaches often suffer from limitations such as model dependency, poor generalization, and low accuracy in generated explanations. Some methods are tailored to specific models like graph neural networks (GNNs), while others produce disconnected subgraphs that lack intuitive explanations. Reinforcement learning-based techniques, though effective, are sensitive to sparse data and may generate unreliable paths if the underlying predictions are incorrect.
This article presents Edge Perturbation-Based Link Prediction Interpretation (EP-LPI), a model-agnostic method that generates human-understandable explanations for link prediction results. Unlike previous approaches, EP-LPI does not rely on a specific embedding model and instead focuses on identifying the most influential paths between entities by analyzing edge importance. The method leverages bidirectional search and subgraph construction to efficiently compute explanations while maintaining high accuracy.
Background and Problem Definition
Knowledge Graphs and Link Prediction
A knowledge graph G = (E, V, F) consists of a set of edges E, nodes V, and facts F, where each fact is represented as a triple (h, r, t). Link prediction involves predicting the likelihood of a missing relationship r between two entities h and t. Most embedding models compute a confidence score s(h, r, t) indicating the plausibility of the triple. However, these models do not inherently explain why a particular relationship is predicted.
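As a concrete illustration of such a confidence score, the sketch below implements a TransE-style scorer s(h, r, t) = −‖h + r − t‖, where a higher (less negative) score means a more plausible triple. The entity and relation vectors here are random stand-ins, not trained embeddings, and the names are illustrative:

```python
import numpy as np

# Toy embeddings (assumptions for illustration; a real model learns these).
rng = np.random.default_rng(0)
dim = 8
entities = {"aspirin": rng.normal(size=dim), "headache": rng.normal(size=dim)}
relations = {"may_treat": rng.normal(size=dim)}

def score(h: str, r: str, t: str) -> float:
    """TransE-style plausibility: negative L2 distance between h + r and t."""
    return -float(np.linalg.norm(entities[h] + relations[r] - entities[t]))

s = score("aspirin", "may_treat", "headache")  # always <= 0; closer to 0 is more plausible
```

Any scorer with this (h, r, t) → score interface fits the black-box role EP-LPI assumes.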
Challenges in Interpretability
Existing interpretability methods face several challenges:
- Model Dependency: Many techniques are designed for specific models (e.g., GNNs) and cannot generalize to other architectures.
- Sparsity and Connectivity: Large KGs often contain sparse regions, making it difficult to find meaningful paths between entities.
- Explanation Quality: Some methods produce subgraphs that are hard to interpret, while others generate paths that may not reflect the true reasoning behind predictions.
EP-LPI addresses these issues by introducing a perturbation-based approach that quantifies the influence of each edge on the prediction and constructs coherent explanation paths.
Methodology
Core Idea: Edge Perturbation for Interpretability
EP-LPI is based on the principle of causal contribution: if removing an edge significantly reduces the confidence score of a predicted triple, that edge is considered important for the prediction. The importance of an edge e_j is defined as the drop in the confidence score when e_j is removed:

Δ_{e_j} = s(h, r, t | G^c) − s(h, r, t | G^c \ {e_j})

Here, s(h, r, t | G^c) is the confidence score computed on the original subgraph G^c, and s(h, r, t | G^c \ {e_j}) is the score after removing e_j; a large positive Δ_{e_j} marks e_j as influential.
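The perturbation loop can be sketched as follows. Here `score_fn` stands for any black-box scorer over a set of edges, and the toy scorer and edge format are illustrative assumptions, not the paper's implementation:

```python
def edge_importance(score_fn, subgraph_edges, target):
    """For each edge e_j in the local subgraph G^c, remove it, re-score the
    target triple, and record the drop in confidence (larger = more important)."""
    h, r, t = target
    base = score_fn(h, r, t, subgraph_edges)                # score on full G^c
    deltas = {}
    for e in subgraph_edges:
        perturbed = [x for x in subgraph_edges if x != e]   # G^c \ {e_j}
        deltas[e] = base - score_fn(h, r, t, perturbed)     # drop when e_j removed
    return deltas

# Toy scorer: confidence grows with the number of edges touching h or t.
def toy_score(h, r, t, edges):
    return sum(1.0 for (a, _, b) in edges if h in (a, b) or t in (a, b))

edges = [("A", "r1", "B"), ("B", "r2", "C"), ("A", "r3", "C")]
imp = edge_importance(toy_score, edges, ("A", "r", "C"))
```

In a real setting, `score_fn` would re-evaluate the fine-tuned embedding model on the perturbed subgraph rather than count edges.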
Key Steps in EP-LPI
1. Subgraph construction: For a given triple (h, r, t), EP-LPI first extracts all paths between h and t using breadth-first search (BFS). Neighboring nodes along these paths are included to form a training subgraph g_train, which captures the local context relevant to the prediction.
2. Edge importance calculation: The embedding model is fine-tuned on g_train. Each edge in the subgraph is then perturbed (removed), and the change in prediction confidence is recorded.
3. Bidirectional beam search: A greedy algorithm searches for the most influential edges from both h to t and t to h. The top-k edges with the highest importance scores are selected to form the final explanation path.
This approach ensures that the explanation is both intuitive (a connected path) and faithful (reflecting the model’s reasoning).
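The path-extraction step that seeds subgraph construction can be sketched with a bounded BFS. The graph representation and function names are assumptions for illustration; the real system would additionally pull in neighboring nodes to form g_train:

```python
from collections import deque

def bfs_paths(edges, h, t, max_len=3):
    """Return all simple paths (as edge lists) from h to t with <= max_len edges."""
    adj = {}
    for (a, r, b) in edges:
        adj.setdefault(a, []).append((r, b))
    paths, queue = [], deque([(h, [])])
    while queue:
        node, path = queue.popleft()
        if node == t and path:
            paths.append(path)      # reached the tail entity; record the path
            continue
        if len(path) >= max_len:
            continue                # respect the length bound
        for (r, b) in adj.get(node, []):
            # Skip nodes already visited along this path to keep it simple.
            if b != h and all(b != e[0] for e in path):
                queue.append((b, path + [(node, r, b)]))
    return paths

edges = [("h", "r1", "x"), ("x", "r2", "t"), ("h", "r3", "t")]
ps = bfs_paths(edges, "h", "t")  # finds both the direct and the 2-hop path
```

The union of edges on these paths (plus their neighbors) gives the local context on which the embedding model is fine-tuned and perturbed.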
Advantages Over Existing Methods
• Model Agnosticism: EP-LPI works with any embedding model (e.g., TransE, RotatE, DistMult) since it treats the model as a black box during perturbation.
• Efficiency: By focusing on local subgraphs, the method reduces computational overhead compared to full-graph approaches.
• Bidirectional Search: Unlike unidirectional methods, EP-LPI leverages inverse relations (e.g., “may treat” vs. “may be treated by”) to improve path discovery in sparse regions.
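One simple way to realize this inverse-relation augmentation, before searching from both endpoints, is to add a reversed counterpart for every edge so a search started at t can traverse edges that only point toward it. The naming convention below (`r^-1`) is an illustrative assumption:

```python
def add_inverse_edges(edges):
    """Augment a triple list with inverse edges (t, r^-1, h) for each (h, r, t)."""
    inverse = [(t, f"{r}^-1", h) for (h, r, t) in edges]
    return edges + inverse

kg = [("tacrolimus", "may_treat", "lichen_simplex")]
augmented = add_inverse_edges(kg)
# augmented also contains ("lichen_simplex", "may_treat^-1", "tacrolimus"),
# so a backward search from the tail entity can reach the head.
```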
Experiments and Results
Datasets
EP-LPI was evaluated on two datasets:
- Family-rr: A KG containing familial relationships, used to validate explanation accuracy against ground-truth paths.
- BIOS: A biomedical KG with millions of triples, used to test real-world applicability in drug repurposing.
Comparative Performance
EP-LPI was compared against six state-of-the-art interpretability methods:
- PGPR: A reinforcement learning-based path search method.
- ELPE: Combines graph transformers with reinforcement learning.
- CRIAGE: Uses sensitivity analysis to identify influential facts.
- CAFE: Leverages user profiles to guide path search.
- KGEP: Identifies critical entities and edges.
- PaGE-Link: Applies masking to prune irrelevant edges.
On Family-rr, EP-LPI achieved:
• Accuracy (ACC): 63.71%, outperforming the next best method (KGEP at 62.27%).
• Area Under the Precision-Recall Curve (AUPR): 76.31%, a 1.9% improvement over KGEP.
These results demonstrate that EP-LPI generates more accurate explanations than methods relying on reinforcement learning or subgraph extraction.
Model Compatibility
Tests with four embedding models (TransE, RotatE, DistMult, ComplEx) showed consistent performance, confirming EP-LPI’s model-agnostic nature. RotatE yielded the best results, but all models benefited from the bidirectional search strategy.
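This interchangeability is straightforward because EP-LPI only needs a scorer with a common interface. The sketch below shows TransE and DistMult scoring functions behind one signature; the vectors are random stand-ins, not trained embeddings:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE: -||h + r - t|| (higher is more plausible)."""
    return -float(np.linalg.norm(h + r - t))

def distmult_score(h, r, t):
    """DistMult: trilinear product sum(h * r * t)."""
    return float(np.sum(h * r * t))

rng = np.random.default_rng(1)
h, r, t = (rng.normal(size=8) for _ in range(3))

# Either scorer can be dropped into the perturbation loop unchanged.
scores = {fn.__name__: fn(h, r, t) for fn in (transe_score, distmult_score)}
```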
Case Study: Drug Repurposing
In BIOS, EP-LPI generated explanations for drug-disease predictions:
- Tacrolimus → Atopic Dermatitis: The path showed that tacrolimus treats lichen simplex, a condition similar to atopic dermatitis, justifying its repurposing potential.
- Cyclosporine → Atopic Dermatitis: The explanation linked cyclosporine’s use in pediatric atopic dermatitis to adult cases.
- Vancomycin → Cholangitis: The path highlighted vancomycin’s role in combination therapy with piperacillin for infections like cholangitis.
- Certolizumab → Inflammatory Bowel Disease (IBD): The explanation connected certolizumab to infliximab, a known IBD treatment.
These examples illustrate how EP-LPI provides actionable insights for medical professionals by tracing the reasoning behind predictions.
Discussion
Strengths of EP-LPI
- Interpretability: The method produces coherent paths instead of disconnected subgraphs, making explanations easier to understand.
- Scalability: Subgraph construction and bidirectional search make the approach feasible for large KGs.
- Flexibility: The framework can be adapted to various domains, from healthcare to finance.
Limitations and Future Work
- Quantitative Evaluation: Currently, explanations are assessed qualitatively or against ground-truth paths. Developing metrics to quantify explanation quality is an open challenge.
- Path Length: The method assumes that shorter paths are more interpretable, but some domains may require longer reasoning chains.
- Dynamic KGs: EP-LPI does not yet handle temporal or evolving graphs, which are common in real-world applications.
Future research could explore:
• Integrating logic-based rules to validate explanation paths.
• Extending the method to dynamic KGs for applications like fraud detection.
• Combining EP-LPI with natural language generation to produce textual explanations.
Conclusion
EP-LPI advances the interpretability of link prediction models by introducing a perturbation-based approach that identifies influential edges and constructs meaningful explanation paths. The method is model-agnostic, efficient, and capable of handling sparse KGs. Experiments on familial and biomedical datasets demonstrate its superiority over existing techniques in accuracy and practical utility. By making predictions more transparent, EP-LPI enhances trust in KG-based systems and supports decision-making in critical domains like healthcare.
For further details, refer to the original work at doi.org/10.19734/j.issn.1001-3695.2024.07.0287.