Introduction
The advancement of information technology has significantly transformed the healthcare sector, with many medical institutions adopting intelligent processes and online consultation channels to improve access to medical resources. Medical consultation involves a complex body of knowledge with intricate relationships, necessitating structured parsing and storage of authoritative medical texts (e.g., medical books) and patient-related information (e.g., patient statements and consultation records). Structured parsing enables the extraction of medical concepts and their relationships, facilitating knowledge updates and iterative improvements in the field.
Traditional approaches for structured parsing in medical consultation systems rely on rule-based methods, traditional machine learning algorithms, or pipeline-based techniques. However, these methods suffer from high manual costs, poor generalization, error propagation, and low recall rates. To address these challenges, this paper proposes a structured parsing method that integrates prior knowledge into deep learning models, enhancing their ability to understand medical texts and extract relevant concepts and relationships.
Prior Knowledge Acquisition and Resource Storage
Resource Acquisition and Repository Construction
Prior knowledge in medical consultation systems provides semantic and structural insights, enabling algorithms to learn disease- and symptom-related features. This paper focuses on diabetes as a case study, extracting resources from authoritative medical books and patient statements. The structured storage format used is XML (eXtensible Markup Language), which organizes data hierarchically. For example, medical books are parsed into tree-like structures where the root represents the book title, and subsequent nodes correspond to chapters and sections.
Patient statements, including self-reports and doctor-patient dialogues, are also structured using XML schemas. These resources are categorized into symptoms, diseases, examinations, medications, and other relevant fields to support concept and relationship extraction.
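The tree-like XML layout described above can be sketched as follows. This is a minimal illustration using Python's standard library; the element and attribute names (`book`, `chapter`, `section`, `title`) and the sample chapter contents are assumptions, not the paper's actual schema.

```python
# Sketch of the XML repository layout: a medical book is stored as a tree
# whose root is the book title and whose children are chapters and sections.
import xml.etree.ElementTree as ET

def build_book_tree(title, chapters):
    """chapters: list of (chapter_title, [section_titles]) pairs."""
    root = ET.Element("book", title=title)
    for ch_title, sections in chapters:
        ch = ET.SubElement(root, "chapter", title=ch_title)
        for sec_title in sections:
            ET.SubElement(ch, "section", title=sec_title)
    return root

book = build_book_tree(
    "Clinical Diabetes",
    [("Diagnosis", ["Blood glucose testing", "HbA1c"]),
     ("Treatment", ["Oral medication", "Insulin therapy"])],
)
xml_text = ET.tostring(book, encoding="unicode")
```

The same hierarchical pattern extends naturally to patient statements, with dialogue turns or report fields taking the place of chapters and sections.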
Construction of Prior Knowledge Dictionaries
A key component of prior knowledge is the domain-specific dictionary, which contains medical concepts such as symptoms, diseases, and treatments. The dictionary is constructed using:
- Standardized Medical Terminologies: Sources like Common Clinical Medical Terms (2019 Edition) and Chinese-English Diagnostic Dictionary provide structured terms.
- Semi-Structured Medical Encyclopedias: Websites such as medical question-and-answer platforms and symptom databases are scraped to extract relevant terms.
- Patient Statements: These are annotated with additional fields (e.g., symptom severity, duration, and negation) to capture nuanced medical expressions.
The resulting dictionary includes categories such as symptoms (SYM), diseases (DIS), examinations (CHK), medications (DRG), and temporal information (TIM). This dictionary supports remote supervision for annotating training data, reducing reliance on manual labeling.
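A minimal sketch of how such a dictionary can drive remote (distant) supervision: terms found in raw text are auto-labeled with their dictionary category, yielding BIO-style training labels without manual annotation. The category tags (SYM, DIS, CHK, DRG) follow the text; the sample entries and the greedy longest-match strategy are assumptions.

```python
# Distant-supervision labeling with a prior dictionary (illustrative entries).
PRIOR_DICT = {
    "polyuria": "SYM",
    "type 2 diabetes": "DIS",
    "fasting glucose": "CHK",
    "metformin": "DRG",
}

def annotate(tokens):
    """Greedy longest-match labeling of a token list with BIO tags."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # try the longest candidate span first
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j]).lower()
            if phrase in PRIOR_DICT:
                cat = PRIOR_DICT[phrase]
                labels[i] = "B-" + cat
                for k in range(i + 1, j):
                    labels[k] = "I-" + cat
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return labels

tags = annotate("patient with type 2 diabetes takes metformin".split())
# → ["O", "O", "B-DIS", "I-DIS", "I-DIS", "O", "B-DRG"]
```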
Knowledge Network Generation
To extract relationships between concepts, this paper introduces a Dynamic Window Algorithm for Relation Extraction (DWARE). Unlike fixed-window methods, DWARE dynamically adjusts the context window based on the relationship type between two concepts (e.g., symptom-disease or symptom-medication). The algorithm processes text by:
- Sentence Segmentation: Dividing documents into sentences.
- Tokenization and Concept Extraction: Identifying medical concepts using the prior dictionary.
- Dynamic Window Adjustment: Determining the relationship span between concepts based on predefined rules.
The extracted relationships are stored as triples of the form (concept A, relation, concept B) (e.g., (symptom, indicates, disease)), forming a knowledge network where nodes represent concepts and edges represent relationships. This network serves as training data for relationship extraction models.
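The core idea of DWARE can be sketched as follows: the co-occurrence window within which a triple is emitted depends on the category pair of the two concepts. The specific window sizes, relation names, and rule table below are illustrative assumptions; the paper's actual rules are not reproduced here.

```python
# Dynamic-window triple extraction (sketch): window size and relation label
# are looked up per (category, category) pair rather than fixed globally.
WINDOW_RULES = {
    ("SYM", "DIS"): (30, "indicates"),    # symptom-disease: narrow window
    ("DIS", "DRG"): (50, "treated_by"),   # disease-medication: wider window
    ("SYM", "DRG"): (20, "relieved_by"),
}

def extract_triples(concepts):
    """concepts: list of (surface, category, char_offset) within one sentence."""
    triples = []
    for a, cat_a, pos_a in concepts:
        for b, cat_b, pos_b in concepts:
            rule = WINDOW_RULES.get((cat_a, cat_b))
            # emit a triple only if b follows a within the rule's window
            if rule and 0 < pos_b - pos_a <= rule[0]:
                triples.append((a, rule[1], b))
    return triples

triples = extract_triples([
    ("polyuria", "SYM", 10),
    ("diabetes", "DIS", 35),
    ("metformin", "DRG", 60),
])
# polyuria-diabetes (gap 25) and diabetes-metformin (gap 25) qualify;
# polyuria-metformin (gap 50) exceeds the SYM-DRG window and is dropped.
```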
Prior Knowledge-Enhanced Structured Parsing Models
MedReBERT: A Prior Knowledge-Enhanced BERT Model
BERT (Bidirectional Encoder Representations from Transformers) is a powerful pre-trained language model but lacks domain-specific medical knowledge. To enhance its performance, this paper introduces MedReBERT, which integrates medical prior knowledge through two tasks:
- Masked Language Modeling (MLM): Randomly masking words in medical texts and predicting them based on context.
- Entity Ranking Task: Replacing correct medical concepts with incorrect ones and training the model to rank the correct concept higher using a margin ranking loss.
By jointly optimizing these tasks, MedReBERT learns to associate medical terms with their correct contexts, improving concept and relationship extraction.
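The entity-ranking objective can be sketched as a margin ranking loss: the model's score for the correct concept in context should exceed its score for the corrupted (replaced) concept by at least a margin. This is a pure-Python sketch of the loss alone; in MedReBERT the scores would come from the BERT encoder, which is omitted here, and the margin value is an assumption.

```python
# Margin ranking loss: penalize pairs where the correct concept's score does
# not beat the corrupted concept's score by at least `margin`.
def margin_ranking_loss(scores_pos, scores_neg, margin=1.0):
    """Mean of max(0, margin - (s_pos - s_neg)) over scored pairs."""
    losses = [max(0.0, margin - (p - n))
              for p, n in zip(scores_pos, scores_neg)]
    return sum(losses) / len(losses)

# correct concept already ranked well above the corrupted one -> zero loss
loss_easy = margin_ranking_loss([3.0], [0.5])
# scores too close -> positive loss pushes them apart during training
loss_hard = margin_ranking_loss([1.0], [0.8])
```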
Concept Indexing Model
Traditional sequence labeling models require large annotated datasets. Instead, this paper reformulates concept extraction as a cloze-style task, where the model predicts the category of a masked concept in a given sentence. For example:
- Input Template: “Patient has [MASK][MASK].”
- Prediction: The model fills the masked slot with the category of the mentioned concept (e.g., “hyperglycemia” → “SYM”).
This approach leverages pre-trained knowledge, enabling accurate concept extraction even with limited labeled data.
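A minimal mock of this cloze formulation: a template is instantiated with the concept, and the prediction for the masked slot is mapped to a concept category. The template wording and the lookup table below stand in for the pre-trained model's prediction head and are assumptions for illustration only.

```python
# Cloze-style concept categorization (mock): the category slot is masked and
# a "model" fills it in. A dictionary plays the role of the model here.
TEMPLATE = "Patient has {concept}. The category of {concept} is [MASK][MASK]."
CATEGORY_OF = {"hyperglycemia": "SYM", "retinopathy": "DIS"}

def classify_concept(concept):
    prompt = TEMPLATE.format(concept=concept)
    # a real model would predict the [MASK][MASK] tokens from `prompt`;
    # here the lookup table supplies that prediction
    return prompt, CATEGORY_OF.get(concept, "UNK")

prompt, category = classify_concept("hyperglycemia")
# → category "SYM"
```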
Concept Relationship Indexing Model
Rather than treating relationship extraction as a separate pipeline, this paper unifies concept and relationship extraction using a text generation framework. Three templates are designed:
- Template 1: “Concept A and Concept B have a [MASK][MASK] relationship.”
- Template 2: “Concept A’s [MASK][MASK] is Concept B.”
- Template 3: “The relationship between Concept A and Concept B is [MASK][MASK].”
The model predicts the masked relationship (e.g., “treatment” or “symptom”), reducing error propagation from concept extraction.
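The three templates above can be instantiated for any candidate concept pair before the model fills in the masked relation. The template wording follows the text; the concept pair used below is illustrative.

```python
# Build the three relation prompts for a concept pair; the double [MASK]
# slot is left for the model to fill with a relation name.
TEMPLATES = [
    "{a} and {b} have a [MASK][MASK] relationship.",
    "{a}'s [MASK][MASK] is {b}.",
    "The relationship between {a} and {b} is [MASK][MASK].",
]

def build_prompts(concept_a, concept_b):
    return [t.format(a=concept_a, b=concept_b) for t in TEMPLATES]

prompts = build_prompts("metformin", "type 2 diabetes")
```

Averaging or voting over the model's predictions across the three prompts is one natural way to make the final relation label more robust, though the paper's exact aggregation strategy is not stated here.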
Engineering and Algorithm Collaborative Learning Framework
Framework Design
To streamline the structured parsing workflow, this paper proposes a collaborative framework integrating engineering processes and algorithmic training. Key components include:
- Resource Acquisition Platform: Crawls and processes raw medical texts.
- Knowledge Enhancement Plugins: Applies prior dictionaries and rules to generate training data.
- Algorithm Training Interface: Enables model training via RESTful APIs, facilitating iterative improvements.
This framework ensures seamless integration between data preprocessing, model training, and deployment, reducing maintenance costs and improving scalability.
RESTful API-Based Service
The framework employs FastAPI to provide RESTful web services, allowing users to:
- Submit raw text for parsing.
- Train and update models via standardized interfaces.
- Retrieve structured outputs (e.g., concepts and relationships) in JSON format.
This approach ensures interoperability and ease of integration with existing medical systems.
Experimental Results
Concept Extraction Performance
Experiments compare the proposed MedReBERT with baseline models (BERT and ERNIE) on small annotated datasets. Results show:
- MedReBERT achieves F1 scores of 0.86–0.91, significantly outperforming BERT (0.05–0.16) and ERNIE (0.12–0.19).
- The cloze-style task effectively leverages prior knowledge, enabling high accuracy with minimal training data.
Relationship Extraction Performance
Evaluation on medical books and patient statements reveals:
- Medical Books: MedReBERT achieves F1 scores of 0.83–0.94, demonstrating strong performance on structured texts.
- Patient Statements: Performance is lower (F1: 0.66–0.67) due to sparse concept relationships in informal language.
Compared to rule-based methods (F1: 0.48), MedReBERT improves relationship extraction by 13%–27%, highlighting the benefits of prior knowledge integration.
Conclusion
This paper presents a structured parsing method for medical consultation systems, combining prior knowledge with deep learning to extract concepts and relationships. Key contributions include:
- Prior Knowledge Integration: MedReBERT enhances BERT with medical dictionaries and entity ranking.
- Cloze-Style Concept Extraction: Reformulates labeling as a text generation task, reducing reliance on annotated data.
- Unified Relationship Extraction: Avoids pipeline errors by jointly predicting concepts and relationships.
- Collaborative Framework: Streamlines algorithm deployment and iteration via RESTful APIs.
Future work will explore large-scale pre-trained models for long-text understanding and expand applications to electronic health records and clinical notes.
DOI: 10.19734/j.issn.1001-3695.2024.07.0263