Medical Imaging Report Generation via Multi-Modal Large Language Models with Discrimination-Enhanced Fine-Tuning

Introduction

Automated medical imaging report generation has emerged as a crucial technology to enhance radiologists’ efficiency in clinical practice. Traditional approaches for generating medical imaging reports primarily rely on either classification-based methods or image captioning models. While these methods have shown some success, they often suffer from limitations in accuracy, fluency, and diversity. The advent of large language models (LLMs) and multi-modal architectures presents a promising solution to these challenges by leveraging their powerful semantic understanding and generation capabilities.

This paper introduces MedVLM, a novel discrimination-enhanced fine-tuning method for medical imaging report generation, built upon the pre-trained multi-modal VisualGLM-6B model. The proposed approach integrates diagnostic classification as an auxiliary objective during fine-tuning, enhancing both the quality of generated reports and the model’s diagnostic accuracy. By employing parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA), P-Tuning V2, and freeze fine-tuning, MedVLM optimizes the feature extraction, vision-language alignment, and language generation modules of the underlying multi-modal LLM. The experimental results demonstrate significant improvements over traditional image captioning methods, achieving superior performance in both text generation metrics and diagnostic accuracy for pneumonia detection in lung CT scans.

Background and Motivation

Medical imaging reports play a vital role in clinical decision-making by providing objective descriptions and diagnostic interpretations of radiological findings. However, the volume of medical imaging data has grown faster than the radiologist workforce, creating a pressing need for automated tools to assist in report generation. Existing methods can be broadly categorized into three groups:

  1. Template-Based Methods: These approaches rely on structured or semi-structured templates to generate reports. While simple to implement, they lack flexibility and diversity because each disease requires its own template.
  2. Image Captioning Models: These encoder-decoder architectures generate free-text reports but often struggle with accuracy and clinical relevance, particularly in medical domains where subtle abnormalities require precise descriptions.
  3. Pre-trained Multi-modal LLMs: Recent advancements in vision-language models, such as BLIP-2 and VisualGLM, offer a more robust solution by leveraging large-scale pre-training and fine-tuning for domain-specific tasks.

Inspired by radiologists’ workflow—where image interpretation and diagnosis are interdependent—MedVLM incorporates a discrimination-enhanced module to improve report generation. By conditioning the model on diagnostic labels, the generated reports become more clinically accurate and aligned with the underlying pathology.

Methodology

Model Architecture

MedVLM consists of four key components:

  1. Image Feature Extraction: A pre-trained Vision Transformer (ViT) extracts visual features from CT scans. The last layer of the ViT is fine-tuned to adapt to medical imaging characteristics.
  2. Discrimination-Enhanced Module: This auxiliary component processes the extracted image features to predict diagnostic labels (e.g., pneumonia presence). The predicted labels are then embedded and concatenated with the original visual features to guide report generation.
  3. Vision-Language Alignment: A query Transformer (Q-Former) module maps visual features into the language model’s embedding space. This module is fully fine-tuned to ensure optimal alignment between visual and textual representations.
  4. Large Language Model: The ChatGLM-6B model generates the final report based on the aligned visual-textual inputs. To preserve the model’s general capabilities while adapting to medical reporting, selective fine-tuning strategies are applied to specific layers.
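The four-stage data flow described above can be sketched with NumPy stand-ins. Everything here is illustrative: the dimensions, function names, and placeholder computations are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of the MedVLM pipeline; all dimensions and
# internals are placeholder assumptions, not the real modules.
import numpy as np

rng = np.random.default_rng(0)

def vit_features(image):
    """Stage 1: stand-in for the ViT encoder (last layer fine-tuned)."""
    return rng.standard_normal((257, 1024))  # [patch tokens, feature dim]

def discriminate(features):
    """Stage 2: auxiliary classifier predicting a diagnostic probability."""
    logit = features.mean()                  # placeholder for a trained head
    return 1.0 / (1.0 + np.exp(-logit))

def align(features, label_embedding):
    """Stage 3: stand-in for the Q-Former producing LLM-space inputs."""
    tiled = np.tile(label_embedding, (features.shape[0], 1))
    fused = np.concatenate([features, tiled], axis=1)
    proj = rng.standard_normal((fused.shape[1], 4096))  # project to LLM dim
    return fused @ proj

def generate(llm_inputs):
    """Stage 4: stand-in for ChatGLM-6B conditioned on aligned features."""
    return "generated report (placeholder)"

image = np.zeros((512, 512))
feats = vit_features(image)
p = discriminate(feats)
label_emb = p * rng.standard_normal(64)      # toy label embedding
report = generate(align(feats, label_emb))
```

The point of the sketch is the wiring, not the math: the classifier's output feeds back into the visual stream before alignment, which is what distinguishes this architecture from a plain captioning pipeline.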

Discrimination-Enhanced Fine-Tuning

The discrimination-enhanced module is central to MedVLM’s performance improvement. It operates as follows:

  1. The image features are passed through a pre-trained classifier to obtain diagnostic probabilities (e.g., pneumonia likelihood).
  2. These probabilities are transformed into label embeddings and concatenated with the original visual features.
  3. A projection layer ensures the combined features maintain the required dimensionality for subsequent processing.

This approach ensures that the generated reports are not only fluent but also clinically relevant, as the model is explicitly guided by diagnostic information.
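The three steps above can be made concrete with a minimal NumPy sketch. The dimensions, the two-row label table, and the linear classifier are all illustrative assumptions; the paper's actual classifier and embedding scheme are not specified here.

```python
# Illustrative sketch of discrimination-enhanced feature fusion;
# shapes and weights are toy assumptions, not the trained model.
import numpy as np

rng = np.random.default_rng(42)
num_patches, feat_dim, emb_dim = 257, 1024, 64

# Step 1: a (stand-in) pre-trained classifier yields a diagnostic probability.
visual = rng.standard_normal((num_patches, feat_dim))
w_cls = rng.standard_normal(feat_dim) * 0.01
p_pneumonia = 1.0 / (1.0 + np.exp(-(visual.mean(axis=0) @ w_cls)))

# Step 2: probability -> label embedding (soft mix of negative/positive rows),
# broadcast to every visual token and concatenated with the features.
label_table = rng.standard_normal((2, emb_dim))   # [negative, positive]
label_emb = (1 - p_pneumonia) * label_table[0] + p_pneumonia * label_table[1]
fused = np.concatenate([visual, np.tile(label_emb, (num_patches, 1))], axis=1)

# Step 3: a projection restores the dimensionality expected downstream.
w_proj = rng.standard_normal((feat_dim + emb_dim, feat_dim)) * 0.01
out = fused @ w_proj                              # back to (tokens, feat_dim)
```

Using a soft mix of label embeddings (rather than a hard argmax) lets the classifier's uncertainty propagate into generation, though either choice is compatible with the steps described above.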

Parameter-Efficient Fine-Tuning Strategies

To mitigate the computational challenges of fine-tuning large models, MedVLM employs several parameter-efficient techniques:

  1. Freeze Fine-Tuning: Only specific layers (e.g., the last ViT layer or selected LLM layers) are updated while keeping others frozen. This reduces training costs while maintaining model stability.
  2. LoRA (Low-Rank Adaptation): Low-rank matrices are introduced alongside existing weights, allowing efficient updates without modifying the full parameter set.
  3. P-Tuning V2: Trainable virtual token embeddings are prepended to each Transformer layer’s input, enabling task-specific adaptation without extensive retraining.

These strategies ensure that MedVLM remains computationally feasible while achieving high performance in medical report generation.
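Of the three strategies, LoRA is the easiest to illustrate in isolation. The sketch below shows the core idea in NumPy: the frozen weight W is augmented with a trainable low-rank product B·A, so only r·(d_in + d_out) parameters are updated. Dimensions and the rank are arbitrary assumptions.

```python
# Minimal LoRA illustration: frozen W plus trainable low-rank update A, B.
# All sizes are illustrative; real adapters attach to attention projections.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8

W = rng.standard_normal((d_in, d_out))      # frozen pre-trained weight
A = rng.standard_normal((d_in, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d_out))                    # trainable up-projection (init 0)

def lora_forward(x, scale=1.0):
    # B starts at zero, so the adapted layer initially equals the frozen one.
    return x @ W + scale * (x @ A @ B)

x = rng.standard_normal((1, d_in))
trainable = A.size + B.size                 # r*(d_in + d_out) parameters
```

The zero initialization of B is the standard trick: at step zero the adapted model reproduces the pre-trained one exactly, and training only gradually moves it away, which keeps fine-tuning stable.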

Experiments and Results

Dataset

The experiments were conducted on the COV-CTR dataset, comprising 726 lung CT scans paired with radiology reports and binary pneumonia labels. The dataset was split into training (80%) and test (20%) sets. Each report includes detailed findings and a diagnostic conclusion (impression).

Evaluation Metrics

MedVLM was evaluated using:
• Text Generation Metrics: BLEU-4 and METEOR scores assess the fluency and relevance of generated reports.

• Diagnostic Accuracy: A separate BERT-based classifier evaluated whether the generated reports correctly identified pneumonia.
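To make the text-generation side of the evaluation concrete, here is a compact, self-contained sentence-level BLEU-4 (clipped n-gram precisions with a brevity penalty). It is a sketch of the standard metric, not the exact evaluation script used in the paper; smoothing is omitted for brevity.

```python
# Sentence-level BLEU-4 sketch: clipped n-gram precision + brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand_ng, ref_ng = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ng & ref_ng).values())   # counts clipped by reference
        total = max(sum(cand_ng.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:                         # no smoothing in this sketch
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / 4
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)

ref = "scattered ground glass opacities in both lower lobes"
score = bleu4(ref, ref)   # identical strings score 1.0
```

In practice one would use an established implementation (e.g. sacreBLEU or NLTK) so that scores are comparable across papers; this sketch only shows what the number measures.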

Results

MedVLM achieved:
• BLEU-4: 40.85% (40.41%–40.94%)

• METEOR: 70.56% (70.37%–70.8%)

• Pneumonia Diagnosis Accuracy: 87.67% (86.06%–87.39%)

These results significantly outperformed traditional CNN-RNN and CNN-Transformer baselines. Key findings include:

  1. Impact of Discrimination Enhancement: Models with the discrimination-enhanced module consistently outperformed those without it, demonstrating the value of diagnostic guidance.
  2. Fine-Tuning Strategy Comparison: Freezing and fine-tuning the query-Transformer yielded the best results, as it optimally aligned visual and textual features.
  3. Effect of Prompt Templates: Incorporating clinical question templates (e.g., “Describe the findings in this CT scan”) improved performance by aligning inputs with the LLM’s pre-training distribution.
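Finding 3 above amounts to simple prompt construction. A hypothetical template of the kind described might look like the following; the exact wording and image-token convention used in the paper are assumptions here.

```python
# Hypothetical prompt template; the actual template text is not specified
# in this summary, so both the wording and "<image>" token are assumptions.
def build_prompt(image_token: str = "<image>") -> str:
    return f"{image_token} Question: Describe the findings in this CT scan. Answer:"

prompt = build_prompt()
```

Phrasing the input as a question-answer pair matches the instruction-style data the underlying LLM saw during pre-training, which is why such templates help.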

Qualitative Analysis

Generated reports from MedVLM were more detailed and clinically accurate compared to baseline models. For example, while CNN-Transformer often produced generic phrases like “increased lung texture,” MedVLM generated specific descriptions such as “scattered ground-glass opacities in the lower lobes, consistent with viral pneumonia.”

Discussion

Advantages of MedVLM

  1. Clinical Relevance: By integrating diagnostic labels, MedVLM ensures reports are not only fluent but also medically accurate.
  2. Parameter Efficiency: Selective fine-tuning strategies reduce computational costs while maintaining performance.
  3. Scalability: The approach can be extended to other imaging modalities (e.g., X-rays, MRIs) with minimal architectural changes.

Limitations and Future Work

  1. Expert Validation: While automated metrics show promise, expert radiologist evaluations are needed to assess clinical utility fully.
  2. Multi-Task Learning: Extending MedVLM to handle multiple diseases simultaneously could improve robustness.
  3. Knowledge Integration: Incorporating external knowledge bases (e.g., medical ontologies) could further enhance report quality.

Conclusion

MedVLM represents a significant advancement in automated medical imaging report generation by combining multi-modal LLMs with discrimination-enhanced fine-tuning. The method’s ability to generate accurate, fluent, and clinically relevant reports demonstrates its potential to assist radiologists in routine practice. Future work will focus on expanding the model’s capabilities to broader medical imaging domains and improving its integration with clinical workflows.

doi.org/10.19734/j.issn.1001-3695.2024.08.0303
