DTrace: Dynamic Traceback Learning for Medical Report Generation

The University of Sydney
Macquarie University

Abstract

Automated medical report generation has the potential to significantly reduce the workload associated with the time-consuming process of medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, leading to performance degradation in zero-shot settings when only images are available. To address these challenges, this study proposes a novel multi-modal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy that adapts to varying proportions of image and text input, enabling text generation without strong reliance on both modalities during inference. Cross-modal knowledge learning is enhanced by supervising the model to recover masked semantic information from the complementary modality. Extensive experiments on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
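To make the dynamic learning strategy and the traceback mechanism described above more concrete, the sketch below shows one plausible way such a training step could look in PyTorch. It is a minimal illustration, not the paper's implementation: the toy encoder-decoder, the linear text-masking schedule, the fixed image-masking ratio, the use of token id 0 as the mask id, the equal loss weighting, and the argmax-based traceback pass are all simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCrossModalModel(nn.Module):
    """Stand-in encoder-decoder: embeds image patches and report tokens,
    fuses them with a small Transformer, and decodes both modalities."""

    def __init__(self, patch_dim=768, vocab_size=10000, d_model=512):
        super().__init__()
        self.img_embed = nn.Linear(patch_dim, d_model)
        self.txt_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.img_head = nn.Linear(d_model, patch_dim)   # patch reconstruction
        self.txt_head = nn.Linear(d_model, vocab_size)  # report token prediction

    def forward(self, patches, tokens):
        x = torch.cat([self.img_embed(patches), self.txt_embed(tokens)], dim=1)
        h = self.fusion(x)
        n_img = patches.size(1)
        return self.img_head(h[:, :n_img]), self.txt_head(h[:, n_img:])


def mask_fraction(x, ratio):
    """Zero out a random `ratio` fraction of positions along the sequence dim."""
    keep = torch.rand(x.shape[:2], device=x.device) >= ratio
    if x.dim() == 3:                # image patches: (B, N, D)
        return x * keep.unsqueeze(-1)
    return x * keep.long()          # token ids: masked positions become id 0 (assumed mask id)


def training_step(model, patches, tokens, step, total_steps):
    # Dynamic strategy (illustrative): progressively mask more of the report so
    # supervision shifts from text-assisted generation toward image-only
    # generation, reducing reliance on text input at inference time.
    txt_ratio = min(1.0, step / total_steps)
    img_ratio = 0.5                 # fixed here purely for simplicity

    rec_patches, logits = model(mask_fraction(patches, img_ratio),
                                mask_fraction(tokens, txt_ratio))

    # Recover each modality from its masked input plus the complementary one.
    img_loss = F.mse_loss(rec_patches, patches)
    txt_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens.reshape(-1))

    # Traceback pass (illustrative): feed the generated report back with the
    # image fully masked and check that the original image is still recoverable,
    # supervising the semantic validity of the generated text. A soft or
    # straight-through re-embedding would be needed to backpropagate through
    # the argmax below; it is omitted to keep the sketch short.
    pseudo_tokens = logits.argmax(dim=-1)
    traced_patches, _ = model(torch.zeros_like(patches), pseudo_tokens)
    trace_loss = F.mse_loss(traced_patches, patches)

    return img_loss + txt_loss + trace_loss

In a standard training loop, training_step would be called once per batch and the returned loss backpropagated; the relative weighting of the three terms is a design choice not specified here.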

State-of-the-art performance in disease discovery

DTrace addresses the limitations of existing encoder–decoder frameworks by reliably capturing critical diagnostic details that are often overlooked.


In addition, DTrace demonstrates strong capabilities in image reconstruction. Even when reconstructing from images with 75% of pixels masked, DTrace produces cohesive and semantically faithful outputs that preserve both morphological and clinical consistency.
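For intuition about what 75% masking means at the input level, the following is a minimal sketch of random patch masking in the style of masked image modelling; the 16x16 patch size and the zeroing-out of masked patches are assumptions for illustration, not details taken from the paper.

import torch

def mask_image_patches(image, patch_size=16, mask_ratio=0.75):
    """Split an image into non-overlapping patches and randomly zero out
    a `mask_ratio` fraction of them (patch size is an assumed value)."""
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.reshape(c, -1, patch_size, patch_size)    # (C, N, p, p)
    keep = torch.rand(patches.size(1)) >= mask_ratio            # ~25% of patches survive
    masked = patches * keep.view(1, -1, 1, 1)
    return masked, keep

# Example: a 224x224 single-channel image yields 196 patches, ~49 left visible.
masked, keep = mask_image_patches(torch.randn(1, 224, 224))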


Interpretability analysis

To investigate how reconstructed images influence textual outputs, we conducted an interpretability analysis that reveals the close association between visual semantics and generated reports. These examples highlight that reconstructed images serve as a transparent window into DTrace’s decision-making process, showcasing both its strength in capturing nuanced clinical semantics and the challenges that arise when visual cues are not faithfully preserved.


BibTeX

      
@article{ye2024dtrace,
  title={Dynamic Traceback Learning for Medical Report Generation},
  author={Ye, Shuchang and Meng, Mingyuan and Li, Mingjian and Feng, Dagan and Naseem, Usman and Kim, Jinman},
  journal={arXiv preprint arXiv:2401.13267},
  year={2024}
}