ProLearn: Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

Shuchang Ye, Usman Naseem, Mingyuan Meng, Jinman Kim
The University of Sydney · Macquarie University

Abstract

Medical language-guided segmentation, which integrates textual clinical reports as auxiliary guidance to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as "textual reliance", presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is restricted to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios, where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At the core of ProLearn, we introduce a novel Prototype-driven Semantic Approximation (PSA) module that approximates the semantic guidance normally derived from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism that approximates semantic guidance for images without paired text, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19, MosMedData+, and Kvasir-SEG demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
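To make the query-and-respond mechanism above concrete, below is a minimal PyTorch-style sketch under our own assumptions: the class name, the prototype count and feature dimension, and the cosine-similarity attention are illustrative choices, not the paper's exact PSA implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PSASketch(nn.Module):
    """Illustrative prototype-driven semantic approximation (not the official code)."""

    def __init__(self, num_prototypes: int = 64, dim: int = 256):
        super().__init__()
        # Discrete, compact prototype space, distilled from report semantics during training.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def distill(self, text_feats: torch.Tensor) -> torch.Tensor:
        # Training-time: softly assign report features (B, D) to prototypes,
        # pulling the prototype space toward segmentation-relevant text semantics.
        sim = F.normalize(text_feats, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        return sim.softmax(dim=-1) @ self.prototypes  # (B, D) reconstructed semantics

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # Inference-time query-and-respond: image features (B, N, D) query the
        # prototype bank; the response approximates the missing textual guidance.
        q = F.normalize(image_feats, dim=-1)
        k = F.normalize(self.prototypes, dim=-1)
        attn = (q @ k.T).softmax(dim=-1)              # (B, N, P)
        return attn @ self.prototypes                 # (B, N, D)

Attending over a small, fixed prototype bank keeps the approximation differentiable during training and cheap at inference, since no text encoder is needed once the prototype space is initialized.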

Performance degradation under limited text availability

To emulate real-world conditions where image-report pairing is incomplete, we simulate decreasing access to paired data (50%, 25%, 10%, 5%, and 1%) and benchmark ProLearn against state-of-the-art language-guided models. Unlike existing methods, which degrade significantly as paired data drops, ProLearn consistently achieves superior performance. For example, on MosMedData+ with only 1% text, ProLearn achieves a Dice score of 0.7218, surpassing SGSeg (0.3452) and LViT (0.1677). This highlights its strong generalization under sparse textual supervision.
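As a concrete illustration of this protocol, the sketch below retains reports for only a chosen fraction of training images; the function name and the random split are our own assumptions, not necessarily the paper's exact sampling procedure.

import random

def subsample_reports(pairs, text_ratio: float, seed: int = 0):
    """Keep paired reports for only `text_ratio` of training images;
    the remainder become image-only samples.

    `pairs` is a list of (image_path, report_text) tuples.
    """
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(pairs)), k=int(len(pairs) * text_ratio)))
    return [(img, txt if i in keep else None) for i, (img, txt) in enumerate(pairs)]

# e.g., retain reports for 10% of the training pairs:
# limited = subsample_reports(train_pairs, text_ratio=0.10)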

[Figure: Degradation — Dice scores of ProLearn and baseline methods as paired-text availability decreases from 50% to 1%]

Qualitative and interpretability analysis

Visual comparisons confirm that ProLearn maintains segmentation quality even when text guidance is unavailable, whereas existing methods suffer degraded localization and diffused saliency. ProLearn's attention maps remain coherent and lesion-focused, enabled by its semantic approximation mechanism.

[Figure: Visualization — qualitative segmentation results and attention maps of ProLearn versus existing methods]

BibTeX

@misc{ye2025prolearn,
  title={ProLearn: Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation},
  author={Shuchang Ye and Usman Naseem and Mingyuan Meng and Jinman Kim},
  year={2025},
  eprint={2507.11055},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.11055},
}