SGSeg: Enabling Text-free Inference in Language-guided Segmentation of Chest X-rays via Self-guidance

The University of Sydney
Shanghai Jiao Tong University

Abstract

Segmentation of infected areas in chest X-rays is pivotal for facilitating the accurate delineation of pulmonary structures and pathological anomalies. Recently, multi-modal language-guided image segmentation methods have emerged as a promising solution for chest X-rays, where clinical text reports describing the assessment of the images are used as guidance. Nevertheless, existing language-guided methods require clinical reports alongside the images and are therefore not applicable to image segmentation in a decision support context; instead, they are limited to retrospective image analysis after clinical reporting has been completed. In this study, we propose a self-guided segmentation framework (SGSeg) that leverages language guidance for training (multi-modal) while enabling text-free inference (uni-modal), the first framework to enable text-free inference in language-guided segmentation. We exploit the critical location information of both pulmonary and pathological structures depicted in the text reports and introduce a novel localization-enhanced report generation (LERG) module to generate clinical reports for self-guidance. Our LERG integrates an object detector and a location-based attention aggregator, weakly supervised by a location-aware pseudo-label extraction module. Extensive experiments on the well-benchmarked QaTa-COV19 dataset demonstrate that SGSeg achieves superior performance over existing uni-modal segmentation methods and closely matches the state-of-the-art performance of multi-modal language-guided segmentation methods.
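To make the self-guidance idea concrete, the sketch below shows how language guidance can be confined to training while inference remains text-free: the model generates its own report features from the image and feeds them back as guidance. This is a minimal, hypothetical PyTorch-style sketch, not the SGSeg implementation; module and parameter names such as SelfGuidedSegmenter, report_generator, and text_projector are assumptions, and the actual LERG module additionally involves an object detector and a location-based attention aggregator weakly supervised by location-aware pseudo-labels.

        # Minimal sketch of the self-guidance idea: language guidance is used only at
        # training time, while inference is text-free because the model generates its
        # own report features. All module and parameter names here are hypothetical.
        import torch
        import torch.nn as nn

        class SelfGuidedSegmenter(nn.Module):
            def __init__(self, feat_dim=256, vocab_size=1000, num_classes=1):
                super().__init__()
                self.image_encoder = nn.Sequential(          # stand-in visual backbone
                    nn.Conv2d(1, feat_dim, kernel_size=3, padding=1), nn.ReLU()
                )
                self.report_generator = nn.Linear(feat_dim, vocab_size)  # stand-in report head
                self.text_projector = nn.Linear(vocab_size, feat_dim)    # maps report logits to guidance
                self.seg_head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

            def forward(self, image):
                feats = self.image_encoder(image)                  # B x C x H x W visual features
                pooled = feats.mean(dim=(2, 3))                    # global image descriptor
                report_logits = self.report_generator(pooled)      # self-generated "report"
                guidance = self.text_projector(report_logits)      # text-free guidance vector
                guided = feats + guidance[:, :, None, None]        # fuse guidance into features
                return self.seg_head(guided), report_logits

        # Training would combine a segmentation loss with a report-generation loss;
        # at test time only the image is needed, so inference is text-free.
        model = SelfGuidedSegmenter()
        image = torch.randn(2, 1, 224, 224)
        mask_logits, report_logits = model(image)
        print(mask_logits.shape, report_logits.shape)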

Performance Comparison

Performance comparison between our SGSeg and existing uni-modal and multi-modal segmentation methods on the QaTa-COV19 dataset. Results illustrate that SGSeg exceeds the performance of conventional uni-modal methods and closely matches that of advanced multi-modal approaches.

$$ \begin{array}{llccc} \hline \text{Modality} & \text{Model} & \text{Accuracy} & \text{Dice} & \text{Jaccard} \\ \hline \text{Uni-Modal} & \text{U-Net} & 0.945 & 0.819 & 0.692 \\ & \text{U-Net++} & 0.947 & 0.823 & 0.706 \\ & \text{Attention U-Net} & 0.945 & 0.822 & 0.701 \\ & \text{Trans U-Net} & 0.939 & 0.806 & 0.687 \\ & \text{Swin U-Net} & 0.950 & 0.832 & 0.724 \\ \hline \text{Multi-Modal Train, Uni-Modal Inference} & \text{SGSeg (ours)} & \textbf{0.971} & \textbf{0.874} & \textbf{0.778} \\ \hline \text{Multi-Modal} & \text{LViT} & 0.962 & 0.837 & 0.751 \\ & \text{LanGuideSeg} & \underline{0.975} & \underline{0.898} & \underline{0.815} \\ \hline \end{array} $$
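For reference, the Dice and Jaccard scores in the table are standard overlap metrics between a predicted mask and the ground-truth mask. The short sketch below shows one common way to compute them from binary masks; the function name and the epsilon smoothing term are illustrative assumptions, not details taken from the paper.

        import numpy as np

        def dice_and_jaccard(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
            """Compute Dice and Jaccard (IoU) for two binary masks of the same shape."""
            pred = pred.astype(bool)
            target = target.astype(bool)
            intersection = np.logical_and(pred, target).sum()
            union = np.logical_or(pred, target).sum()
            dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
            jaccard = (intersection + eps) / (union + eps)
            return dice, jaccard

        # Example with toy masks
        pred = np.array([[1, 1, 0], [0, 1, 0]])
        target = np.array([[1, 0, 0], [0, 1, 1]])
        print(dice_and_jaccard(pred, target))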

Qualitative Analysis

Comparative analysis of segmentation results from uni-modal and multi-modal methods, illustrating the significant impact of textual information on segmentation accuracy, particularly in challenging cases.


Interpretability


BibTeX

@InProceedings{10.1007/978-3-031-72111-3_23,
  author="Ye, Shuchang
  and Meng, Mingyuan
  and Li, Mingjian
  and Feng, Dagan
  and Kim, Jinman",
  title="Enabling Text-Free Inference in Language-Guided Segmentation of Chest X-Rays via Self-guidance",
  booktitle="Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="242--252",
  isbn="978-3-031-72111-3"
}