Authors:
(1) Shadab Ahamed, University of British Columbia, Vancouver, BC, Canada, BC Cancer Research Institute, Vancouver, BC, Canada. He was also a Mitacs Accelerate Fellow (May 2022 - April 2023) with Microsoft AI for Good Lab, Redmond, WA, USA (e-mail: [email protected]);
(2) Yixi Xu, Microsoft AI for Good Lab, Redmond, WA, USA;
(3) Claire Gowdy, BC Children’s Hospital, Vancouver, BC, Canada;
(4) Joo H. O, St. Mary’s Hospital, Seoul, Republic of Korea;
(5) Ingrid Bloise, BC Cancer, Vancouver, BC, Canada;
(6) Don Wilson, BC Cancer, Vancouver, BC, Canada;
(7) Patrick Martineau, BC Cancer, Vancouver, BC, Canada;
(8) François Bénard, BC Cancer, Vancouver, BC, Canada;
(9) Fereshteh Yousefirizi, BC Cancer Research Institute, Vancouver, BC, Canada;
(10) Rahul Dodhia, Microsoft AI for Good Lab, Redmond, WA, USA;
(11) Juan M. Lavista, Microsoft AI for Good Lab, Redmond, WA, USA;
(12) William B. Weeks, Microsoft AI for Good Lab, Redmond, WA, USA;
(13) Carlos F. Uribe, BC Cancer Research Institute, Vancouver, BC, Canada, and University of British Columbia, Vancouver, BC, Canada;
(14) Arman Rahmim, BC Cancer Research Institute, Vancouver, BC, Canada, and University of British Columbia, Vancouver, BC, Canada.
This study performs a comprehensive evaluation of four neural network architectures (UNet, SegResNet, DynUNet, and SwinUNETR) for lymphoma lesion segmentation from PET/CT images. These networks were trained, validated, and tested on a diverse, multi-institutional dataset of 611 cases. Internal testing (88 cases; total metabolic tumor volume (TMTV) range: [0.52, 2300] ml) showed SegResNet as the top performer with a median Dice similarity coefficient (DSC) of 0.76 and a median false positive volume (FPV) of 4.55 ml; all networks had a median false negative volume (FNV) of 0 ml. On the unseen external test set (145 cases; TMTV range: [0.10, 2480] ml), SegResNet achieved the best median DSC of 0.68 and FPV of 21.46 ml, while UNet had the best FNV of 0.41 ml. We assessed the reproducibility of six lesion measures, calculated their prediction errors, and examined DSC performance in relation to these lesion measures, offering insights into segmentation accuracy and clinical relevance. Additionally, we introduced three lesion detection criteria, addressing the clinical need for identifying lesions, counting them, and segmenting them based on metabolic characteristics. We also performed an expert intra-observer variability analysis, revealing the challenges of segmenting “easy” vs. “hard” cases, to assist in the development of more resilient segmentation algorithms. Finally, we performed an inter-observer agreement assessment, underscoring the importance of a standardized ground-truth segmentation protocol involving multiple expert annotators. Code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn.
Index Terms— Positron emission tomography, computed tomography, deep learning, segmentation, detection, lesion measures, intra-observer variability, inter-observer variability
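The abstract above reports segmentation performance in terms of the Dice similarity coefficient (DSC), false positive volume (FPV), and false negative volume (FNV). As a rough illustration of how such metrics can be computed from binary 3D masks, here is a minimal Python sketch; the function name segmentation_metrics, the voxel-level FPV/FNV definitions, and the voxel_volume_ml parameter are assumptions made for illustration, not the paper's exact implementation (which may, for instance, define FPV and FNV via connected lesion components).

```python
import numpy as np

def segmentation_metrics(pred, gt, voxel_volume_ml):
    """Return (DSC, FPV in ml, FNV in ml) for two binary 3D masks of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    # Dice similarity coefficient; the small epsilon guards against two empty masks.
    dsc = 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)
    # Voxel-level surrogates: predicted voxels outside the ground truth (FPV)
    # and ground-truth voxels missed by the prediction (FNV), scaled to millilitres.
    fpv = np.logical_and(pred, ~gt).sum() * voxel_volume_ml
    fnv = np.logical_and(~pred, gt).sum() * voxel_volume_ml
    return float(dsc), float(fpv), float(fnv)
```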
Fluorodeoxyglucose (18F-FDG) PET/CT imaging is the standard of care for lymphoma patients, providing accurate diagnoses, staging, and therapy response evaluation. However, traditional qualitative assessments, like Deauville scores [1], can introduce variability due to observer subjectivity in image interpretation. Quantitative PET analysis that incorporates lesion measures such as mean lesion standardized uptake value (SUVmean), total metabolic tumor volume (TMTV), and total lesion glycolysis (TLG) offers a promising path to more reliable prognostic decisions, enhancing our ability to predict patient outcomes in lymphoma with greater precision and confidence [2].
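For concreteness, the lesion measures named above can be computed from a SUV image and a binary lesion mask as sketched below; this is a minimal sketch under common definitions, and the function name lesion_measures and the voxel_volume_ml argument are illustrative rather than taken from the paper's code.

```python
import numpy as np

def lesion_measures(suv, mask, voxel_volume_ml):
    """Return (SUVmean, TMTV in ml, TLG in SUV*ml) for a SUV image and
    a binary lesion mask of the same shape."""
    mask = mask.astype(bool)
    if not mask.any():
        return 0.0, 0.0, 0.0
    suv_mean = float(suv[mask].mean())                # mean SUV over segmented voxels
    tmtv = float(mask.sum()) * voxel_volume_ml        # total metabolic tumor volume
    tlg = float(suv[mask].sum()) * voxel_volume_ml    # equals SUVmean * TMTV
    return suv_mean, tmtv, tlg
```

TLG is the product of SUVmean and metabolic tumor volume, which is why it can equivalently be computed by summing SUV over the mask and scaling by the voxel volume.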
Quantitative assessment in PET/CT imaging often relies on manual lesion segmentation, which is time-consuming and prone to intra- and inter-observer variability. Traditional thresholding-based automated techniques can miss low-uptake disease and produce false positives in regions of physiologically high radiotracer uptake. Deep learning therefore offers promise for automating lesion segmentation, reducing variability, increasing patient throughput, and potentially aiding in the detection of challenging lesions [3].
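To make the contrast with learned segmentation concrete, a conventional fixed-threshold pipeline of the kind referred to above might look like the sketch below; the SUV cutoff of 2.5 and the minimum component volume are common but purely illustrative choices, not values used in this work.

```python
import numpy as np
from scipy import ndimage

def threshold_segment(suv, suv_cutoff=2.5, min_volume_ml=0.5, voxel_volume_ml=1.0):
    """Keep voxels with SUV >= suv_cutoff, then drop connected components
    smaller than min_volume_ml (all parameter values are illustrative)."""
    mask = suv >= suv_cutoff
    labels, n_components = ndimage.label(mask)
    cleaned = np.zeros_like(mask)
    for label in range(1, n_components + 1):
        component = labels == label
        if component.sum() * voxel_volume_ml >= min_volume_ml:
            cleaned |= component
    return cleaned
```

Any disease below the cutoff is missed outright, while physiological uptake above it (e.g., brain or bladder) is segmented as lesion, which is precisely the limitation that motivates learned segmentation.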
Although promising, deep learning methods face challenges of their own. Convolutional neural networks (CNNs) require large, well-annotated datasets that can be difficult to obtain, and models trained on small datasets may not generalize. Moreover, lymphoma lesions vary significantly in size, shape, and metabolic activity, making it challenging to train deep networks accurately in the absence of well-defined priors. Deep learning aims to reduce observer variability, but inconsistent manual annotations used for training can perpetuate errors. Understanding these challenges is crucial to harnessing the full potential of these methods in quantitative PET/CT analysis.
This paper is available on arXiv under a CC 4.0 license.