Impact statement
Initially, this review discusses existing approaches in automated renal cancer diagnosis, and methods across broader AI research, to summarise the existing state of AI cancer analysis. Then, this review matches these methods to the unique constraints of early renal cancer detection and proposes promising directions for future research that may enable AI-based early renal cancer detection via CT screening.
Introduction
In 2017, 393,000 renal cancer (RC) diagnoses and 139,000 RC deaths were recorded worldwide (Fitzmaurice et al., 2019). Renal cell carcinoma (RCC), the most common cancer involving the kidney, is mostly discovered incidentally during routine health checks or in the assessment of unrelated symptoms, and patients with incidentally discovered RCC tend to have better health outcomes than those diagnosed with symptomatic RCC (Rabjerg et al., 2014; Vasudev et al., 2020). This is because symptom presentation is generally associated with later-stage progression (Rabjerg et al., 2014; Vasudev et al., 2020). As shown in Table 1, RC screening satisfies many of the 10 Wilson–Jungner criteria of an effective screening program (Wilson et al., 1968; Rossi et al., 2018); in principle, regular RC screening could improve general survival rates by increasing the rate of early RC discovery.
Note. Y: Yes, currently satisfied; ?: Unknown, more research is needed to clarify; N: No, currently unsatisfied.
However, there are significant challenges associated with deploying the current standard method for RC discovery, contrast-enhanced computed tomography (CECT; Ljungberg et al., 2015; Guidelines for the Management of Renal Cancer, 2016), in RC screening: the high cost of computed tomography (CT) screening (Beinfeld et al., 2005; Ishikawa et al., 2007; Jensen et al., 2020), the risks of routine radiation exposure (Hunink and Gazelle, 2003), the lack of a definite target screening population (Rossi et al., 2018), and the low incidence of RC in the general population (O’Connor et al., 2011, 2018). These facts undermine the cost-effectiveness of CT screening and its suitability for ongoing screening – Wilson–Jungner criteria 9 and 10, respectively. Nevertheless, recent literature has indicated that cancer screening with low-dose computed tomography (LDCT) may improve population health and studies are ongoing in this area (NLST, 2011; Black et al., 2014; Stewart, 2021). Furthermore, developments in artificial intelligence (AI) have enabled the automation of some radiological tasks that may reduce the cost of CT analysis.
Following these developments, this manuscript reviews AI technologies across automated RC diagnosis, other cancer domains, and broader computer vision to suggest novel research directions that may enable RC early detection in LDCT and non-contrast CT (NCCT), by automating and reducing the cost of analyses inherent to CT screening.
In this review, we define ‘early detection’ as the processes requisite in screening that detect early signs of disease in asymptomatic individuals. Image-based early detection and diagnosis may share many sub-processes, such as pre-processing, segmentation, radiomic feature extraction, post-processing, and classification. Within these sub-processes, segmentation and classification are the subjects of most machine learning research. Segmentation algorithms receive images as input and assign to them element-wise labels according to predefined semantic values, providing structure to images by highlighting the most salient regions of interest (ROI), making automated analyses simpler. An example of two-dimensional segmentation is shown in Figure 1. Classification refers to any process that assigns a discrete category to a data source; classification algorithms receive quantitative data (e.g., radiomic features, morphological measurements from a histology slide, or raw pixel data from an image) and assign a label to the data source; this label can be binary (malignant/benign) or multi-class (differentiating between RCC subtypes).
Early detection methods must be cheap to be viable in screening. They must also be accurate, to detect a high rate of the target disease whilst minimising the rate of overdiagnosis, which can dramatically increase screening costs. AI analyses are automated by default, making them cheap enough to be operationally viable in screening. Therefore, the development of an AI-based RC early detection system should focus on optimising the AI system’s accuracy to maximise the system’s utility in screening.
This manuscript reviews existing AI diagnostic methods that may be suitable for early detection, and suggests possible improvements to these existing methods, due to the lack of existing AI research in RC early detection. The literature reviewed in this manuscript was extracted from three different sources, namely (i) Kidney and Tumour Segmentation Challenge (KiTS) winning submissions; (ii) ImageNet (March 2022), including four contemporary, high-scoring algorithms and four other highly cited algorithms often used in medical AI, and (iii) renal segmentation and classification articles (Google Scholar, January 2015–March 2022). A list of all papers initially selected for reading, and then finally included, in this review can be found in the Supplementary Material. The review is complemented by highly cited articles from other early detection domains, that may represent novel approaches for conducting AI LDCT screening for RC, and the broader AI literature, including hyperparameter optimisation, multi-task learning (MTL), and synthetic image generation.
AI primer
AI refers to any computational, data-driven decision-making system that enables the automation of complex tasks – mimicking human intelligence – without explicit instruction. Machine-learning models are a subset of AI systems that automatically learn to structure and/or make predictions, or ‘inferences’, from data. Supervised learning models learn using labelled datasets – a set of paired inputs and labelled outputs. In segmentation, labelled datasets contain CT scans and volumes of corresponding voxel-wise labels for each scan. Supervised learning models review labelled data during ‘training’, iteratively assessing each sample and altering their own mathematical parameters to progressively improve inference accuracy. Following training, a supervised learning model’s accuracy is evaluated over an unseen, labelled ‘validation’ dataset, where the differences between the model’s inferences and the dataset’s labels determine the model’s overall accuracy. This manuscript exclusively reviews supervised machine-learning methods but, for brevity, ‘AI’ will be used as a general term for all models.
In classification and segmentation, the model’s responses can be categorised as true positive (TP), true negative (TN), false positive (FP), or false negative (FN). Accuracy metrics, such as sensitivity and specificity, are derived from the ratios of these response counts,
$$ \mathrm{sensitivity}=\frac{{n}_{TP}}{{n}_{TP}+{n}_{FN}},\qquad \mathrm{specificity}=\frac{{n}_{TN}}{{n}_{TN}+{n}_{FP}}, $$
where $ {n}_x $ refers to the number of $ x $ observed in validation. Optimum performance usually requires a trade-off between maximum specificity and maximum sensitivity; the area under the receiver operating characteristic curve (AUC) and the Dice similarity coefficient (DSC) are commonly used accuracy metrics that quantify the model’s trade-off between specificity and sensitivity. AUC is generated by plotting the model’s receiver operating characteristic (ROC; sensitivity against 1 − specificity) and calculating the area under its curve; an example ROC is shown in Figure 2 for the reader’s understanding. Segmentation performance is generally evaluated by the DSC metric, defined by
$$ \mathrm{DSC}=\frac{2{n}_{TP}}{2{n}_{TP}+{n}_{FP}+{n}_{FN}}. $$
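As a brief illustration, the standard count-based definitions of sensitivity, specificity and DSC can be computed directly from the four response counts. The sketch below uses hypothetical counts chosen only for the example; the function names are not taken from any cited work.

```python
def sensitivity(n_tp: int, n_fn: int) -> float:
    """Fraction of truly positive cases that the model labels positive."""
    return n_tp / (n_tp + n_fn)


def specificity(n_tn: int, n_fp: int) -> float:
    """Fraction of truly negative cases that the model labels negative."""
    return n_tn / (n_tn + n_fp)


def dice(n_tp: int, n_fp: int, n_fn: int) -> float:
    """Dice similarity coefficient from counts (true negatives are not used)."""
    return 2 * n_tp / (2 * n_tp + n_fp + n_fn)


# Hypothetical validation counts, for illustration only.
n_tp, n_tn, n_fp, n_fn = 80, 90, 10, 20

print(f"sensitivity = {sensitivity(n_tp, n_fn):.3f}")
print(f"specificity = {specificity(n_tn, n_fp):.3f}")
print(f"DSC         = {dice(n_tp, n_fp, n_fn):.3f}")
```

Note that DSC ignores true negatives, which is one reason it suits segmentation, where background voxels usually vastly outnumber the region of interest.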
Contemporary AI algorithms in image analysis tend to comprise convolutional neural networks (CNN) and/or transformers. This manuscript will not discuss the technical differences between these models, beyond the functional differences that exist with respect to their typical performance and cost characteristics. Both are deep learning (DL) algorithms, meaning they are both types of neural network. The cost of CNNs scales linearly with the number of input image elements, whereas transformer cost scales quadratically, making transformer-only models much costlier during analyses of 3D images, such as in CT. Transformers can achieve ‘global’ attention and detect patterns across whole input images simultaneously, whereas CNNs can only achieve ‘local’ pattern recognition, as they must divide input images into smaller sections and analyse them individually. This leads to superior image analysis in transformer models where patterns have global interdependencies. The performance of CNN- and transformer-based models will be reviewed in this manuscript, as well as hybrid models that attempt to combine the benefits of both approaches.
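To make the scaling argument concrete, the following sketch compares how a simplified cost model grows when each side of a cubic 3D input is doubled. It assumes, purely for illustration, that convolutional cost is proportional to the number of elements multiplied by a fixed kernel volume, and that self-attention cost is proportional to the number of pairwise element interactions; the input sizes and the kernel volume are hypothetical.

```python
def conv_cost(n_elements: int, kernel_volume: int = 27) -> int:
    """Simplified convolution cost: each element is visited once per kernel weight."""
    return n_elements * kernel_volume


def attention_cost(n_elements: int) -> int:
    """Simplified self-attention cost: every element interacts with every other."""
    return n_elements ** 2


small = 32 ** 3  # elements in a hypothetical 32 x 32 x 32 input
large = 64 ** 3  # the same input with each side doubled

print("element ratio:  ", large // small)
print("conv cost ratio:", conv_cost(large) // conv_cost(small))
print("attention ratio:", attention_cost(large) // attention_cost(small))
```

Under these assumptions, doubling each side multiplies the element count and the convolutional cost by 8, but the attention cost by 64, which is why transformer-only models become expensive on volumetric data.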
AI in renal cell carcinoma diagnosis
Segmentation
Renal segmentation has received increased research attention following the advent of KiTS, first established in 2019 (Heller et al., 2019) and renewed in 2021. KiTS19 and KiTS21 publicly released 210 CECT volumes and 300 CECT volumes, respectively, where all CT scans contained tumours and some contained cysts, and invited participants to submit their renal segmentation algorithms to compete in a fair assessment of accuracy. KiTS19’s winner, based on nnU-Net (Isensee and Maier-Hein, 2019; Isensee et al., 2021), was derived from the state-of-the-art segmentation CNN U-Net (Ronneberger et al., 2015) and focused on the optimisation of its hyperparameters (properties relating to training and model size) without altering the essential structure of U-Net. This approach represented a break from the hitherto standard practice in segmentation research of proposing modular architectural changes to U-Net for marginal accuracy gains. Outside of KiTS, nnU-Net scored highly in a wide variety of segmentation domains, winning other medical segmentation competitions across multiple organ sites (Isensee et al., 2021), demonstrating the primacy of hyperparameter optimisation in maximising segmentation performance.
All of KiTS21’s top-7 performing submissions made direct use of nnU-Net as a baseline algorithm. The top-3 submissions used nnU-Net’s ‘coarse-to-fine’ cascade approach. In this approach, a ‘coarse’ U-Net segments the input CT images at a low resolution to dictate an initial ROI; then, this segmentation inference is refined at higher resolutions by more ‘fine’ U-Nets. This process is repeated until the ROIs are labelled at full resolution. Figure 3 shows the performance distributions of KiTS19 and KiTS21’s top-7 submissions in renal segmentation – the adoption of nnU-Net significantly increased the mean mass-segmentation DSC among top performers ($ p=3.58\times {10}^{-5} $) from 0.832 in KiTS19 to 0.870 in KiTS21 (Challenge Leaderboard, 2019; KiTS21, 2021). To the authors’ knowledge, no other kidney segmentation algorithm has significantly improved upon KiTS21’s competition-winning nnU-Net-based approach (Zhao et al., 2022).
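The control flow of a coarse-to-fine cascade can be sketched on a toy 2D image. This is not nnU-Net’s implementation: simple intensity thresholding stands in for both the coarse and fine U-Nets, max-pooling stands in for resampling, and all function names and values are hypothetical. The sketch only shows how a low-resolution pass restricts the region that the full-resolution pass must label.

```python
def downsample(image, factor):
    """Max-pool a 2D list-of-lists image; dimensions must be divisible by factor."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def threshold_segment(image, threshold):
    """Stand-in 'model': label an element 1 if its intensity reaches the threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in image]


def bounding_box(mask):
    """Return (row0, row1, col0, col1), end-exclusive, around labelled elements."""
    hits = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not hits:
        return None
    rs = [r for r, _ in hits]
    cs = [c for _, c in hits]
    return min(rs), max(rs) + 1, min(cs), max(cs) + 1


def coarse_to_fine(image, factor=2, threshold=0.5):
    """Coarse pass proposes an ROI; fine pass labels only inside that ROI."""
    full_mask = [[0] * len(image[0]) for _ in image]
    box = bounding_box(threshold_segment(downsample(image, factor), threshold))
    if box is None:
        return full_mask
    r0, r1, c0, c1 = (b * factor for b in box)  # map ROI back to full resolution
    for r in range(r0, r1):
        for c in range(c0, c1):
            full_mask[r][c] = 1 if image[r][c] >= threshold else 0
    return full_mask


# Hypothetical 4 x 4 'scan' with one bright region in the lower-right corner.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.2],
    [0.0, 0.0, 0.8, 0.7],
]
for row in coarse_to_fine(image):
    print(row)
```

In a real cascade the benefit comes from the fine model seeing only a cropped ROI at high resolution, which keeps memory use manageable for large CT volumes.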
Manual NCCT screening exhibits potential as a medium for RC early detection (O’Connor et al., 2011, 2018), yet there has been little supporting research in NCCT segmentation that may assist the automation of NCCT screening. LDCT and NCCT images are significantly noisier and less differentiated than CECT images, respectively, making target organs harder for AI algorithms to distinguish. Transference of segmentation algorithms between CECT and NCCT or LDCT may be non-trivial due to the differences in image quality; thus, new work must quantify the performance of segmentation within NCCT images, to verify the suitability of segmentation-based RC early detection in NCCT.
Classification
Renal classification algorithms generally fall into one of the following characterisations: DL-based (Han et al., 2019; Tabibu et al., 2019; Fenstermaker et al., 2020; Oberai et al., 2020; Pedersen et al., 2020; Tanaka et al., 2020; Zabihollahy et al., 2020; Uhm et al., 2021), feature analysis-based (Hodgdon et al., 2015; Schieda et al., 2015; Feng et al., 2018; Kocak et al., 2018; Lee et al., 2018; Schieda et al., 2018; Varghese et al., 2018; Erdim et al., 2020; Ma et al., 2020; Sun et al., 2020; Wang et al., 2021), or a hybrid approach (Lee et al., 2018; Tabibu et al., 2019). The higher inference time and cost of DL-based algorithms compared to feature-based algorithms are undesirable, but DL-based approaches tend to be more accurate.
DL-based classification approaches generally use ‘fine-tuned’ versions of pretrained CNN classifiers (such as ResNet, He et al., 2016; VGG, Simonyan and Zisserman, 2015; or Inception, Szegedy et al., 2016). Fine-tuning in this context means retraining an existing pretrained model to operate effectively in a new domain. This approach minimises the need for domain-specific labelled images (and, therefore, minimises labelling), and provides good classification performance. Feature-based algorithms operate on predetermined ROIs – image sections segmented by a radiologist or AI algorithm – and use radiomic and/or DL-derived features, which describe relationships in the local distribution of CT intensities, to classify disease.
DL-based classifiers can achieve high accuracy in CT images with very little manual intervention. Tanaka et al. (2020) sought to quantify small (≤4 cm) renal mass detection accuracy in CT using axial CT slices and a fine-tuned InceptionV3 CNN; they differentiated malignant and benign masses with a maximum AUC of 0.846 in CECT and 0.562 in NCCT. Pedersen et al. (2020) trained a similar 2D slice-classifying CNN, but used it to classify each slice within each known mass’s 3D volume to enable a slice-based voting system to differentiate patient-level RC from oncocytoma, returning a validation accuracy of 100%. Han et al. (2019) sought to differentiate between clear cell RCC (ccRCC) and non-ccRCC from known RCC masses, using radiologist-selected axial CT slices from NCCT and two CECT phases, and achieved sub-type classification AUCs between 0.88 and 0.94 in an internal testing dataset.
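A slice-based voting system of the kind described for Pedersen et al. can be sketched as a simple majority vote over per-slice predictions. The exact voting rule used in that study is not stated here, so plain majority voting is an assumption, and the labels below are hypothetical.

```python
from collections import Counter


def patient_level_label(slice_labels):
    """Return the most frequent per-slice label as the patient-level label.

    Assumption: simple majority voting. On a tie, Counter.most_common keeps
    first-encountered order, so the label seen first wins.
    """
    if not slice_labels:
        raise ValueError("at least one slice prediction is required")
    label, _count = Counter(slice_labels).most_common(1)[0]
    return label


# Hypothetical per-slice predictions across one mass's 3D volume.
slices = ["RC", "RC", "oncocytoma", "RC", "oncocytoma"]
print(patient_level_label(slices))
```

Aggregating over slices in this way lets a 2D classifier produce a single decision per mass, and makes the final label less sensitive to a few misclassified slices.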
Classification has also been performed with the following feature-based supervised learning models: support vector machines (SVM; Hodgdon et al., 2015; Schieda et al., 2015; Kocak et al., 2018; Erdim et al., 2020; Sun et al., 2020), multi-layer perceptrons (MLP; Kocak et al., 2018; Erdim et al., 2020), logistic regressions (LR; Hodgdon et al., 2015; Schieda et al., 2015; Schieda et al., 2018; Varghese et al., 2018; Ma et al., 2020; Wang et al., 2021), and decision tree methods (DT; Lee et al., 2018; Erdim et al., 2020). Some feature-based models have shown superior diagnostic performance to expert radiologists: Hodgdon et al.’s (2015) SVM-based approach classified RC in NCCT images with an AUC of around 0.85; this was much greater than the radiologists’ AUCs of 0.65 and 0.74.
Sun et al.’s (2020) ‘radiologic-radiomic’ SVM model, where ‘radiologic’ refers to human-derived radiographic features and ‘radiomic’ refers to machine-derived radiographic features, differentiated RCC subtypes from benign masses. Sun et al. (2020) reported their accuracies in DSC, achieving an average of 88.3% DSC, improving upon the 78.2% average expert radiologist’s DSC (individual radiologists varied between 73.2 and 84.1%).
Across the RC classification literature, the interaction between feature analysis and DL models is limited. Tabibu et al.’s (2019) classification pipeline sends patches of histopathological images to two CNNs – one CNN classifies each patch as benign/malignant, and the other generates features that are used to differentiate between RCC subtypes in a three-class SVM. In internal validation, performing classification on histopathological images, this method achieved up to 0.99 patch-wise malignancy-identification AUC, and 0.93 subtype-identification AUC. Lee et al.’s (2018) approach concatenated radiomic features with a CNN output, both evaluated over a pre-segmented ROI in a CT image, and fed this concatenation to a DT classifier that differentiated angiomyolipoma without visible fat from RC with up to 0.816 AUC.
Object detection has rarely been applied to renal mass detection in CT (Yan et al., 2018; Xiong et al., 2019; Zhang et al., 2019). Zhang et al.’s (2019) renal lesion detector showed a mass-level detection AUC of 0.871 in CECT; they did not compare this performance to expert radiologist performance over the same validation dataset. As in segmentation, the reduced image quality of NCCT may present issues for AI lesion detection algorithms; thus, to ensure suitability in early detection, work must be done to quantify object detection performance in NCCT.
MTL and synthetic image generation
AI has been used to support RC diagnosis in other ways, including MTL and synthetic image generation (SIG). SIG aims to create new images that mimic the appearance of authentic medical images. In RC, SIG has been used to improve segmentation performance (roughly 0.5% DSC improvement, Jin et al., 2021) by synthetically expanding the size of labelled training datasets, and shows promise in improving classification performance by synthetically transferring images to more diagnostically useful domains, such as from NCCT to CECT (Liu et al., 2020; Sassa et al., 2022). However, to the authors’ knowledge, no research has quantified the improvement in RC classification performance directly attributable to synthetic domain transfer between NCCT and CECT. MTL has been used in RC evaluation to combine learning from multiple tasks, such that they simultaneously contribute towards model training – Ruan et al. (2020) noted a 3% segmentation DSC improvement following MTL, and Pan et al. (2019) noted how classification and segmentation performance scores were both individually improved when trained together in MTL.
Alternate methods of using medical AI
Alternate detection paradigms
Rather than removing the need for pathologist personnel in screening, Gehrung et al.’s (2021) AI approach generated a proxy ‘confidence’ rating to triage patients suspected of having Barrett’s oesophagus, a precancerous state for oesophageal cancer. Their AI detected ‘indeterminate’ cases and sent these to an expert pathologist, whilst accurately assigning classifications to ‘clear’ cases. Gehrung et al.’s (2021) triage approach was rigorously assessed across multiple validation datasets and was estimated to reduce pathologist workloads by 57% without a reduction in accuracy, improving the cost-effectiveness of screening. As in Barrett’s oesophagus, triaging AI may be practicable in LDCT RC screening and improve the process’ cost-effectiveness (Wilson–Jungner criterion 9, Table 1).
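The triage idea can be sketched as a two-threshold rule on a model’s output probability: confident cases are classified automatically and the rest are referred to an expert. This is not Gehrung et al.’s method; the thresholds, labels, and probabilities below are hypothetical and chosen only to show how a referral fraction, and hence a workload estimate, might be computed.

```python
def triage(prob_positive: float, low: float = 0.1, high: float = 0.9) -> str:
    """Route a case using hypothetical confidence thresholds."""
    if prob_positive <= low:
        return "auto-negative"
    if prob_positive >= high:
        return "auto-positive"
    return "refer"  # indeterminate: send to an expert reader


# Hypothetical model outputs for five screened cases.
probabilities = [0.02, 0.95, 0.50, 0.08, 0.70]
decisions = [triage(p) for p in probabilities]
referred_fraction = decisions.count("refer") / len(decisions)

print(decisions)
print(f"fraction referred to an expert: {referred_fraction:.2f}")
```

In practice the thresholds would need to be set on validation data so that the automatically classified cases retain acceptable accuracy, since widening the ‘refer’ band trades workload reduction for safety.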
Khosravan et al. (2019) found that humans tend to have higher specificity and AI algorithms tend to have higher sensitivity in NCCT lung cancer detection; in response, they constructed a ‘complementary’ computer-aided diagnosis system to bridge the performance gap between radiologists and AI. Khosravan et al.’s (2019) system let a radiologist evaluate an input NCCT image while the AI system automatically segmented and classified each gaze-deduced region of interest, generated from the radiologist’s eye movements. This study did not specify the improvement in cancer detection, or workload reduction, directly attributable to their software, instead reporting only the performance of segmentation (91% DSC) and classification (97% accuracy; AUC not reported).
Object detection in AI cancer detection
Ardila et al. (2019) used an object-detection algorithm to identify lung nodules in NCCT with high accuracy, allowing a patient-level early cancer detection AUC of 0.944. Welikala et al. (2020) used an object detection algorithm to identify oral lesions in plain photographic images of the oral cavity, allowing patient-level cancer classification, and achieving a patient-level classification DSC between 78 and 87% (AUC not reported). Nguyen et al. (2022) proposed a circular ‘bounding-box’ object detection algorithm for general biological purposes, as certain biological structures, such as cells, masses, and some organs, tend to be more circular/spherical than rectangular/cuboidal. They showed that their ‘CircleNet’ object-detection algorithm had overall superior performance to other state-of-the-art algorithms in detecting nuclei and glomeruli.
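Detections are commonly scored by intersection over union (IoU) between a predicted and a reference region, and the same idea carries over to circular detections. The sketch below computes IoU for two circles from elementary geometry; it is an illustrative implementation assuming positive radii, not code from CircleNet, and the example circles are hypothetical.

```python
import math


def circle_iou(c1, c2):
    """IoU of two circles, each given as (x, y, radius) with radius > 0."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)  # distance between centres
    area1, area2 = math.pi * r1 ** 2, math.pi * r2 ** 2

    if d >= r1 + r2:
        inter = 0.0  # circles do not overlap
    elif d <= abs(r1 - r2):
        inter = min(area1, area2)  # one circle lies inside the other
    else:
        # Lens-shaped overlap: sum of two circular segments.
        alpha = math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
        beta = math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
        inter = (r1 * r1 * (alpha - math.sin(2 * alpha) / 2)
                 + r2 * r2 * (beta - math.sin(2 * beta) / 2))

    return inter / (area1 + area2 - inter)


# Hypothetical predicted and reference detections.
print(circle_iou((0, 0, 2), (0, 0, 1)))   # smaller circle fully inside larger
print(circle_iou((0, 0, 1), (1, 0, 1)))   # partial overlap
print(circle_iou((0, 0, 1), (5, 0, 1)))   # disjoint circles
```

A circle needs only a centre and a radius, so it has fewer parameters than a rectangular box and its overlap with a reference circle does not change when the image is rotated.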
Synthetic image generation
Santini et al.’s (2018) DL workflow synthetically enhances NCCT images, promoting them to pseudo-CECT, to enable accurate estimation of patient cardiac volumes. Santini et al. (2018) demonstrated the efficacy of this method by highlighting the segmentation improvement associated with synthetic CECT generation; their framework, performing segmentation over synthetic CECTs, was more accurate than a human over an equivalent set of NCCTs (DSC of 0.89 and 0.85, respectively). Hu et al. (2022) built a generative adversarial network (GAN) to generate realistic synthetic CECT images that improve the conspicuity of abdominal aortic aneurysms in NCCT images. Their GAN made use of U-Net to generate synthetic CECT images, and was trained in MTL – using vascular structure segmentation as an auxiliary task to boost the performance of CECT generation. Hu et al. (2022) found that their GAN outperformed stand-alone U-Net, and other SIG algorithms such as pix2pix (Isola et al., 2017) and MW-CNN (Liu et al., 2018), in terms of average validation error and signal-to-noise ratio. Qualitatively, Hu et al. (2022) showed that the noise produced in U-Net-based NCCT to CECT translation is minimised by its incorporation into a GAN. Hu et al. (2022) did not quantify the improvement in aneurysm detection directly attributable to their synthetic CT enhancement, but they did determine case-level aneurysm detection DSC to be 85%.
Emergent ideas across AI and computer vision
Segmentation
Yang et al. (2022) found that exhaustive hyperparameter optimisation of large AI models, such as CNNs and transformers, is possible – they showed neural networks over a very large range of sizes can share common optimal hyperparameters if they are initialised ‘correctly’. This correct initialisation allows grid-search-based objective hyperparameter optimisation, which nnU-Net established as primarily important in segmentation. Also, the intrinsic locality of convolutional operations in CNNs may limit U-Net’s performance in segmentation tasks with global pattern dependencies. Introducing transformers, capable of global attention and understanding the relationships between all input data, to the U-Net architecture may allow the model to ‘see’ much larger volumes during segmentation, which may improve segmentation accuracy. TransU-Net and UNETR both implemented transformers into U-Net’s CNN architecture and significantly improved upon U-Net’s segmentation performance in multi-organ segmentation tasks (Chen et al., 2021; Hatamizadeh et al., 2022).
Classification
Following the introduction of transformers (Vaswani et al., 2017; Dosovitskiy et al., 2020), a new generation of state-of-the-art classifiers, including ConvNeXt (Liu et al., 2022), Swin (Liu et al., 2021) and CoaT (Xu et al., 2021), has superseded the commonly used CNNs ResNet, VGG and Inception in terms of ImageNet classification accuracy. This new generation shows improved performance over the same tasks due to new training regimes, new hyperparameters and new architectures. ConvNeXt (which, like the previous generation of classifiers, is a pure CNN) adjusted its design to take advantage of insights from transformer models (Liu et al., 2022) and shows improved performance over the previous generation without incurring greater cost during inference.
Multi-task learning
Standley et al. (2020) assessed various methods of combining AI training regimes. They found that some ‘complex’ tasks, such as segmentation, require a greater number of training samples for optimal performance than other ‘simpler’ tasks, and that these more complex tasks’ performances would suffer if paired with a simple task in MTL. Standley et al. (2020) also found that some tasks seemed to consistently act as ‘auxiliaries’ – boosting the learning performance of the network for other tasks without ever performing significantly well themselves in MTL. Despite these findings, they found that the relationships between task pairings – that is, the tendency of tasks to help or hinder each other’s training during MTL – were not independent of the training setup, meaning MTL relationships between tasks cannot be completely generalised across models with distinct network architectures, hyperparameters, and training data.
Discussion
Renal segmentation has the potential to assist RC diagnosis – for example, accurately delineating tumour regions enables feature-based classification, which shows comparable, or superior, diagnostic performance to expert radiologists. Maximising renal segmentation accuracy in LDCT may enable accurate feature-based classification methods to be applied in LDCT early detection automatically, removing much of the manual labour of RC screening. High accuracy is essential in early detection methods; thus, given the accuracy of the feature-based classification methods in NCCT imaging (as in Hodgdon et al., 2015), a high-accuracy renal segmentation method for LDCT is likely to enable RC early detection screening.
Whilst nnU-Net established the primacy of hyperparameter optimisation in segmentation performance, it does not provide a framework for hyperparameter optimisation itself, instead relying on experimentally derived heuristics for hyperparameter selection. Using Yang et al.’s (Reference Yang, Hu, Babuschkin, Sidor, Liu, Farhi, Ryder, Pachocki, Chen and Gao2022) ‘maximal update parametrisation’ approach to hyperparameter optimisation may allow a more principled optimisation of CNNs and transformers, which should improve upon nnU-Net’s heuristics-led approach. Also, despite nnU-Net’s state-of-the-art inter-domain performance, the intrinsic locality of convolutional operations in U-Net’s purely convolutional architecture may limit its segmentation performance. Introducing transformers to U-Net’s architecture, as in TransU-Net, enables global attention mechanisms that may improve RC segmentation accuracy over a whole NCCT volume. Applying transformer-informed segmentation methods like TransU-Net, and objectively optimising their hyperparameters using ‘maximal update parametrisation’, may improve RC segmentation performance over existing nnU-Net-led approaches.
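The contrast between local convolution and global attention can be illustrated with a minimal single-head self-attention sketch. This is not TransU-Net's implementation: the function names, the random projection matrices and the token dimensions are all assumptions chosen for illustration. It only shows that the attention weight matrix is dense, so every patch token contributes to every output token, whereas a convolution sees a fixed local neighbourhood.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_query, w_key, w_value):
    """Single-head scaled dot-product self-attention over patch tokens.

    tokens has shape (n_tokens, d_model); the three weight matrices are
    hypothetical learned projections of shape (d_model, d_head).
    """
    queries = tokens @ w_query
    keys = tokens @ w_key
    values = tokens @ w_value
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # weights has shape (n_tokens, n_tokens): each row mixes ALL tokens,
    # which is the "global" receptive field referred to in the text.
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


# Illustrative usage with random placeholders instead of trained weights.
rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(6, 4))        # 6 image patches, 4 features each
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
attended, attention_weights = self_attention(patch_tokens, w_q, w_k, w_v)
```

In a hybrid design of the kind described above, tokens like these would typically come from convolutional feature maps, so the local and global mechanisms are combined rather than swapped.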
Given the potential for RC early detection in LDCT, there is a need for more research quantifying RC segmentation performance in LDCT. Investigations into general NCCT segmentation have shown that using synthetic contrast enhancement as an auxiliary training task in MTL can improve segmentation accuracy. Therefore, renal LDCT segmentation may be improved by introducing synthetic enhancement to CECT as an auxiliary learning task in MTL. Such an investigation would likely be complicated by Standley et al.’s (Reference Standley, Zamir, Chen, Guibas, Malik and Savarese2020) finding that MTL task relationships can be unique to each configuration of network architecture, hyperparameters, and dataset domain.
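One common way to combine a primary segmentation task with an auxiliary synthetic-enhancement task is a weighted sum of per-task losses. The sketch below assumes a soft Dice loss for segmentation and a mean absolute error for the synthetic CECT output; these loss choices, the function names and the `aux_weight` value are illustrative assumptions, not the configuration used in any of the cited studies.

```python
import numpy as np


def soft_dice_loss(pred_mask, true_mask, eps=1e-6):
    """1 minus the soft Dice overlap between predicted probabilities and a binary mask."""
    intersection = np.sum(pred_mask * true_mask)
    denominator = np.sum(pred_mask) + np.sum(true_mask)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def l1_loss(pred_image, true_image):
    """Mean absolute intensity error between synthetic and real enhanced images."""
    return float(np.mean(np.abs(pred_image - true_image)))


def multitask_loss(pred_mask, true_mask, pred_cect, true_cect, aux_weight=0.5):
    """Primary segmentation loss plus a down-weighted auxiliary enhancement loss.

    aux_weight is a hypothetical hyperparameter; in line with the findings
    discussed above, a suitable value is likely to depend on the architecture
    and dataset.
    """
    primary = soft_dice_loss(pred_mask, true_mask)
    auxiliary = l1_loss(pred_cect, true_cect)
    return primary + aux_weight * auxiliary
```

Setting `aux_weight` to zero recovers single-task segmentation training, which gives a simple baseline against which the benefit of the auxiliary task can be measured.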
As with segmentation, the lack of research quantifying RC object detection performance in LDCT represents a gap in the literature. Object detection and classification performance could be improved by the introduction of the new generation of transformer-inspired classifiers, which consistently show higher classification accuracies than their predecessors. Also, assessing the MTL relationship between classification, segmentation, and object detection in RC early detection may lead to improved mass detection performance and, therefore, improved early detection performance.
Pedersen et al.’s (Reference Pedersen, Andersen, Christiansen and Azawi2020) and Gehrung et al.’s (Reference Gehrung, Crispin-Ortuzar, Berman, O’Donovan, Fitzgerald and Markowetz2021) approach of generating an image-based intra-patient biomarker voting system may be applicable to RC early detection. Both Pedersen et al. (Reference Pedersen, Andersen, Christiansen and Azawi2020) and Gehrung et al. (Reference Gehrung, Crispin-Ortuzar, Berman, O’Donovan, Fitzgerald and Markowetz2021) evaluated biomarker presence in fractionated tiles of input images and used the ratio of biomarker-positive to biomarker-negative tiles to classify the inputs, leading to high-accuracy results in validation. Applying an analogous approach, using the new generation of classifiers, to the early detection of RC masses in LDCT could enable highly robust automated triaging, or diagnosis, for RC early detection screening programmes.
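The tile-voting idea can be sketched in a few lines. The example below is a minimal illustration rather than either group's published pipeline: `toy_tile_classifier` is a placeholder for a trained network, the tile size and vote threshold are assumed values, and the fraction of positive tiles is used as a monotonic stand-in for the positive-to-negative ratio.

```python
import numpy as np


def split_into_tiles(image, tile_size):
    """Split a 2D array into non-overlapping square tiles, dropping ragged edges."""
    rows, cols = image.shape
    tiles = []
    for r in range(0, rows - tile_size + 1, tile_size):
        for c in range(0, cols - tile_size + 1, tile_size):
            tiles.append(image[r:r + tile_size, c:c + tile_size])
    return tiles


def positive_tile_fraction(image, tile_classifier, tile_size):
    """Fraction of tiles that the tile-level classifier flags as positive."""
    tiles = split_into_tiles(image, tile_size)
    if not tiles:
        raise ValueError("image is smaller than one tile")
    votes = [bool(tile_classifier(tile)) for tile in tiles]
    return sum(votes) / len(votes)


def classify_by_vote(image, tile_classifier, tile_size=32, vote_threshold=0.1):
    """Label the whole image positive when enough tiles vote positive."""
    return positive_tile_fraction(image, tile_classifier, tile_size) >= vote_threshold


def toy_tile_classifier(tile, intensity_cutoff=0.5):
    """Hypothetical stand-in for a trained classifier: thresholds mean intensity."""
    return tile.mean() > intensity_cutoff
```

For an LDCT application, the tiles would presumably be patches or sub-volumes of the scan and the vote threshold would need to be tuned on validation data, trading sensitivity against the number of scans referred for review.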
Conclusion
This manuscript highlights and summarises existing AI methods in RC diagnosis and suggests how these can be repurposed to enable RC early detection. After summarising existing segmentation, classification, and other AI methods in RC diagnosis, a review of analogous cancer detection and diagnosis methods across broader cancer literature and computer vision was conducted. Contrasting the RC-specific workflows with their equivalents across computer vision and other cancer domains allowed the generation of novel RC-specific research proposals that may enable AI-based RC early detection.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/pcm.2022.9.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/pcm.2022.9.
Financial support
This work was supported by the International Alliance for Cancer Early Detection, a partnership between Cancer Research UK (C14478/A27855), Canary Center at Stanford University, the University of Cambridge, OHSU Knight Cancer Institute, University College London and the University of Manchester. This work was also supported by the CRUK National Cancer Imaging Translational Accelerator (NCITA) (C42780/A27066), and The Mark Foundation for Cancer Research and Cancer Research UK (CRUK) Cambridge Centre (C9685/A25177). Additional support has been provided by the Wellcome Trust Innovator Award, UK (215733/Z/19/Z) and the National Institute of Health Research (NIHR) Cambridge Biomedical Research Centre (BRC-1215-20014). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Competing interest
The authors of this manuscript declare relationships with the following companies: E.S. is a co-founder and shareholder of Lucida Medical Ltd. L.E.S. has received consulting fees from Lucida Medical Ltd. The remaining authors declare that they have no conflicts of interest.
Comments
Dear Mrs Vance,
As we recently discussed via email, we are happy to submit the invited review entitled "New Approaches To the Early Detection of Renal Cancer with Artificial Intelligence in Computed Tomography" for publication in your journal, Cambridge Prisms: Precision Medicine.
The attached review is the result of an interdisciplinary collaboration between the University of Cambridge's Departments of Oncology, Radiology, and Applied Mathematics and Theoretical Physics, and it attempts to lay the scholarly foundation for the development of AI in renal cancer early detection. The development of AI tools that can automate CT analysis is thought to be vital for reducing the cost of renal cancer screening, and the success of such AI development is likely to play a decisive role in enabling renal cancer screening via CT.
We hope this review will facilitate further interdisciplinary research between radiologists, oncologists, and data scientists in the early detection of renal cancer. Initially, this review discusses existing approaches in automated renal cancer diagnosis, and methods across broader AI research, to summarise the existing state of AI in cancer analysis. We then match these methods to the unique constraints of early renal cancer detection and propose promising directions for future research that may enable AI-based early renal cancer detection via CT screening.
The primary targets of this review are clinicians with an interest in AI and data scientists with an interest in the early detection of cancer.
Thank you for your consideration, and we look forward to hearing back from you.
Yours Sincerely,
William McGough, for the authors