Challen, R. et al. Artificial intelligence, bias and clinical safety. BMJ Qual. Saf. 28, 231–237 (2019).
Nguyen, A., Yosinski, J. & Clune, J. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 427–436 (2015).
He, J. et al. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25, 30–36 (2019).
Kompa, B., Snoek, J. & Beam, A. L. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digit. Med. 4, 4 (2021).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proc. 34th Int. Conference on Machine Learning (PMLR) 70, 1321–1330 (2017).
Dyer, T. et al. Diagnosis of normal chest radiographs using an autonomous deep-learning algorithm. Clin. Radiol. 76, 473–473 (2021).
Dyer, T. et al. Validation of an artificial intelligence solution for acute triage and rule-out normal of non-contrast CT head scans. Neuroradiology 64, 735–743 (2022).
Liang, X., Nguyen, D. & Jiang, S. B. Generalizability issues with deep learning models in medicine and their potential solutions: illustrated with Cone-Beam Computed Tomography (CBCT) to Computed Tomography (CT) image conversion. Mach. Learn. Sci. Technol. 2, 015007 (2020).
Navarrete-Dechent, C. et al. Automated dermatological diagnosis: hype or reality? J. Invest. Dermatol. 138, 2277–2279 (2018).
Krois, J. et al. Generalizability of deep learning models for dental image analysis. Sci. Rep. 11, 6102 (2021).
Sathitratanacheewin, S., Sunanta, P. & Pongpirul, K. Deep learning for automated classification of tuberculosis-related chest X-ray: dataset distribution shift limits diagnostic performance generalizability. Heliyon 6, e04614 (2020).
Xin, K. Z., Li, D. & Yi, P. H. Limited generalizability of deep learning algorithm for pediatric pneumonia classification on external data. Emerg. Radiol. 29, 107–113 (2022).
Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15, e1002683 (2018).
Chen, J. S. et al. Deep learning for the diagnosis of stage in retinopathy of prematurity: accuracy and generalizability across populations and cameras. Ophthalmol. Retina 5, 1027–1035 (2021).
Jiang, H., Kim, B., Guan, M. & Gupta, M. To trust or not to trust a classifier. In Advances in Neural Information Processing Systems 31 (2018).
Geifman, Y. & El-Yaniv, R. Selectivenet: a deep neural network with an integrated reject option. In Proc. 36th Int. Conference on Machine Learning (PMLR) 97, 2151–2159 (2019).
Madras, D., Pitassi, T. & Zemel, R. Predict responsibly: improving fairness and accuracy by learning to defer. In Advances in Neural Information Processing Systems 31 (2018).
Kim, D. et al. Accurate auto-labeling of chest X-ray images based on quantitative similarity to an explainable AI model. Nat. Commun. 13, 1867 (2022).
Bernhardt, M. et al. Active label cleaning for improved dataset quality under resource constraints. Nat. Commun. 13, 1161 (2022).
Krause, J. et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology 125, 1264–1272 (2018).
Basha, S. H. S., Dubey, S. R., Pulabaigari, V. & Mukherjee, S. Impact of fully connected layers on performance of convolutional neural networks for image classification. Neurocomputing 378, 112–119 (2020).
Trabelsi, A., Chaabane, M. & Ben-Hur, A. Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics 35, i269–i277 (2019).
Boland, G. W. L. Voice recognition technology for radiology reporting: transforming the radiologist’s value proposition. J. Am. Coll. Radiol. 4, 865–867 (2007).
Heleno, B., Thomsen, M. F., Rodrigues, D. S., Jorgensen, K. J. & Brodersen, J. Quantification of harms in cancer screening trials: literature review. BMJ 347, f5334–f5334 (2013).
Dans, L. F., Silvestre, M. A. A. & Dans, A. L. Trade-off between benefit and harm is crucial in health screening recommendations. Part I: general principles. J. Clin. Epidemiol. 64, 231–239 (2011).
Peryer, G., Golder, S., Junqueira, D. R., Vohra, S. & Loke, Y. K. in Cochrane Handbook for Systematic Reviews of Interventions (eds Higgins, J. P. et al.) Ch. 19, 493–505 (John Wiley & Sons, 2011).
Kruschke, J. K. in The Cambridge Handbook of Computational Psychology (ed. Sun, R.) 267–301 (Cambridge Univ. Press, 2008).
Bowman, C. R., Iwashita, T. & Zeithamova, D. Tracking prototype and exemplar representations in the brain across learning. eLife 9, e59360 (2020).
Platt, J. C. in Advances in Large Margin Classifiers (eds Smola, A. J. et al.) (MIT Press, 1999).
Ding, Z., Han, X., Liu, P. & Niethammer, M. Local temperature scaling for probability calibration. In Proc. IEEE/CVF International Conference on Computer Vision 6889–6899 (2021).
Clinciu, M.-A. & Hastie, H. A survey of explainable AI terminology. In Proc. 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI) 8–13 (2019).
Biran, O. & Cotton, C. Explanation and justification in machine learning: a survey. In IJCAI-17 Workshop on Explainable Artificial Intelligence (XAI) 8, 8–13 (2017).