TO THE EDITOR: We read with great interest the article by Hopkins et al.,1 in which the authors report on a predictive model for 30-day readmission after posterior lumbar fusion (Hopkins BS, Yamaguchi JT, Garcia R, et al. Using machine learning to predict 30-day readmissions after posterior lumbar fusion: an NSQIP study involving 23,264 patients. J Neurosurg Spine. 2020;32(3):399–406). The authors had access to 23,264 patients for training of their model, a deep neural network; 1199 (5.15%) of these patients were readmitted. The model was evaluated internally using 20-fold cross-validation, yielding a mean area under the curve (AUC) of 0.812, with a sensitivity of 35.5% and a mean specificity of 99.5%. We highly commend the authors on their work in applying machine learning to this clinically important question. However, the authors did not assess the calibration of their predictive model.
When evaluating machine learning models for diagnosis or prediction of binary outcomes, two dimensions of performance need to be considered. The first is discrimination, a model's ability to make correct binary predictions, which is commonly assessed using AUC, accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and the F1 score. The second is calibration, the degree to which a model's predicted probability (ranging from 0% to 100%) corresponds to the actually observed incidence of the binary endpoint; it is commonly assessed using calibration curves, the calibration slope and intercept, the Brier score, the expected/observed ratio, and the Hosmer-Lemeshow test.2
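To make these quantities concrete, the brief Python sketch below shows how the AUC, the Brier score, and the calibration slope and intercept could be computed for a set of predicted readmission probabilities. The data are simulated and the variable names (`y_prob`, `y_true`) are hypothetical; this is illustrative only and not taken from the study in question.

```python
# Illustrative sketch with simulated data (not the study's data).
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.01, 0.30, size=5000)   # hypothetical predicted readmission risks
y_true = rng.binomial(1, y_prob)              # hypothetical observed readmissions

# Discrimination: area under the ROC curve
auc = roc_auc_score(y_true, y_prob)

# Calibration: Brier score (mean squared difference between predicted
# probabilities and observed outcomes; lower is better)
brier = brier_score_loss(y_true, y_prob)

# Calibration slope and intercept: logistic regression of the observed outcome
# on the logit of the predicted probability; a slope of 1 and an intercept of 0
# indicate perfect calibration in this recalibration model
logit = np.log(y_prob / (1 - y_prob))
fit = sm.Logit(y_true, sm.add_constant(logit)).fit(disp=0)
intercept, slope = fit.params

print(f"AUC={auc:.3f}  Brier={brier:.3f}  slope={slope:.2f}  intercept={intercept:.2f}")
```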
While discrimination is practically always reported, many publications do not report calibration. High discrimination measures and good calibration often coincide, as is likely the case in the abovementioned publication, but excellent discrimination does not necessarily imply that calibration is adequate.3 Deep neural networks are especially prone to poor calibration, often massively skewing predicted probabilities toward "extreme" values such as 1% and 99%.
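A toy illustration of this point, again with simulated data unrelated to the study in question: the AUC depends only on the ranking of the predicted probabilities, so a strictly monotone "sharpening" of well-calibrated predictions toward 0% and 100% leaves the AUC unchanged while calibration, here summarized by the Brier score, deteriorates markedly.

```python
# Illustrative sketch: simulated, well-calibrated risks are distorted toward the extremes.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(1)
p_cal = rng.uniform(0.01, 0.30, size=5000)    # hypothetical well-calibrated risks
y = rng.binomial(1, p_cal)                    # hypothetical outcomes drawn from those risks

# Strictly monotone sharpening of the logit: the ranking (and thus the AUC) is
# preserved, but the predicted probabilities are pushed toward 0 and 1
logit = np.log(p_cal / (1 - p_cal))
p_extreme = 1 / (1 + np.exp(-8 * (logit - np.log(0.10 / 0.90))))

print(roc_auc_score(y, p_cal), brier_score_loss(y, p_cal))          # same AUC, low Brier score
print(roc_auc_score(y, p_extreme), brier_score_loss(y, p_extreme))  # same AUC, high Brier score
```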
Especially in clinical practice, calibration is crucial. For clinicians and patients alike, a predicted probability (e.g., being able to tell the patient, "Your risk is 7%") is much more valuable than a binary yes/no prediction. Good calibration can often be attained by choosing a model whose complexity is appropriate to the classification task at hand, such as logistic regression or a generalized additive model. If poor calibration is observed and the pattern of miscalibration is consistent across resampling, recalibration techniques such as Platt scaling or isotonic regression can be applied.4 Lastly, models can also be trained primarily to optimize measures of calibration, and intercepts can be adjusted.5
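A minimal sketch of both recalibration approaches, using scikit-learn on simulated held-out predictions (the variable names are hypothetical), is given below: Platt scaling fits a logistic model on the logit of the raw predicted probabilities, whereas isotonic regression fits a monotone, nonparametric mapping from raw to recalibrated probabilities. In scikit-learn, CalibratedClassifierCV offers the same two options (method="sigmoid" or "isotonic") wrapped around an existing classifier.

```python
# Illustrative sketch: p_holdout and y_holdout are simulated held-out predictions and outcomes.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
p_holdout = rng.uniform(0.01, 0.99, size=2000)   # hypothetical raw model outputs
y_holdout = rng.binomial(1, p_holdout ** 2)      # hypothetical outcomes (model overestimates risk)

# Platt scaling: logistic regression on the logit of the raw predicted probabilities
logit = np.log(p_holdout / (1 - p_holdout)).reshape(-1, 1)
platt = LogisticRegression().fit(logit, y_holdout)

# Isotonic regression: monotone, nonparametric mapping to recalibrated probabilities
iso = IsotonicRegression(out_of_bounds="clip").fit(p_holdout, y_holdout)

# Apply either mapping to new raw predictions
p_new = np.array([0.05, 0.20, 0.60])
p_platt = platt.predict_proba(np.log(p_new / (1 - p_new)).reshape(-1, 1))[:, 1]
p_iso = iso.predict(p_new)
print(p_platt, p_iso)
```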
In conclusion, it is critically important to assess the calibration of clinical prediction models. At a minimum, a calibration curve together with the calibration slope and intercept should be reported for every published model.
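For completeness, a calibration curve of the kind we propose could be produced along the following lines (a Python sketch with simulated data; in practice, the model's own held-out predictions would be used): observed event rates within bins of predicted probability are plotted against the mean predicted probability per bin and should ideally fall on the 45° diagonal.

```python
# Illustrative sketch: y_prob and y_true are simulated predictions and outcomes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(3)
y_prob = rng.uniform(0.01, 0.30, size=5000)   # hypothetical predicted risks
y_true = rng.binomial(1, y_prob)              # hypothetical observed outcomes

# Observed event rate and mean predicted probability per quantile bin
obs, pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")

plt.plot(pred, obs, marker="o", label="model")
plt.plot([0, 0.35], [0, 0.35], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Observed incidence")
plt.legend()
plt.show()
```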
Disclosures
The authors report no conflict of interest.
References
1. Hopkins BS, Yamaguchi JT, Garcia R, et al. Using machine learning to predict 30-day readmissions after posterior lumbar fusion: an NSQIP study involving 23,264 patients. J Neurosurg Spine. 2020;32(3):399–406.
2. Debray TPA, Vergouwe Y, Koffijberg H, et al. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015;68:279–289.
3. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. http://arxiv.org/abs/1706.04599. Accessed January 30, 2020.
4. Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning, ICML '05. New York, NY: ACM; 2005:625–632. http://doi.acm.org/10.1145/1102351.1102430. Accessed January 30, 2020.
5. Janssen KJM, Moons KGM, Kalkman CJ, et al. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol. 2008;61:76–86.