Letter to the Editor. Importance of calibration assessment in machine learning–based predictive analytics

  • 1 Machine Intelligence in Clinical Neuroscience (MICN) Laboratory, University Hospital Zurich, Clinical Neuroscience Centre, University of Zurich, Switzerland
  • 2 RWTH Aachen University, Aachen, Germany

TO THE EDITOR: With great interest, we recently read the article by Hopkins et al.,1 in which the authors report on a predictive model for 30-day readmission after posterior lumbar fusion (Hopkins BS, Yamaguchi JT, Garcia R, et al. Using machine learning to predict 30-day readmissions after posterior lumbar fusion: an NSQIP study involving 23,264 patients. J Neurosurg Spine. 2020;32(3):399–406). They had access to 23,264 patients for training of their model, a deep neural network. Among these patients, 1199 (5.15%) were readmitted. The authors evaluated their model internally using 20-fold cross-validation and reported a mean area under the curve (AUC) of 0.812, a mean sensitivity of 35.5%, and a mean specificity of 99.5%. We highly commend the authors on their work in applying machine learning to this clinically important question. However, the authors do not assess the calibration of their predictive model.

When evaluating machine learning models for diagnosis or prediction of binary outcomes, two dimensions of performance need to be considered. First is discrimination, a model’s ability to make correct binary predictions, which is commonly assessed using AUC, accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and the F1 score. Second is calibration, the degree to which a model’s predicted probability (ranging from 0% to 100%) corresponds to the actually observed incidence of the binary endpoint; calibration is commonly assessed using calibration curves, the calibration slope and intercept, the Brier score, the expected/observed ratio, and the Hosmer-Lemeshow test.2
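As a brief illustration, the following minimal Python sketch (using scikit-learn on simulated placeholder data, not any real patient cohort) shows how discrimination and several of these calibration measures might be computed; y_true and y_prob are hypothetical arrays of observed outcomes and predicted probabilities.

```python
# Minimal sketch: discrimination vs. calibration metrics on placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.01, 0.99, size=1000)   # hypothetical predicted probabilities
y_true = rng.binomial(1, y_prob)              # hypothetical observed binary outcomes

# Discrimination
auc = roc_auc_score(y_true, y_prob)

# Calibration: Brier score and expected/observed ratio
brier = brier_score_loss(y_true, y_prob)
eo_ratio = y_prob.mean() / y_true.mean()

# Calibration slope and intercept: logistic regression of the outcome on the
# logit of the predicted probability (large C effectively disables regularization).
logit = np.log(y_prob / (1 - y_prob)).reshape(-1, 1)
lr = LogisticRegression(C=1e6).fit(logit, y_true)
slope, intercept = lr.coef_[0, 0], lr.intercept_[0]

print(f"AUC={auc:.3f}  Brier={brier:.3f}  E/O={eo_ratio:.2f}  "
      f"slope={slope:.2f}  intercept={intercept:.2f}")
```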

While discrimination is practically always reported, many publications do not report calibration. Although high discrimination and good calibration often coincide, as is likely the case in the abovementioned publication, excellent discrimination does not necessarily imply that a model’s predicted probabilities are well calibrated and thus clinically employable.3 Deep neural networks are especially prone to poor calibration, often massively skewing predicted probabilities toward “extreme” values such as 1% and 99%.
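One simple post hoc remedy for this behavior, described in the work by Guo et al. cited above, is temperature scaling: a single parameter T is fitted on held-out data and the network’s logits are divided by T before the sigmoid is applied. The sketch below is purely illustrative (it uses simulated logits, not any model from the publication under discussion).

```python
# Illustrative sketch of temperature scaling for a binary classifier.
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_temperature(logits, y_true):
    """Find T minimizing the negative log-likelihood of sigmoid(logits / T)."""
    def nll(T):
        p = np.clip(sigmoid(logits / T), 1e-12, 1 - 1e-12)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

rng = np.random.default_rng(0)
logits = rng.normal(-3.0, 3.0, size=2000)         # simulated overconfident logits
y_true = rng.binomial(1, sigmoid(logits / 2.0))   # simulated, less extreme "truth"

T = fit_temperature(logits, y_true)               # T > 1 softens extreme probabilities
calibrated_probs = sigmoid(logits / T)
```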

Especially in clinical practice, calibration is crucial. For clinicians and patients alike, a predicted probability (e.g., being able to say to the patient, “Your risk is 7%”) is much more valuable than a binary yes/no prediction. Good calibration can often be attained through application of machine learning models with appropriate complexity in relation to the classification task at hand, such as logistic regression or generalized additive models. If poor calibration is observed and the pattern of miscalibration is consistent during resampling, recalibration techniques such as Platt scaling or isotonic regression can be applied.4 Lastly, models can also be primarily trained for measures of calibration, and intercepts can be adjusted.5
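As a minimal sketch of such recalibration (not tied to any particular published model), scikit-learn’s CalibratedClassifierCV can wrap an arbitrary base classifier and apply Platt scaling or isotonic regression within cross-validation; the imbalanced toy data below are placeholders.

```python
# Illustrative sketch: Platt scaling and isotonic regression via scikit-learn.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy data (roughly 5% positives) standing in for a clinical cohort
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

base = RandomForestClassifier(random_state=0)

# method="sigmoid" performs Platt scaling; method="isotonic" fits isotonic regression.
# cv=5 refits the base model on training folds and recalibrates on the held-out folds.
platt = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X_train, y_train)
isotonic = CalibratedClassifierCV(base, method="isotonic", cv=5).fit(X_train, y_train)

recalibrated_probs = platt.predict_proba(X_test)[:, 1]
```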

In conclusion, it is critically important to assess the calibration of clinical prediction models. At a minimum, a calibration curve together with the calibration slope and intercept should be reported for every published model.

Disclosures

The authors report no conflict of interest.

References

  • 1 Hopkins BS, Yamaguchi JT, Garcia R, et al. Using machine learning to predict 30-day readmissions after posterior lumbar fusion: an NSQIP study involving 23,264 patients. J Neurosurg Spine. 2020;32(3):399–406.
  • 2 Debray TPA, Vergouwe Y, Koffijberg H, et al. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015;68:279–289.
  • 3 Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. http://arxiv.org/abs/1706.04599. Accessed January 30, 2020.
  • 4 Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning, ICML ’05. New York, NY: ACM; 2005:625–632. http://doi.acm.org/10.1145/1102351.1102430. Accessed January 30, 2020.
  • 5 Janssen KJM, Moons KGM, Kalkman CJ, et al. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol. 2008;61:76–86.
  • Northwestern University, Chicago, IL

Response

We greatly appreciate the feedback and the constructive remarks aimed at improving our study. The authors have touched on a key principle of medical predictive analytics that is both interesting and relevant. While binary outcomes are oftentimes of utmost importance, it is nearly impossible to perfectly classify complex, real-world data in a binary fashion. As such, we agree that calibration is a highly relevant aspect of all predictive analytical questions that is rarely addressed and deserves more attention.

While we did not initially record calibration data from our previously published data set, we have since gone back and rerun our model with this particular question in mind. After rerunning our model, we found our Brier score to be 0.045. Our calibration curve is shown in Fig. 1. As is typical for such curves, perfect calibration corresponds to alignment with the diagonal, whereas points above the diagonal reflect more conservative predictions and points below it more aggressive ones.1 Our model was initially trained with a final layer built on a probabilistic (sigmoid) output. Such models generally tend to have fewer issues with calibration than various other forms of classification models, such as support vector machines or other nonprobabilistic, nonlinear methods.2

FIG. 1.

Calibration curve for our model. Perfect calibration is suggested by following the diagonal dashed line. Observations above the diagonal suggest more conservative predictions, whereas those below the diagonal suggest more aggressive predictions. Figure is available in color online only.
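
For readers who wish to produce such a reliability diagram, the following is a purely illustrative sketch (not the analysis code used for our study); the arrays y_true and y_prob are simulated stand-ins for observed readmissions and predicted probabilities.

```python
# Illustrative sketch: reliability diagram and Brier score on simulated data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.beta(1, 15, size=5000)                      # low-prevalence predictions
y_true = rng.binomial(1, np.clip(y_prob * 1.2, 0, 1))    # slightly underestimated risk

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")
brier = brier_score_loss(y_true, y_prob)

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
plt.plot(mean_pred, frac_pos, "o-", label=f"Model (Brier = {brier:.3f})")
plt.xlabel("Predicted probability of readmission")
plt.ylabel("Observed readmission rate")
plt.legend()
plt.show()
```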

Interestingly, however, while we do note a clear discrepancy and separation between patients predicted to require readmission and those predicted not to, the model is least well calibrated for the true positives. As such, our model tends to predict readmissions conservatively (as reflected in our calibration curve [Fig. 1]), with mean predicted probabilities for true readmissions reaching roughly 36%. While this still allows for a large separation of classes from those predicted not to require readmission (whose mean predicted probabilities reach only approximately 7%), it does shed light on the fact that the variables used in our model likely play only a relatively small role in the overall recipe for readmission. This reflects the realistic nature of hospital readmissions: many external and confounding factors beyond observable clinical attributes likely contribute. Despite our model’s best discernment, it continues to predict readmissions, on a case-by-case basis, for even the strongest candidates at a modest probability of only 36% on average.

These probabilities, while relatively small, do in fact still provide clinical utility in flagging patients as potentially high risk. Discrimination continues to allow for quick, real-time risk stratification. Similarly, although 36% is a relatively modest probability for patients flagged as positive, readmissions themselves are very costly and problematic. As with the argument for preventive medicine, spending a minimal amount of resources upfront, despite a small underlying probability, oftentimes proves more cost-effective. Likewise, the large financial and emotional burden on patients required to return to the hospital is difficult for all parties involved and, whenever possible, should be prevented.

In conclusion, we agree with the need for greater awareness of calibration as it relates to predictive analytics, as it continues to be a relevant aspect of clinical management. However, while useful, calibration needs to be used cautiously and in the appropriate context, because failure to treat, or overtreatment, can be disproportionately costly should an unlikely event occur.

References

  • 1 Brownlee J. How and when to use a calibrated classification model with scikit-learn. https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/. Accessed January 30, 2020.
  • 2 Kull M, Silva Filho TM, Flach P. Beyond sigmoids: how to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electron J Stat. 2017;11(2):5052–5080.

Contributor Notes

Correspondence Victor E. Staartjes: victoregon.staartjes@usz.ch.

INCLUDE WHEN CITING Published online February 21, 2020; DOI: 10.3171/2019.12.SPINE191503.

Disclosures The authors report no conflict of interest.
