Improving class probability estimates in asymmetric health data classification: An experimental comparison of novel calibration methods

Olushina Olawale Awe
https://orcid.org/0000-0002-0442-4519
Babatunde Adebola Adedeji
https://orcid.org/0009-0002-8575-9499
Ronaldo Dias
https://orcid.org/0000-0002-0436-1159

Abstract

In health data classification, imbalanced and asymmetric class distributions can significantly degrade the performance of machine learning models. One critical aspect affected is the reliability of class probability estimates, which are crucial for informed decision-making in healthcare applications. Rather than predicting class labels directly, it is often more useful to predict the probability that an observation belongs to each possible class. This research addresses the challenges posed by imbalanced and asymmetric responses in health data classification by evaluating how effectively recent calibration methods improve class probability estimates. We propose Beta calibration techniques as novel calibration methods, and the stratified Brier score and Jaccard score as evaluation metrics. The experimental comparison implements and assesses several calibration techniques to determine their impact on model performance and calibration accuracy on simulated and healthcare datasets with varying imbalance ratios. Our results show that the Beta calibration method consistently improved the classifiers' predictive ability. The findings provide guidance on selecting a suitable calibration method for improving class probability estimates in healthcare-related machine learning tasks.
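As a rough illustration of why a class-stratified metric matters under imbalance, the sketch below computes the ordinary Brier score alongside per-class Brier scores for a toy binary sample. The abstract does not define the stratified Brier score, so this assumes the common per-class form (the mean squared error of the predicted probability, taken separately over positive and negative cases); the function names and the toy data are hypothetical and are not taken from the article.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Ordinary Brier score: mean squared error of predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    return float(np.mean((p - y) ** 2))


def stratified_brier_scores(y_true, p_pred):
    """Per-class Brier scores (assumed definition, see lead-in).

    Returns (bs_pos, bs_neg): the Brier score restricted to the
    positive cases and to the negative cases, respectively.
    """
    y = np.asarray(y_true, dtype=int)
    p = np.asarray(p_pred, dtype=float)
    pos = y == 1
    neg = y == 0
    # Guard against an empty class so the mean is never taken over nothing.
    bs_pos = float(np.mean((p[pos] - 1.0) ** 2)) if pos.any() else float("nan")
    bs_neg = float(np.mean(p[neg] ** 2)) if neg.any() else float("nan")
    return bs_pos, bs_neg


# Hypothetical toy sample: 2 positives, 8 negatives (imbalance ratio 1:4).
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p = [0.3, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

overall = brier_score(y, p)
bs_pos, bs_neg = stratified_brier_scores(y, p)
print(f"overall={overall:.3f}  positive={bs_pos:.3f}  negative={bs_neg:.3f}")
```

In this toy sample the overall score looks small because the majority class dominates the average, while the positive-class score exposes the poor probability estimates for the minority class.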

How to Cite
Awe, O. O., Adedeji, B. A., & Dias, R. (2024). Improving class probability estimates in asymmetric health data classification: An experimental comparison of novel calibration methods. Brazilian Journal of Biometrics, 42(3), 225–244. https://doi.org/10.28951/bjb.v42i3.684
