Improving class probability estimates in asymmetric health data classification: An experimental comparison of novel calibration methods
Abstract
In health data classification, imbalanced and asymmetric class distributions can significantly impair the performance of machine learning models. One aspect critically affected by these issues is the reliability of class probability estimates, which are crucial for informed decision-making in healthcare applications. Rather than predicting class labels directly, it is often more useful to estimate the probability that an observation belongs to each possible class. This research addresses the challenges posed by imbalanced and asymmetric responses in health data classification by evaluating the effectiveness of recent calibration methods in improving class probability estimates. We propose Beta calibration as a novel calibration technique, together with the Stratified Brier score and the Jaccard score as evaluation metrics. The experimental comparison implements and assesses several calibration techniques to determine their impact on model performance and calibration accuracy on simulated and healthcare datasets with varying imbalance ratios. Our results show that Beta calibration consistently improved the classifiers' predictive ability. These findings provide valuable insights for selecting the most suitable calibration method to enhance class probability estimates in healthcare-related machine learning tasks.
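To make the two ingredients named above concrete, the sketch below illustrates, under simplifying assumptions, Beta calibration fitted as in Kull et al. (2017) by a logistic regression on the transformed scores ln(p) and -ln(1-p), and a class-stratified Brier score in the spirit of Wallace & Dahabreh (2014). The synthetic 10:1 imbalanced data, the random forest base classifier, and the unconstrained shape parameters (the full Beta calibration method restricts them to be non-negative) are illustrative assumptions, not the paper's exact experimental pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): Beta calibration via
# logistic regression on [ln(p), -ln(1-p)], evaluated with a class-stratified
# Brier score on simulated imbalanced data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def _beta_features(p_raw):
    """Map raw scores p to the features used by Beta calibration."""
    p = np.clip(p_raw, 1e-12, 1 - 1e-12)
    return np.column_stack([np.log(p), -np.log(1 - p)])


def fit_beta_calibrator(p_raw, y):
    """Fit Beta calibration as a (nearly unregularized) logistic regression.
    Simplification: the shape parameters are not constrained to be >= 0."""
    return LogisticRegression(C=1e6).fit(_beta_features(p_raw), y)


def apply_beta_calibrator(calibrator, p_raw):
    """Return calibrated P(y=1) for raw classifier scores."""
    return calibrator.predict_proba(_beta_features(p_raw))[:, 1]


def stratified_brier(y_true, p_hat):
    """Brier score of P(y=1) computed separately within each class stratum."""
    scores = {}
    for c in np.unique(y_true):
        target = 1.0 if c == 1 else 0.0
        scores[int(c)] = float(np.mean((p_hat[y_true == c] - target) ** 2))
    return scores


# Simulated data with roughly a 10:1 class imbalance (illustrative assumption).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Train the base classifier, then fit the calibrator on a held-out split.
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
cal = fit_beta_calibrator(clf.predict_proba(X_cal)[:, 1], y_cal)

# Compare per-class (stratified) Brier scores before and after calibration.
p_raw = clf.predict_proba(X_te)[:, 1]
p_cal = apply_beta_calibrator(cal, p_raw)
print("uncalibrated:   ", stratified_brier(y_te, p_raw))
print("beta-calibrated:", stratified_brier(y_te, p_cal))
```

Reporting the Brier score separately for the minority and majority classes prevents the majority class from dominating the apparent calibration quality, which is the motivation for stratified evaluation on imbalanced data.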
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.
References
Alfhaid, M. A. & Abdullah, M. Classification of imbalanced data stream: Techniques and challenges. Artificial Intelligence 9, 36–52 (2021). http://dx.doi.org/10.14738/tmlai.92.9964.
Ali, A., Shamsuddin, S. M. & Ralescu, A. L. Classification with class imbalance problem. Int. J. Advance Soft Compu. Appl 5, 176–204 (2013).
Allikivi, M.-L. & Kull, M. Non-parametric Bayesian isotonic calibration: Fighting over-confidence in binary classification in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2019), 103–120. http://dx.doi.org/10.1007/978-3-030-46147-8_7
Awe, O. O., Dukhi, N. & Dias, R. Shrinkage heteroscedastic discriminant algorithms for classifying multi-class high dimensional data: Insights from a national health survey. Machine Learning with Applications 12, 100459 (2023). https://doi.org/10.1016/j.mlwa.2023.100459
Dukhi, N., Sewpaul, R., Sekgala, M. D. & Awe, O. O. Artificial intelligence approach for analyzing anemia prevalence in children and adolescents in BRICS countries: a review. Current Research in Nutrition and Food Science Journal 9, 01–10 (2021). http://dx.doi.org/10.12944/CRNFSJ.9.1.01
Fernández, A., Garcia, S., Herrera, F. & Chawla, N. V. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research 61, 863–905 (2018). http://dx.doi.org/10.1613/jair.1.11192
Flach, P. Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward in Proceedings of the AAAI conference on artificial intelligence 33 (2019), 9808–9814. https://doi.org/10.1609/aaai.v33i01.33019808
Fu, S., Su, D., Li, S., Sun, S. & Tian, Y. Linear-exponential loss incorporated deep learning for imbalanced classification. ISA Transactions (2023). https://doi.org/10.1016/j.isatra.2023.06.016
Hastie, T., Tibshirani, R. & Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction (Springer, 2009). http://dx.doi.org/10.1007/BF02985802
Kull, M., Silva Filho, T. & Flach, P. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers in Artificial Intelligence and Statistics (2017), 623–631.
Li, S., Zhang, H., Ma, R., Zhou, J., Wen, J. & Zhang, B. Linear discriminant analysis with generalized kernel constraint for robust image classification. Pattern Recognition 136, 109196 (2023). https://doi.org/10.1016/j.patcog.2022.109196
Mahmudah, K. R., Indriani, F., Takemori-Sakai, Y., Iwata, Y., Wada, T. & Satou, K. Classification of Imbalanced Data Represented as Binary Features. Applied Sciences 11, 7825 (2021). https://doi.org/10.3390/app11177825
More, A. S. & Rana, D. P. in Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance 1–22 (IGI Global, 2021).
Mukhiddinov, M., Muminov, A. & Cho, J. Improved classification approach for fruits and vegetables freshness based on deep learning. Sensors 22, 8192 (2022). https://doi.org/10.3390/s22218192
Naeini, M. P. & Cooper, G. F. Binary classifier calibration using an ensemble of near isotonic regression models in 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016), 360–369. https://doi.org/10.1109/ICDM.2016.0047
Noble, W. S. What is a support vector machine? Nature Biotechnology 24, 1565–1567 (2006). https://doi.org/10.1038/nbt1206-1565
Pan, Z., Gu, Z., Jiang, X., Zhu, G. & Ma, D. A modular approximation methodology for efficient fixed-point hardware implementation of the sigmoid function. IEEE Transactions on Industrial Electronics 69, 10694–10703 (2022). https://doi.org/10.1109/TIE.2022.3146573
Panigrahi, R., Borah, S., Bhoi, A. K., Ijaz, M. F., Pramanik, M., Kumar, Y. & Jhaveri, R. H. A consolidated decision tree-based intrusion detection system for binary and multiclass imbalanced datasets. Mathematics 9, 751 (2021). https://doi.org/10.3390/math9070751
Rufibach, K. Use of Brier score to assess binary predictions. Journal of Clinical Epidemiology 63, 938–939 (2010). https://doi.org/10.1016/j.jclinepi.2009.11.009
Safavian, S. R. & Landgrebe, D. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics 21, 660–674 (1991). https://doi.org/10.1109/21.97458
Shalev-Shwartz, S. & Ben-David, S. Understanding machine learning: From theory to algorithms (Cambridge University Press, 2014).
Ugarković, A. & Oreški, D. Supervised and Unsupervised Machine Learning Approaches on Class Imbalanced Data in 2022 International Conference on Smart Systems and Technologies (SST) (2022), 159–162. https://doi.org/10.1109/SST55530.2022.9954646
Van den Goorbergh, R., van Smeden, M., Timmerman, D. & Van Calster, B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. Journal of the American Medical Informatics Association 29, 1525–1534 (2022). https://doi.org/10.1093/jamia/ocac093
Wallace, B. C. & Dahabreh, I. J. Improving class probability estimates for imbalanced data. Knowledge and information systems 41, 33–52 (2014). https://doi.org/10.1007/s10115-013-0670-6
Zhou, Q., Qi, Y., Tang, H. & Wu, P. Machine learning-based processing of unbalanced data sets for computer algorithms. Open Computer Science 13, 20220273 (2023). https://doi.org/10.1515/comp-2022-0273