Improving class probability estimates in asymmetric health data classification: An experimental comparison of novel calibration methods
Abstract
In health data classification, imbalanced and asymmetric class distributions can significantly impair the performance of machine learning models. One aspect critically affected by these issues is the reliability of class probability estimates, which are crucial for informed decision-making in healthcare applications. Rather than predicting class labels directly, it is often more useful to estimate the probability that an observation belongs to each possible class. This research addresses the challenges posed by imbalanced and asymmetric responses in health data classification by evaluating the effectiveness of recent calibration methods in improving class probability estimates. We propose Beta calibration as a novel calibration technique, together with the Stratified Brier score and the Jaccard score as evaluation metrics. The experimental comparison implements and assesses several calibration techniques to determine their impact on model performance and calibration accuracy on simulated and healthcare datasets with varying imbalance ratios. Our results show that Beta calibration consistently improved the classifiers' predictive ability. These findings provide valuable insights for selecting the most suitable calibration method to enhance class probability estimates in healthcare-related machine learning tasks.
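To make the two ingredients named above concrete, the sketch below illustrates, under simplifying assumptions, Beta calibration fitted as in Kull et al. (2017) by a logistic regression on the transformed scores ln(p) and -ln(1-p), and a class-stratified Brier score in the spirit of Wallace & Dahabreh (2014). The synthetic 10:1 imbalanced data, the random forest base classifier, and the unconstrained shape parameters (the full Beta calibration method restricts them to be non-negative) are illustrative assumptions, not the paper's exact experimental pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): Beta calibration via
# logistic regression on [ln(p), -ln(1-p)], evaluated with a class-stratified
# Brier score on simulated imbalanced data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def _beta_features(p_raw):
    """Map raw scores p to the features used by Beta calibration."""
    p = np.clip(p_raw, 1e-12, 1 - 1e-12)
    return np.column_stack([np.log(p), -np.log(1 - p)])


def fit_beta_calibrator(p_raw, y):
    """Fit Beta calibration as a (nearly unregularized) logistic regression.
    Simplification: the shape parameters are not constrained to be >= 0."""
    return LogisticRegression(C=1e6).fit(_beta_features(p_raw), y)


def apply_beta_calibrator(calibrator, p_raw):
    """Return calibrated P(y=1) for raw classifier scores."""
    return calibrator.predict_proba(_beta_features(p_raw))[:, 1]


def stratified_brier(y_true, p_hat):
    """Brier score of P(y=1) computed separately within each class stratum."""
    scores = {}
    for c in np.unique(y_true):
        target = 1.0 if c == 1 else 0.0
        scores[int(c)] = float(np.mean((p_hat[y_true == c] - target) ** 2))
    return scores


# Simulated data with roughly a 10:1 class imbalance (illustrative assumption).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Train the base classifier, then fit the calibrator on a held-out split.
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
cal = fit_beta_calibrator(clf.predict_proba(X_cal)[:, 1], y_cal)

# Compare per-class (stratified) Brier scores before and after calibration.
p_raw = clf.predict_proba(X_te)[:, 1]
p_cal = apply_beta_calibrator(cal, p_raw)
print("uncalibrated:   ", stratified_brier(y_te, p_raw))
print("beta-calibrated:", stratified_brier(y_te, p_cal))
```

Reporting the Brier score separately for the minority and majority classes prevents the majority class from dominating the apparent calibration quality, which is the motivation for stratified evaluation on imbalanced data.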
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.
References
Alfhaid, M. A. & Abdullah, M. Classification of imbalanced data stream: Techniques and challenges. Artificial Intelligence 9, 36–52 (2021). http://dx.doi.org/10.14738/tmlai.92.9964.
Ali, A., Shamsuddin, S. M. & Ralescu, A. L. Classification with class imbalance problem. Int. J. Advance Soft Compu. Appl 5, 176–204 (2013).
Allikivi, M.-L. & Kull, M. Non-parametric Bayesian isotonic calibration: Fighting over-confidence in binary classification in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2019), 103–120. http://dx.doi.org/10.1007/978-3-030-46147-8_7
Awe, O. O., Dukhi, N. & Dias, R. Shrinkage heteroscedastic discriminant algorithms for classifying multi-class high dimensional data: Insights from a national health survey. Machine Learning with Applications 12, 100459 (2023). https://doi.org/10.1016/j.mlwa.2023.100459
Dukhi, N., Sewpaul, R., Sekgala, M. D. & Awe, O. O. Artificial intelligence approach for analyzing anemia prevalence in children and adolescents in BRICS countries: a review. Current Research in Nutrition and Food Science Journal 9, 01–10 (2021). http://dx.doi.org/10.12944/CRNFSJ.9.1.01
Fernández, A., Garcia, S., Herrera, F. & Chawla, N. V. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research 61, 863–905 (2018). http://dx.doi.org/10.1613/jair.1.11192
Flach, P. Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward in Proceedings of the AAAI conference on artificial intelligence 33 (2019), 9808–9814. https://doi.org/10.1609/aaai.v33i01.33019808
Fu, S., Su, D., Li, S., Sun, S. & Tian, Y. Linear-exponential loss incorporated deep learning for imbalanced classification. ISA Transactions (2023). https://doi.org/10.1016/j.isatra.2023.06.016
Hastie, T., Tibshirani, R. & Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction (Springer, 2009). http://dx.doi.org/10.1007/BF02985802
Kull, M., Silva Filho, T. & Flach, P. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers in Artificial Intelligence and Statistics (2017), 623–631.
Li, S., Zhang, H., Ma, R., Zhou, J., Wen, J. & Zhang, B. Linear discriminant analysis with generalized kernel constraint for robust image classification. Pattern Recognition 136, 109196 (2023). https://doi.org/10.1016/j.patcog.2022.109196
Mahmudah, K. R., Indriani, F., Takemori-Sakai, Y., Iwata, Y., Wada, T. & Satou, K. Classification of Imbalanced Data Represented as Binary Features. Applied Sciences 11, 7825 (2021). https://doi.org/10.3390/app11177825
More, A. S. & Rana, D. P. in Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance 1–22 (IGI Global, 2021).
Mukhiddinov, M., Muminov, A. & Cho, J. Improved classification approach for fruits and vegetables freshness based on deep learning. Sensors 22, 8192 (2022). https://doi.org/10.3390/s22218192
Naeini, M. P. & Cooper, G. F. Binary classifier calibration using an ensemble of near isotonic regression models in 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016), 360–369. https://doi.org/10.1109/ICDM.2016.0047
Noble, W. S. What is a support vector machine? Nature Biotechnology 24, 1565–1567 (2006). https://doi.org/10.1038/nbt1206-1565
Pan, Z., Gu, Z., Jiang, X., Zhu, G. & Ma, D. A modular approximation methodology for efficient fixed-point hardware implementation of the sigmoid function. IEEE Transactions on Industrial Electronics 69, 10694–10703 (2022). https://doi.org/10.1109/TIE.2022.3146573
Panigrahi, R., Borah, S., Bhoi, A. K., Ijaz, M. F., Pramanik, M., Kumar, Y. & Jhaveri, R. H. A consolidated decision tree-based intrusion detection system for binary and multiclass imbalanced datasets. Mathematics 9, 751 (2021). https://doi.org/10.3390/math9070751
Rufibach, K. Use of Brier score to assess binary predictions. Journal of Clinical Epidemiology 63, 938–939 (2010). https://doi.org/10.1016/j.jclinepi.2009.11.009
Safavian, S. R. & Landgrebe, D. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics 21, 660–674 (1991). https://doi.org/10.1109/21.97458
Shalev-Shwartz, S. & Ben-David, S. Understanding machine learning: From theory to algorithms (Cambridge University Press, 2014).
Ugarković, A. & Oreški, D. Supervised and Unsupervised Machine Learning Approaches on Class Imbalanced Data in 2022 International Conference on Smart Systems and Technologies (SST) (2022), 159–162. https://doi.org/10.1109/SST55530.2022.9954646
Van den Goorbergh, R., van Smeden, M., Timmerman, D. & Van Calster, B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. Journal of the American Medical Informatics Association 29, 1525–1534 (2022). https://doi.org/10.1093/jamia/ocac093
Wallace, B. C. & Dahabreh, I. J. Improving class probability estimates for imbalanced data. Knowledge and information systems 41, 33–52 (2014). https://doi.org/10.1007/s10115-013-0670-6
Zhou, Q., Qi, Y., Tang, H. & Wu, P. Machine learning-based processing of unbalanced data sets for computer algorithms. Open Computer Science 13, 20220273 (2023). https://doi.org/10.1515/comp-2022-0273