A Machine Learning Approach for Multiclass Classification of Diabetes
Keywords:
Diabetes, Machine Learning, Logistic Regression, SVM, Decision Tree, Random Forest, Gradient Boosting
Abstract
Diabetes is a growing global health problem, with millions of people at risk because of lifestyle, genetics, and other health conditions. Early diagnosis is essential for timely treatment, for avoiding complications, and for easing the strain on healthcare systems. Because the disease progresses through distinct stages, models are needed that can distinguish between diabetic, non-diabetic, and pre-diabetic individuals. This study develops a multiclass classification model that predicts a patient’s diabetes status from a range of health indicators. In addition to standard factors such as blood sugar level, BMI, cholesterol, and age, external risk factors are also considered to improve accuracy. The target variable categorizes patients as Diabetic, Non-Diabetic, or Pre-Diabetic. Logistic Regression, SVM, Decision Tree, Random Forest, and Gradient Boosting models are applied to the classification task. After training and testing, Random Forest delivered the highest accuracy at 98%, outperforming the other models. These findings highlight the potential of machine learning for classifying patients by diabetes status.
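The abstract reports overall accuracy on a three-class target. As a minimal sketch, and not the authors' evaluation code, accuracy and per-class recall for such a target could be computed as shown here. The function names and the toy labels are hypothetical and are not taken from the study; they are chosen only to show that a high overall accuracy can coexist with poor recall on a small class such as Pre-Diabetic.

```python
from collections import Counter

# Class names follow the target variable described in the abstract.
CLASSES = ["Non-Diabetic", "Pre-Diabetic", "Diabetic"]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred, classes=CLASSES):
    """For each class: correctly predicted members / all true members."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Classes with no true members are skipped to avoid dividing by zero.
    return {c: hits[c] / support[c] for c in classes if support[c] > 0}


# Made-up toy labels for illustration only; NOT data from the study.
y_true = ["Diabetic"] * 6 + ["Non-Diabetic"] * 3 + ["Pre-Diabetic"] * 1
y_pred = ["Diabetic"] * 6 + ["Non-Diabetic"] * 3 + ["Diabetic"] * 1

print("accuracy:", accuracy(y_true, y_pred))
print("recall:", per_class_recall(y_true, y_pred))
```

In this toy case nine of ten predictions are correct, yet the single Pre-Diabetic sample is missed, which is why per-class figures are worth reading alongside a headline accuracy.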
Copyright (c) 2025 International Academic Publishing House (IAPH)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.