| dc.description.abstract | Machine learning (ML) is a popular tool for diabetes prediction. Medical datasets, however, typically contain a majority of healthy patients and a minority of diabetic patients, creating a severe class imbalance. Researchers often use synthetic resampling to correct this imbalance. A common error is applying the resampling before separating the test data. This pre-split approach causes data leakage, because synthetic samples generated from the full dataset carry information about the test records into training, and it creates an illusion of high performance. Furthermore, many studies rely on standard accuracy, a metric that often hides a model's failure to detect actual diabetic cases.
This study examined data leakage, metric reliability, and model explainability across three public datasets: the Pima Indians Diabetes Dataset, the Behavioral Risk Factor Surveillance System 2015 (BRFSS-2015), and Diahealth. The research compared pre-split and post-split resampling pipelines using eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Naive Bayes models. SHAP (SHapley Additive exPlanations) was then applied to evaluate how data leakage alters the clinical features the models use to identify diabetic patients. The experiments show that pre-split resampling inflates performance scores by 10 to 20%. The post-split pipelines produced lower scores but gave an honest assessment of real-world capability. The results also demonstrate that standard accuracy is a misleading metric for imbalanced data; the Matthews Correlation Coefficient (MCC) and recall offer a safer and more realistic evaluation. Finally, the SHAP analysis revealed that data leakage distorts the internal logic of the model. Pre-split pipelines prioritized noisy or indirect variables, while the corrected post-split pipelines isolated expected clinical markers such as Glucose, Body Mass Index (BMI), and Age. Resampling must therefore occur after the train-test split, and only on the training data, to ensure a valid evaluation. To support patient safety, future diagnostic research should prioritize MCC over standard accuracy and incorporate explainability tools to verify clinical reasoning. | en_US |