Empirical Evidence on Data Leakage from Improper Resampling Practices in Diabetes Care

Dip, Solayman Alam; Hossain, Md. Tahmid; Tanha, Fatin Fawjia

dc.contributor.author	Dip, Solayman Alam
dc.contributor.author	Hossain, Md. Tahmid
dc.contributor.author	Tanha, Fatin Fawjia
dc.date.accessioned	2026-05-13T07:59:06Z
dc.date.available	2026-05-13T07:59:06Z
dc.date.issued	2026-04
dc.identifier.uri	https://ar.iub.edu.bd/handle/11348/1201
dc.description.abstract	Machine learning (ML) is a popular tool for diabetes prediction. Medical datasets, however, typically contain a majority of healthy patients and a minority of diabetic patients, which creates a severe class imbalance. Researchers use synthetic resampling to correct this imbalance. A common error is to apply the resampling before the test data is separated. This pre-split approach causes data leakage and creates an illusion of high performance. Furthermore, many studies rely on standard accuracy, a metric that often hides a model's failure to detect actual diabetic cases. This study examined data leakage, metric reliability, and model explainability across three public datasets: the Pima Indians Diabetes Dataset, the Behavioral Risk Factor Surveillance System 2015 (BRFSS-2015), and Diahealth. The research compared pre-split and post-split resampling pipelines using eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Naive Bayes models. SHAP (SHapley Additive exPlanations) was then applied to evaluate how data leakage alters the clinical features the models use to identify diabetic patients. The experiments show that pre-split resampling inflates performance scores by 10 to 20%. The post-split pipelines produced lower scores but gave an honest assessment of real-world capability. The results also demonstrate that standard accuracy is a misleading metric for imbalanced data; the Matthews Correlation Coefficient (MCC) and Recall offer a safer and more realistic evaluation. Finally, the SHAP analysis revealed that data leakage distorts the internal logic of the model. Pre-split pipelines prioritized noisy or indirect variables, while the corrected post-split pipelines isolated expected clinical markers such as Glucose, Body Mass Index (BMI), and Age. Resampling must occur after the train-test split to ensure a valid evaluation.
To guarantee patient safety, future diagnostic research needs to prioritize MCC over standard accuracy and incorporate explainability tools to verify clinical reasoning.	en_US
dc.language.iso	en	en_US
dc.publisher	IUB	en_US
dc.subject	Diabetes Prediction	en_US
dc.subject	Machine Learning	en_US
dc.subject	Class Imbalance	en_US
dc.subject	Data Leakage	en_US
dc.subject	Synthetic Resampling	en_US
dc.subject	Imbalanced Medical Data	en_US
dc.subject	XGBoost	en_US
dc.subject	LightGBM	en_US
dc.subject	Explainable Artificial Intelligence (XAI)	en_US
dc.subject	Medical Data Mining	en_US
dc.subject	Healthcare Analytics	en_US
dc.subject	Predictive Modeling	en_US
dc.subject	Feature Importance Analysis	en_US
dc.subject	Diabetes Diagnosis	en_US
dc.subject	Public Health Informatics	en_US
dc.title	Empirical Evidence on Data Leakage from Improper Resampling Practices in Diabetes Care	en_US
dc.type	Thesis	en_US
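
The two central claims of the abstract (resampling before the split leaks data, and accuracy hides missed diabetic cases) can be illustrated with a small sketch. This is not the thesis pipeline: the toy cohort, the function names, and the every-fifth-row split are all hypothetical, and plain duplication stands in for synthetic resampling so that copies can be traced by id. Synthetic methods interpolate between minority rows rather than copying them, so their leakage is less literal but arises the same way.

```python
from itertools import cycle, islice
from math import sqrt

# Hypothetical toy cohort: 80 healthy rows (label 0), 8 diabetic rows (label 1).
# Each row is (row_id, label); the ids let us trace copies across the split.
rows = [(f"h{i}", 0) for i in range(80)] + [(f"d{i}", 1) for i in range(8)]

def oversample(data):
    """Duplicate minority rows until both classes have the same count."""
    majority = [r for r in data if r[1] == 0]
    minority = [r for r in data if r[1] == 1]
    return majority + list(islice(cycle(minority), len(majority)))

def split(data):
    """Deterministic 80/20 split: every fifth row goes to the test set."""
    train = [r for i, r in enumerate(data) if i % 5 != 0]
    test = [r for i, r in enumerate(data) if i % 5 == 0]
    return train, test

def leaked_ids(train, test):
    """Ids of rows that appear in both the training and the test set."""
    return {r[0] for r in train} & {r[0] for r in test}

def scores(y_true, y_pred):
    """Accuracy, recall, and MCC for binary labels (MCC is 0 if undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, recall, mcc

# Pre-split pipeline: resample first, then split.
pre_train, pre_test = split(oversample(rows))

# Post-split pipeline: split first, then resample the training part only.
post_train, post_test = split(rows)
post_train = oversample(post_train)

print("pre-split leaked ids: ", len(leaked_ids(pre_train, pre_test)))
print("post-split leaked ids:", len(leaked_ids(post_train, post_test)))

# A model that calls every patient healthy, scored on the clean test set.
y_true = [label for _, label in post_test]
y_pred = [0] * len(y_true)
accuracy, recall, mcc = scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f} recall={recall:.3f} mcc={mcc:.3f}")
```

In this toy setup every diabetic row has copies on both sides of the pre-split partition, while the post-split test set shares no row with its training set. The always-healthy predictor still reaches high accuracy on the imbalanced test set even though its recall and MCC are both zero, which is the kind of failure the abstract says accuracy conceals.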

Files in this item

Name: SP_Thesis_Draft_Dip_Tahmid_Fatin ...
Size: 12.06Mb
Format: PDF

This item appears in the following Collection(s)

Undergraduate Thesis [44]
By CSE Department
