
dc.contributor.author Dip, Solayman Alam
dc.contributor.author Hossain, Md. Tahmid
dc.contributor.author Tanha, Fatin Fawjia
dc.date.accessioned 2026-05-13T07:59:06Z
dc.date.available 2026-05-13T07:59:06Z
dc.date.issued 2026-04
dc.identifier.uri https://ar.iub.edu.bd/handle/11348/1201
dc.description.abstract Machine learning (ML) is a popular tool for diabetes prediction. Medical datasets, however, typically contain a majority of healthy patients and a minority of diabetic patients, creating a severe class imbalance. Researchers often use synthetic resampling to correct this imbalance. A common error is applying the resampling before the test data are separated. This pre-split approach causes data leakage and creates an illusion of high performance. Furthermore, many studies rely on standard accuracy, a metric that often hides a model's failure to detect actual diabetic cases. This study examined data leakage, metric reliability, and model explainability across three public datasets: the Pima Indians Diabetes Dataset, the Behavioral Risk Factor Surveillance System 2015 (BRFSS-2015), and Diahealth. The research compared pre-split and post-split resampling pipelines using eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Naive Bayes models. SHAP (SHapley Additive exPlanations) was then applied to evaluate how data leakage alters the clinical features the models use to identify diabetic patients. The experiments show that pre-split resampling inflates performance scores by 10 to 20%. The post-split pipelines produced lower scores but gave an honest assessment of real-world capability. The results also demonstrate that standard accuracy is a misleading metric for imbalanced data; the Matthews Correlation Coefficient (MCC) and Recall offer a safer and more realistic evaluation. Finally, the SHAP analysis revealed that data leakage distorts the internal logic of the model: pre-split pipelines prioritized noisy or indirect variables, while the corrected post-split pipelines isolated expected clinical markers such as Glucose, Body Mass Index (BMI), and Age. Resampling must occur after the train-test split to ensure a valid evaluation. To protect patient safety, future diagnostic research should prioritize MCC over standard accuracy and incorporate explainability tools to verify clinical reasoning. en_US
dc.language.iso en en_US
dc.publisher IUB en_US
dc.subject Diabetes Prediction en_US
dc.subject Machine Learning en_US
dc.subject Class Imbalance en_US
dc.subject Data Leakage en_US
dc.subject Synthetic Resampling en_US
dc.subject Imbalanced Medical Data en_US
dc.subject XGBoost en_US
dc.subject LightGBM en_US
dc.subject Explainable Artificial Intelligence (XAI) en_US
dc.subject Medical Data Mining en_US
dc.subject Healthcare Analytics en_US
dc.subject Predictive Modeling en_US
dc.subject Feature Importance Analysis en_US
dc.subject Diabetes Diagnosis en_US
dc.subject Public Health Informatics en_US
dc.title Empirical Evidence on Data Leakage from Improper Resampling Practices in Diabetes Care en_US
dc.type Thesis en_US
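
The abstract's two central claims, that resampling before the train-test split leaks information and that accuracy can hide missed diabetic cases, can be illustrated with a small sketch. This is not the thesis's pipeline: the toy cohort, the function names, and the 20% test fraction are all assumptions made for illustration, and a simple duplication-based oversampler stands in for the synthetic resampling the abstract refers to. Duplication makes the leak easy to count, because the same patient record can appear on both sides of the split.

```python
import math
import random


def duplicate_minority(rows):
    """Balance classes by repeating minority rows (stand-in for synthetic resampling)."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        return list(rows)  # nothing to duplicate
    deficit = len(majority) - len(minority)
    return list(rows) + [minority[i % len(minority)] for i in range(deficit)]


def split(rows, test_fraction, rng):
    """Shuffle and split into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_fraction(train, test):
    """Share of test rows whose patient id also occurs in the training set."""
    train_ids = {pid for pid, _ in train}
    return sum(pid in train_ids for pid, _ in test) / len(test)


def scores(tp, fp, tn, fn):
    """Accuracy, recall, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 by convention
    return accuracy, recall, mcc


# Hypothetical toy cohort: 100 patients as (id, label), 10 of them diabetic.
patients = [(pid, 1 if pid < 10 else 0) for pid in range(100)]

# Pre-split (flawed): oversample first, then split.
train_pre, test_pre = split(duplicate_minority(patients), 0.2, random.Random(0))
pre_leak = leaked_fraction(train_pre, test_pre)

# Post-split (correct): split first, then oversample the training set only.
train_raw, test_post = split(patients, 0.2, random.Random(0))
train_post = duplicate_minority(train_raw)
post_leak = leaked_fraction(train_post, test_post)

print(f"test rows also seen in training, pre-split:  {pre_leak:.0%}")
print(f"test rows also seen in training, post-split: {post_leak:.0%}")

# A model that labels every patient healthy on a 90/10 cohort.
accuracy, recall, mcc = scores(tp=0, fp=0, tn=90, fn=10)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} mcc={mcc:.2f}")
```

In the post-split variant the test set contains only original, unseen patients, so the leaked fraction is zero by construction. Synthetic methods leak less visibly than duplication, since generated samples are derived from points that may later land in the test set, but the remedy is the same: resample only the training portion. The last lines show why accuracy alone misleads: a model that never predicts diabetes still scores 0.90 accuracy, while recall and MCC are both 0.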



