Scientific Reports | (2022) 12:19999 | https://doi.org/10.1038/s41598-022-22011-8
www.nature.com/scientificreports

Prediction of β‑Thalassemia carriers using complete blood count features

Furqan Rustam¹, Imran Ashraf²*, Shehbaz Jabbar³, Kilian Tutusaus⁴,⁷,⁸, Cristina Mazas⁴,⁵, Alina Eugenia Pascual Barrera⁴,⁵,⁶ & Isabel de la Torre Diez⁹*

β‑Thalassemia is one of the dangerous diseases contributing to the high mortality rate in Mediterranean countries. Substantial resources are required to save a β‑Thalassemia carrier's life, and early detection of thalassemia patients enables appropriate treatment that can increase the carrier's life expectancy. Being a genetic disease, it cannot be prevented; however, analysis of several indicators in the parents' blood can be used to detect the disorders that cause Thalassemia. Laboratory tests for Thalassemia, such as high‑performance liquid chromatography, Complete Blood Count (CBC) with peripheral smear, and genetic testing, are time‑consuming and expensive. Red blood indices from CBC can be used with machine learning models for the same task. Despite the available approaches for detecting Thalassemia carriers from CBC data, gaps exist between the desired and achieved accuracy. Moreover, the data imbalance problem has not been studied well, which makes the models less generalizable. This study proposes a highly accurate approach for β‑Thalassemia detection using red blood indices from CBC augmented by supervised machine learning. Because not all features carry predictive information about the target variable, this study employs a unified framework of two feature selection techniques, Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). The data imbalance between β‑Thalassemia carriers and non‑carriers is handled by the Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) sampling.
Extensive experiments are performed using many state‑of‑the‑art machine learning and deep learning models. Experimental results indicate the superiority of the proposed approach over existing approaches, with an accuracy score of 0.96.

Thalassemia is a hereditary genetic disorder that occurs due to mutations in the DeoxyriboNucleic Acid (DNA) of cells that cause insufficient production of Hemoglobin (Hb) in the body. Hb is a protein that allows Red Blood Cells (RBCs) to carry oxygen. A deficiency of Hb lowers the survival rate of RBCs, so fewer RBCs flow through the bloodstream; the resulting limited supply of oxygen to the body can be life-threatening. Two protein chains, α and β, are required to synthesize Hb. RBCs cannot carry oxygen efficiently if either of these protein chains is insufficient. The two forms of thalassemia are α-Thalassemia, caused by reduced production of the α-protein chain, and β-Thalassemia, caused by absent or limited synthesis of the β-protein chain¹. Symptoms of thalassemia range from mild to severe anemia, which can cause organ damage and even death. Many countries are currently dealing with a growing rate of thalassemia, which has significantly increased disability and mortality worldwide. β-Thalassemia is the most prevalent type of thalassemia and is common among the people of Mediterranean countries, hence it is also called 'Mediterranean Anaemia'. In Pakistan, approximately 5000–9000 children are diagnosed with β-Thalassemia every year, and the estimated carrier rate is 5–7% of the total population². According to the

¹Faculty of Computer Science and Information Technology, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, Pakistan. ²Information and Communication Engineering, Yeungnam University, Gyeongsan 38541, Korea.
³Sheikh Zayed Hospital and Medical College, Rahim Yar Khan 64200, Pakistan. ⁴Universidad Europea del Atlántico, Isabel Torres 21, 39011 Santander, Spain. ⁵Universidad Internacional Iberoamericana, 24560 Campeche, Mexico. ⁶Universidad Internacional Iberoamericana Arecibo, Puerto Rico 00613, USA. ⁷Universidade Internacional do Cuanza, Cuito, Bié, Angola. ⁸Fundación Universitaria Internacional de Colombia Bogotá, Bogotá, Colombia. ⁹Department of Signal Theory and Communications and Telematic Engineering, University of Valladolid, Paseo de Belén 15, 47011 Valladolid, Spain. *email: imranashraf@ynu.ac.kr; isator@tel.uva.es