OPEN DATA AND MACHINE LEARNING FOR BIRTH PREDICTION AND CLASSIFICATION
A Case Study Utilizing Malaysia's Public Sector Open Data Portal

MUHAMMAD SUKRI BIN RAMLI
Asia School of Business
Kuala Lumpur, Malaysia
Email: m.binramli@sloan.mit.edu

Abstract

This study analyzes birth data in Malaysia from 2000 to 2023, employing machine learning techniques to predict birth numbers, categorize birth rate periods, and explore newborn sex prediction. The analysis utilizes Linear Regression (James et al., 2013), Random Forest Classifier (Breiman, 2001), Prophet (Taylor & Letham, 2018), and XGBoost (Chen & Guestrin, 2016) models, incorporating feature engineering and handling missing values. Results show that XGBoost outperforms Linear Regression in predicting birth numbers, achieving a lower Mean Squared Error. The study also highlights potential overfitting in the classification task and the infeasibility of predicting newborn sex based on year and ethnicity alone. Future work includes incorporating additional features, exploring more sophisticated models, and addressing overfitting to enhance prediction accuracy and understanding of birth trends in Malaysia.

Figure 1: Total live births in Malaysia by ethnicity

1. Introduction

This report presents an analysis of birth data in Malaysia, sourced from the nation's Official Open Data Portal (data.gov.my). This initiative is rooted in the "Pekeliling Am Bil. 1 Tahun 2015," which underscores the Malaysian government's commitment to open data, thereby enhancing transparency and fostering innovation. The dataset encompasses birth records from January 1st, 2000, to December 31st, 2023, and includes details on date, sex, ethnicity, absolute number of births, and birth rate. This comprehensive dataset facilitates the examination of demographic shifts, such as variations in birth rates across ethnicities and over time.
The analysis applies several machine learning models (Linear Regression, Random Forest Classifier, Prophet, and XGBoost) to predict birth numbers, categorize birth rate periods, and explore newborn sex prediction. The findings may inform policy decisions and resource allocation in the healthcare sector.
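
To make the dataset structure described above concrete, the following is a minimal sketch of how records with date, sex, ethnicity, absolute births, and birth rate might be aggregated by year and ethnicity. The column names (`date`, `sex`, `ethnicity`, `births`, `rate`), the group labels, and all values are hypothetical placeholders chosen for illustration; they are not the portal's actual schema or real figures, and this is not necessarily how the study processed the data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. Column names and values are illustrative
# placeholders only, NOT actual figures or the real schema from data.gov.my.
SAMPLE_CSV = """date,sex,ethnicity,births,rate
2000-01-01,male,group_a,100,10.0
2000-01-01,female,group_a,90,9.0
2000-01-01,male,group_b,40,8.0
2001-01-01,male,group_a,95,9.5
2001-01-01,female,group_a,85,8.5
"""


def births_by_year_and_ethnicity(csv_text):
    """Sum the absolute number of births per (year, ethnicity) pair."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row["date"][:4])  # assumes ISO dates: YYYY-MM-DD
        totals[(year, row["ethnicity"])] += int(row["births"])
    return dict(totals)


totals = births_by_year_and_ethnicity(SAMPLE_CSV)
for (year, ethnicity), births in sorted(totals.items()):
    print(year, ethnicity, births)
```

Summing across sex within each year and ethnicity yields the kind of per-ethnicity yearly series that trend comparisons or forecasting models could then consume.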