Statistical Challenges in 21st Century Cosmology
Proceedings IAU Symposium No. 306, 2014
Alan Heavens, Jean-Luc Starck & Alberto Krone-Martins, eds.
© 2014 International Astronomical Union
DOI: 00.0000/X000000000000000X

Adapting Predictive Models for Cepheid Variable Star Classification Using Linear Regression and Maximum Likelihood

Kinjal Dhar Gupta¹, Ricardo Vilalta¹, Vicken Asadourian² and Lucas Macri³

¹ Dept. of Computer Science, University of Houston.
² Dept. of Mathematics, University of Houston.
4800 Calhoun Road, Houston TX-70004, USA.
email: kinjal13@cs.uh.edu, vilalta@cs.uh.edu, vmasadourian@uh.edu
³ Dept. of Physics and Astronomy, Texas A&M University.
4242 TAMU, College Station, TX 77843-4242, USA.
email: lmacri@tamu.edu

Abstract. We describe an approach to automate the classification of Cepheid variable stars into two subtypes according to their pulsation mode. Automating this classification matters for the precise determination of distances to nearby galaxies, which in turn helps reduce the uncertainty in the current expansion rate of the universe. One main difficulty lies in the compatibility of models trained on different galaxy datasets: a model trained on one dataset may be ineffective on a testing set drawn from another. A solution to this difficulty is to adapt predictive models across domains, which is necessary when the training and testing sets do not follow the same distribution. The gist of our methodology is to train a predictive model on a nearby galaxy (e.g., the Large Magellanic Cloud), followed by a model-adaptation step that makes the model operable on other nearby galaxies. We follow a parametric approach to density estimation by modeling the training data (anchor galaxy) using a mixture of linear models. We then use maximum likelihood to compute the displacement of the variables needed for the testing data to closely overlap the training data. At that point, the model can be applied directly to the testing data (target galaxy).
Keywords. (stars: variables:) Cepheids, (galaxies:) Magellanic Clouds, methods: statistical, infrared: stars, methods: data analysis.

1. Introduction

Traditional machine learning algorithms assume that both training and testing data originate from the same distribution. This assumption is often unwarranted in real-world applications. One approach to handling the discrepancy between source (training) and target (test) domains is called domain adaptation, where class-conditional distributions remain equal, though class prior distributions differ (Ben-David et al. 2006; Storkey 2009; Ben-David et al. 2010). Our domain adaptation method learns a model on a source domain without using any information from a target domain; we assume equal class-conditional probabilities, but class priors that differ by a certain shift across one or more features, as explained by Vilalta et al. (2013).

Different from previous work, we assume a bi-variate Linear Mixture Model with Gaussian noise, and use Maximum Likelihood to find the shift between source and target distributions. The idea is to align the two datasets so that a model learnt on the source domain can be effectively used on the target domain.

We classify a particular type of variable star, named Cepheids, into two pulsation