Non-parametric regression models for compositional data

Michail Tsagris¹, Abdulaziz Alenazi² and Connie Stewart³

¹ Department of Economics, University of Crete, Rethymnon, Crete, Greece
² Department of Mathematics, Northern Border University, Arar, Saudi Arabia
³ Department of Mathematics and Statistics, University of New Brunswick, Saint John, Canada

May 13, 2021

Abstract

Compositional data arise in many real-life applications, and versatile methods for properly analyzing this type of data in the regression context are needed. This paper, through use of the α-transformation, extends the classical k-NN regression to what is termed α-k-NN regression, yielding a highly flexible non-parametric regression model for compositional data. The α-k-NN regression is further extended to the α-kernel regression by adopting the Nadaraya-Watson estimator. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed models without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using these non-parametric regressions for complex relationships between the compositional response data and Euclidean predictor variables. Both suggest that α-k-NN and α-kernel regressions can lead to more accurate predictions compared to current regression models, which assume a, sometimes restrictive, parametric relationship with the predictor variables. In addition, the α-k-NN regression, in contrast to the α-kernel regression, enjoys high computational efficiency, rendering it highly attractive for use with large-scale, massive, or big data.

Keywords: compositional data, regression, α-transformation, k-NN algorithm, kernel regression

1 Introduction

Non-negative multivariate vectors with variables (typically called components) conveying only relative information are referred to as compositional data.
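As a small illustration of what "conveying only relative information" means in practice, the following sketch rescales non-negative vectors to unit sum and shows that two vectors differing only by a scale factor yield the same composition. The helper names `closure` and `in_simplex`, the tolerance, and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def closure(x):
    """Rescale a non-negative vector so that its components sum to 1.

    Hypothetical helper for illustration only. Zero components are allowed.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        raise ValueError("components must be non-negative")
    total = x.sum()
    if total <= 0:
        raise ValueError("at least one component must be positive")
    return x / total


def in_simplex(u, tol=1e-12):
    """Check non-negativity and unit sum, up to a numerical tolerance."""
    u = np.asarray(u, dtype=float)
    return bool(np.all(u >= 0) and abs(u.sum() - 1.0) <= tol)


# Made-up raw measurements with one zero component.
raw = np.array([2.0, 6.0, 0.0, 12.0])

u = closure(raw)        # composition of the raw vector
v = closure(3.0 * raw)  # same vector on a different scale

print(u)
print(np.allclose(u, v), in_simplex(u))
```

Only the ratios between components survive the rescaling, which is why the overall scale of the raw measurements carries no information about the composition.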
When the vectors are normalized to sum to 1, their sample space is the standard simplex

\[
\mathbb{S}^{D-1} = \left\{ (u_1, \ldots, u_D)^\top \;\middle|\; u_i \geq 0, \ \sum_{i=1}^{D} u_i = 1 \right\}, \tag{1}
\]

where D denotes the number of components.

Examples of compositional data may be found in many different fields of study, and the extensive scientific literature that has been published on the proper analysis of this type of data is indicative of its prevalence in real-life applications.¹

¹ For a substantial number of specific examples of applications involving compositional data, see Tsagris and Stewart (2020).

arXiv:2002.05137v3 [stat.ME] 12 May 2021

It is perhaps not surprising, given the widespread occurrence of this type of data, that many compositional data analysis applications involve covariates. In sedimentology, for example, samples were collected from an Arctic lake and the change in their chemical composition