Evaluation of Non-linearity in MIR Spectroscopic Data for Compressed Learning

Dixon Vimalajeewa, Donagh Berry, Eric Robson, Chamil Kulatunga
Telecommunications Software and Systems Group, Waterford Institute of Technology, Waterford, Ireland
Teagasc, Animal & Grassland Research and Innovation Centre, Moorepark, Fermoy, Co. Cork, Ireland
Email: dvimalajeewa@tssg.org, Donagh.Berry@teagasc.ie, {erobson, ckulatunga}@tssg.org

Abstract: Mid-Infrared (MIR) spectroscopy has emerged as the most economically viable technology to determine milk values as well as to identify a set of animal phenotypes related to health, feeding, well-being and environment. However, Fourier-transform MIR spectra contain a significant amount of redundant data. This creates critical issues such as increased learning complexity when performing Fog and Cloud based data analytics in smart farming. These issues can be resolved by compressing the data with unsupervised techniques like PCA and performing analytics in the compressed domain, i.e. without decompressing. Compression algorithms should preserve the non-linearity of MIRS data (if it exists), since emerging advanced learning algorithms can use it to improve their prediction accuracy. This study investigated the non-linearity between the feature variables in the measurement domain as well as in two compressed domains obtained with standard Linear PCA and Kernel PCA. The non-linearity between the feature variables and the commonly used target milk quality parameters (Protein, Lactose, Fat) was also analyzed. The study evaluates the prediction accuracy using PLS and LS-SVM as linear and non-linear predictive models, respectively.

1. Introduction

Advances in pervasive computation and communication technologies with IoT systems have resulted in the rapid adoption of Fog/Edge computing based data analytics to discover near real-time insights in smart farming [1].
The opportunity to collect and analyze millions of high-resolution data points demands distributed analytics across resource-constrained Fog devices rather than centralizing raw data. Efficient data storage, communication and processing techniques are therefore more vital [2] in Distributed Learning (DL) [6] than in learning from centralized data in such applications. This is not only because of scalability, but also because of their significant contribution to energy optimization [3], [12]. Instead of aggregating raw data, DL aggregates rich features from each data source to discover high-quality global knowledge. The success of DL depends on aggregating knowledge at the same level of accuracy that centralized learning could achieve. Therefore, one of the important tasks in DL is to prepare data in a compressed feature space that maximizes information extraction while minimizing computation, communication and storage resource consumption [2], [4].

Pasture-based dairy farming is one industry that has data sources distributed over a large terrain and requires such optimized systems to accelerate current farming strategies [7]. In smart dairy farming, farms are adopting new technologies such as per-animal milk yield and quality monitoring, sensor-based animal behaviour tracking [5] and robotic milking to improve the quality and efficiency of dairy production. Among them, Mid-Infrared Spectroscopic (MIRS) milk quality monitoring and its association analysis with other factors are vital for milk value analysis and for identifying associated phenotypes [8]. To apply DL to these datasets, a Compressed Learning (CL) approach (explained in Section 2) is commonly used to extract descriptive features from the raw data. Prior knowledge of the general characteristics of the data is essential for a lossy CL approach to retain the precision of learning.
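To make the idea of learning in a compressed feature space concrete, the following is a minimal sketch and not the pipeline used in this study. It assumes synthetic, highly redundant "spectra" in place of real MIRS data, uses NumPy's SVD as a stand-in for linear PCA, and fits ordinary least squares on the compressed scores. All function names and parameters (make_spectra, pca_fit, n_components and so on) are hypothetical.

```python
import numpy as np


def make_spectra(n_samples=200, n_wavelengths=300, n_latent=3, seed=0):
    """Synthetic stand-in for spectra: a few latent factors drive many channels."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, n_latent))
    loadings = rng.normal(size=(n_latent, n_wavelengths))
    X = latent @ loadings + 0.01 * rng.normal(size=(n_samples, n_wavelengths))
    # Assumed target: a linear function of the latent factors plus small noise.
    weights = np.linspace(1.0, 2.0, n_latent)
    y = latent @ weights + 0.01 * rng.normal(size=n_samples)
    return X, y


def pca_fit(X, n_components):
    """Return the training mean and the top principal directions (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T


def pca_transform(X, mean, components):
    """Project centred data onto the principal directions (the compressed domain)."""
    return (X - mean) @ components


def fit_linear(Z, y):
    """Ordinary least squares with an intercept, fitted on compressed scores."""
    A = np.column_stack([Z, np.ones(len(Z))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(Z, coef):
    return np.column_stack([Z, np.ones(len(Z))]) @ coef


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


X, y = make_spectra()
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Compress once, then learn and predict without ever decompressing.
mean, components = pca_fit(X_train, n_components=3)
Z_train = pca_transform(X_train, mean, components)
Z_test = pca_transform(X_test, mean, components)

coef = fit_linear(Z_train, y_train)
score = r2(y_test, predict_linear(Z_test, coef))
print(Z_train.shape, round(score, 3))
```

In this sketch the model is trained on 3 scores per sample instead of 300 channels, which is where the savings in storage, communication and computation would come from. Because the synthetic data are linear by construction, a linear compressor and a linear model suffice here; whether that holds for real data is exactly the kind of prior knowledge a lossy CL approach needs.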
According to the literature [13], [14], [15], the linear or non-linear behaviour of data has a considerable impact on the accuracy of the final learning outcomes. Most of these studies were very generic in purpose, because they were based on the premise that non-linear machine learning algorithms perform better than linear techniques regardless of their complexity and the computational power they require. Linear approaches can, however, achieve the same precision as non-linear techniques with less computation. Nevertheless, recent data analytics, which can perform complex learning with modern computational power, focus on employing the most accurate learning approach. Therefore, understanding the original characteristics of the data, in particular non-linearity, is vital in CL.

In this study, we investigated the linear and non-linear behaviours of a MIRS dataset (Fig. 1) in the context of milk quality predictions. First, pre-processing removed the impact of water absorbances from our dataset. Then the non-linearity between the features in the measurement domain as well as in the compressed domain was investigated for different milk quality parameters. Next, the CL approach was used to perform learning from the compressed data, which reduced learning complexity. The impact of non-linearity was taken into account during data compression by using linear (standard) principal component analysis (LPCA) and Kernel PCA (KPCA) techniques. The learning accuracy obtained with compressed-domain data was explored with a linear and a non-linear statistical predictive model: partial least squares (PLS) and least squares support vector machine (LS-SVM), respectively. Section 1 has provided an introduction to the paper with its