On establishing nonlinear combinations of variables from small to big data for use in later processing

Jerry M. Mendel *, Mohammad M. Korjani

Signal and Image Processing Institute, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2564, United States

Article history: Received 29 December 2013; Received in revised form 29 March 2014; Accepted 25 April 2014; Available online 9 May 2014

Keywords: Big data; Causal combination; Fast processing; Nonlinear combination; Parallel and distributed processing; Preprocessing

Abstract

This paper presents a very efficient method for establishing nonlinear combinations of variables from small to big data for use in later processing (e.g., regression, classification, etc.). Variables are first partitioned into subsets, each of which has a linguistic term (called a causal condition) associated with it. Our Causal Combination Method uses fuzzy sets to model the terms and focuses on interconnections (causal combinations) of either a causal condition or its complement, where the connecting word is AND, which is modeled using the minimum operation. Our Fast Causal Combination Method is based on a novel theoretical result, leads to an exponential speedup in computation, and lends itself to parallel and distributed processing; hence, it may be used on data from small to big.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Suppose one is given data of any size, from small to big, for a group of v input variables that one believes caused¹ an output, and that one does not know which (nonlinear) combinations of the input variables caused the output. This paper presents a very efficient method for establishing the initial (nonlinear) combinations of variables that can then be used in later modeling and processing.
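As a rough sketch of the idea summarized in the abstract, a causal combination ANDs, via the minimum operation, the fuzzy membership of each causal condition or of its complement (1 minus the membership). The function name and the membership values below are hypothetical illustrations, not the paper's notation:

```python
# Sketch of a causal-combination firing level, assuming hypothetical
# membership values. Each causal condition has a fuzzy membership in
# [0, 1]; a combination ANDs (via minimum) either each condition's
# membership or its complement (1 - membership).

def firing_level(memberships, use_complement):
    """Minimum over conditions, taking 1 - mu wherever the complement is used."""
    return min(
        (1.0 - mu) if comp else mu
        for mu, comp in zip(memberships, use_complement)
    )

# Hypothetical memberships for three causal conditions:
mu = [0.9, 0.4, 0.7]
# Combination: condition 1 AND (NOT condition 2) AND condition 3
level = firing_level(mu, [False, True, False])
print(level)  # min(0.9, 1 - 0.4, 0.7)
```

With 2^v such combinations possible over v conditions, enumerating and evaluating them naively is expensive, which is the cost the paper's fast method is designed to avoid.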
For example, in nonlinear regression (e.g., [21,22]) one needs to choose the nonlinear interactions among the variables as well as the number of terms in the regression model;² in pattern classification (e.g., [7,2]) that is based on mathematical features (e.g., [23]) one needs to choose the nonlinear nature of those features as well as the number of such features; and in some neural networks (e.g., [11]) one needs to know which combinations of the inputs, and how many such combinations, should be fanned out to one or more of the network's various layers. Our Causal Combination Method (CCM), described in Section 3, provides the initial combinations of the variables as well as their number, and can also be used in later processing to readjust the combinations of the variables, as well as their number, that are used in a model. Our Fast Causal Combination Method (FCCM), also described in Section 3, is a very efficient way of implementing CCM for data of any size.

Establishing which combinations of variables to use in a model can be interpreted as a form of data preprocessing. According to [24]: "Data preprocessing is a data mining technique that involves transforming raw data into an understandable

* Corresponding author. Tel.: +1 213 740 4445; fax: +1 213 740 4456. E-mail address: mendel@sipi.usc.edu (J.M. Mendel).
¹ How to choose the variables is crucial to the success of any model. This paper assumes that the user has already established the variables that (may) affect the outcome.
² According to [5, p. 20], "... in practice, due to complex and often informal nature of a priori knowledge, ... specification of approximating functions may be difficult or impossible."

Information Sciences 280 (2014) 98–110. http://dx.doi.org/10.1016/j.ins.2014.04.042