Cluster Comput
DOI 10.1007/s10586-017-1212-x

A revised framework of machine learning application for optimal activity recognition

Mohsin Bilal 1 · Faisal K. Shaikh 2 · Muhammad Arif 1 · Mudasser F. Wyne 3

Received: 22 March 2017 / Revised: 30 August 2017 / Accepted: 17 September 2017
© Springer Science+Business Media, LLC 2017

Abstract Data science augments manual data understanding with machine learning for a potential increase in performance. In this paper, data science methodology is examined to enhance the application of machine learning in smartphone-based automatic human activity recognition (HAR). A modified feature engineering step and a novel post-learning data engineering step are proposed in the machine learning framework as an alternative to manual data understanding for effective HAR. The proposed framework is examined on two different HAR data sets, demonstrating that data-driven machine learning can achieve near-optimal classification of activities. The proposed framework is both effective and efficient when compared with existing methods. The modified feature engineering resulted in 42% fewer features required by a support vector machine to yield 97.3% correct recognition of human physical activities. The addition of post-learning data engineering further improved the model to 99% classification accuracy, which is almost optimal performance.

Keywords Data-driven machine learning framework · Activity recognition · Post-learning data engineering · Composite feature set · Smartphone sensors

✉ Mohsin Bilal
mbhashmi@uqu.edu.sa

1 College of Computer and Information Systems, Umm Al Qura University, Makkah, Saudi Arabia
2 Department of Telecommunication Engineering, Mehran University of Engineering and Technology, Jamshoro 76062, Pakistan
3 School of Engineering and Computing, National University, San Diego, USA

1 Introduction

Data and algorithms are central to computing and informatics.
Machine learning has, over the last few decades, been devoted to developing algorithms that learn the underlying patterns in data. This has produced a set of valuable learning algorithms for many real-world applications. These applications are changing the way science, engineering, arts, and entertainment are done. At the same time, the widespread use of computing technologies has made it possible to capture data continuously and at large scale. A huge amount of data is generated every second in many different forms. Learning from this data is becoming a challenge for researchers in computing and informatics. In due course, new fields are emerging in which the study of data is augmented with machine learning for better knowledge discovery. A recent trend, widely known as data science, focuses on business analytics and decision support. In this paper, we argue that data science wraps more data understanding around the machine learning algorithm to obtain greater benefits for business analytics and decision support. Alternatively, it may become a natural heuristic for optimizing the performance of the machine learning framework in general. Therefore, in this paper, we investigate the performance of machine learning applications in light of current data science practices, with the following motivations.

– The major goal of this research is to augment data understanding through dual data engineering in the machine learning framework, in such a way that it extends the performance boundaries of machine learning applications. This is consistent with the data science methodology, where post-learning data understanding is done manually, mainly for model optimization to achieve the targeted learning goals. However, in addition to serving the current