International Journal of Computer Applications (0975 – 8887)
Volume 181 – No. 50, April 2019

Soccer Analytics using Machine Learning

Abha Tewari, Asst. Prof, VESIT
Tushar Parwani, VESIT, Mumbai
Ajinkya Phanse, VESIT, Mumbai
Akshay Sharma, VESIT, Mumbai
Anush Shetty, VESIT, Mumbai

ABSTRACT
Sports analysis is a rapidly growing area of sports science, driven by ever easier internet access and the growing recognition of machine learning. Soccer is a motivating area for such analysis, as it is considered far more complex and dynamic than many other sports. It is also the world's most popular sport, played in over two hundred countries. Many methodologies, approaches and measures are being used to develop prediction systems. This paper aims to predict the outcome of matches in the English Premier League (EPL) by studying trends from previous matches and identifying the most important attributes required to predict the result accurately. XGBoost, Support Vector Machine and Logistic Regression models were considered, and the most effective among them was chosen to build the prediction model. The model is applied to real team information and fixture results gathered from http://www.football-data.co.uk/ for the past few seasons.

Keywords
Football, Prediction, Machine Learning, F-SCORE

1. INTRODUCTION
Prediction systems have proven important in a range of fields such as stock markets, sports, online search, and so on. In sports, these systems can be especially useful for coaches to analyse the performance of the squad, improve their game plan, etc. Sports betting has also been growing rapidly over the past few years. As a result, machine learning is currently a highly popular approach.
To predict the likely outcome of the most-watched football event, previous researchers performed numerous simulations and adopted three modelling approaches: Poisson regression models, random forests, and ranking methods. The Poisson regression approach was used to model the previous scores of the competing teams as (conditionally) independent variables. With love for the game and inspiration from these researchers, we decided to predict the results of football matches in the Barclays Premier League, which is hailed as the most exciting football league in the world. The League operates on a promotion and relegation basis, with twenty teams competing with one another to achieve their ambitions as a club. Before proceeding to the main section of the paper, we review some previous work in this field.

2. LITERATURE SURVEY
From the results of paper [1], we understood techniques for improving the efficiency of prediction. Traditionally, many models have been built to predict results using the goals scored by each team as a metric, and the paper shows the inconsistencies that this introduces. Instead, it uses an “expected goals” metric, which takes the team's performance into consideration rather than just the goals scored. By analysing this work, we devised techniques to clean the dataset and introduce new attributes that provide in-depth metrics for accurate prediction of the winning team.

In paper [2], the authors propose a logistic regression model to estimate 2015/2016 Barclays Premier League match results with an accuracy of around 69.5%. They develop this model with the help of data from the Barclays Premier League and sofifa.com, using four significant variables: Home Attack, Home Defense, Away Attack, and Away Defense. They implement this method in software called Football Predictor.
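As a rough illustration of the kind of four-variable logistic regression described in [2], the following minimal sketch fits a home-win classifier by gradient descent. It is not the implementation from [2]: the function names, the toy ratings, the labelling rule, and the learning settings are all assumptions made for illustration, and the data is synthetic rather than taken from the Premier League or sofifa.com.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_toy_matches(n, seed=0):
    # Synthetic ratings in [0, 1] (an assumption, not real data):
    # [home_attack, home_defense, away_attack, away_defense].
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(4)]
        # Hypothetical labelling rule: the stronger side usually wins, plus noise.
        margin = (row[0] + row[1]) - (row[2] + row[3]) + rng.gauss(0.0, 0.2)
        X.append(row)
        y.append(1 if margin > 0 else 0)
    return X, y


def fit_logistic(X, y, lr=0.5, epochs=1000):
    # Batch gradient descent on the mean log-loss; w[0] is the intercept.
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_home_win(w, x):
    # Estimated probability that the home team wins.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


X_train, y_train = make_toy_matches(200)
weights = fit_logistic(X_train, y_train)
print("coefficients (intercept first):", [round(v, 2) for v in weights])
print("P(home win), strong home side:", round(predict_home_win(weights, [0.9, 0.9, 0.2, 0.2]), 3))
```

The fitted coefficients play the role of the regression coefficients reported by such a model, and the predicted probability can be converted to odds as p / (1 - p). In practice a library implementation would normally be preferred over hand-written gradient descent.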
Their work predicts which side will win a match (home/away) and lists details of the odds, the probability, and the regression coefficients. The model comprises just four variables but gives strong prediction accuracy.

Various techniques have been used to develop result prediction systems. In particular, football match result prediction systems have been developed with techniques such as artificial neural networks, naive Bayesian systems, k-nearest neighbor algorithms (k-NN), and others. The choice of technique depends on the application as well as the feature set. The priority of a system developer or designer in most cases is to achieve high prediction accuracy. The objective of study [3] is to investigate the performance of a Support Vector Machine (SVM) in predicting football matches. The findings showed 53.3% prediction accuracy. From [4], we conclude that XGBoost is quite an efficient algorithm for predicting results. The results obtained in [4] were quite up to the mark, given that the only parameters considered were ranking and points data. However, whether the algorithm performs as efficiently on the set of attributes used here remains to be tested.

In paper [5], the authors study multiple data mining techniques and correlate their prediction results to devise a good model for predicting matches of the Dutch football team. They use three major models, namely Generalized Boosted Models (GBM), K-nearest neighbor and Naive Bayes classification. Using GBM, they attained 60.22% accuracy on average, while the other models were not as accurate. The results of the paper were based on a dataset that included information only about the Dutch team, with no data regarding the opponent except for their FIFA ranking.
To further improve this research, more data and statistics could be taken into account, such as the opposing team's overall form in that season, and other factors such as head-to-head results or information about each team's previous games.

Different evaluation measures assess different characteristics of machine learning algorithms. The empirical evaluation of algorithms and classifiers is a matter of ongoing debate among researchers. Most measures in use today focus on a classifier's ability to identify classes correctly. The authors of [12] note other useful properties, such as failure avoidance or class discrimination, and suggest measures to evaluate such properties. These measures, namely Youden's index, likelihood, and discriminant power, are used in medical diagnosis. The paper also lists other learning problems which may benefit from the application of these measures.

3. PROPOSED METHOD
First, the dataset is passed to the system. The dataset then goes through a cleaning process, after which the data is used to find the correlation between parameters by plotting scatter plots and to define the key attributes that will be used; this is done in a Jupyter Notebook. This cleaned dataset will be given to the system along