Data-driven performance analysis in soccer: A compilation of data science and machine learning techniques for pre-processing and knowledge discovery

Publication: Book/Report, Dissertations


Big data increasingly influences decision-making and knowledge-discovery processes across many domains. Gathering, pre-processing and analysing data to support decisions or knowledge discovery is a non-trivial task. With increasing access to computing power, a multitude of data science techniques have emerged as problem-solving tools in many domains of society. In the last decade, a vast amount of data has been collected in the field of sports. Several labs across the world have used this data in conjunction with Machine Learning/Data Science (ML/DS) techniques to add significant value to the sports industry and academia. However, compared to the potential of the available data, only a small subset has been exploited. This is primarily due to a lack of the coding/programming expertise required to get the data into a form that is suitable for building models to answer specific questions of interest. The pre-processing bottleneck has been partially resolved by data resources such as OPTA, STATSBomb and, which provide clean tracking, event and notational data. Additionally, libraries in Python and R such as Floodlight, AMIE, and SoccerAction streamline the pre-processing and visualization steps, giving domain experts with limited coding expertise much better access to big data analysis. With these developments in mind, the current thesis aims to introduce ML/DS methods such as regression, binary classification, feature engineering and k-fold cross-validation into the field of sports analytics. This can potentially provide domain experts with the technical tools needed to exploit the growing amount of data in the sports industry.

Through published case studies, each of which addresses a specific hypothesis, the thesis explains the importance of normalizing KPIs as a feature engineering step before statistical modelling. It also elaborates on the value of k-fold cross-validation as a model evaluation criterion for both regression and classification problems. The thesis further emphasises the value of applying multiple ML models to the same problem as a robustness check, to avoid false findings caused by the bias of a single algorithm. The provided methods can potentially be applied across research in general, but bat-and-ball sports such as cricket and baseball seem particularly conducive to big data analysis using ML. This is due to their closed-action nature (one action, one reaction, leading to a result of that action-reaction pair), which gives them a lower degree of randomness than invasion sports. The thesis has a few limitations due to its scope. It only covers binary classification and two different regression methods, which require comparatively little processing power. More complex methods such as neural networks and deep learning, which might improve the observed results, are outside the scope of the thesis. Although comprehensive, the thesis is still not an end-to-end pipeline. It covers the modelling stage of the knowledge discovery cycle, and only at a match or season level. Future research needs to apply the techniques outlined in the current thesis to event or play-by-play data. Furthermore, the pre-processing and visualization steps need to be the focus of future research in conjunction with the findings of the current work. In conclusion, sports research needs to leverage big data to find novel solutions to a wide array of problems across sports domains, and ML/DS methods seem to be the ideal tool for this. Specifically, the crucial steps are the normalization of notational data, the use of multiple models for robustness, and k-fold cross-validation to determine the out-of-sample validity of the findings. Furthermore, the thesis provides an introduction to how data science techniques and multidisciplinary approaches can help the sports industry and research.
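The three steps named above (normalization, multiple models, k-fold cross-validation) can be illustrated with a minimal Python sketch using only the standard library. It is not the thesis's own code: the per-90-minutes normalization, the two toy models, the synthetic data and all function names (`per_90`, `k_fold_indices`, `fit_mean`, `fit_linear`, `cross_validate`) are assumptions chosen purely for illustration.

```python
import random
from statistics import mean


def per_90(count, minutes_played):
    """Normalise a raw count KPI to a per-90-minutes rate (assumed normalisation)."""
    return 90.0 * count / minutes_played


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices once and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Ordinary least squares with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def cross_validate(fit, xs, ys, k=5, seed=0):
    """Return the mean out-of-sample squared error over k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training folds only, then score on the held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in test_idx))
    return mean(fold_errors)


# Synthetic match-level data (hypothetical, for illustration only).
rng = random.Random(42)
minutes = [rng.choice([90, 95, 120]) for _ in range(60)]
shots = [rng.randint(5, 25) for _ in range(60)]
shots_per_90 = [per_90(s, m) for s, m in zip(shots, minutes)]
target = [0.2 * x + rng.gauss(0, 0.3) for x in shots_per_90]

# Evaluate several models on the same folds and compare their errors.
models = {"mean_baseline": fit_mean, "linear": fit_linear}
scores = {name: cross_validate(fit, shots_per_90, target) for name, fit in models.items()}
for name, mse in scores.items():
    print(f"{name}: cross-validated MSE = {mse:.3f}")
```

Each model is scored on identical folds, so the comparison reflects out-of-sample error and not the fit to the training data. In practice, libraries such as scikit-learn provide equivalent utilities (for example `KFold` and `cross_val_score`).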
Original language: English
Place of Publication: Köln
Publisher: Deutsche Sporthochschule Köln
Number of pages: 25
Publication status: Published - 13.07.2023

