This post is an introduction to using Python to get hold of soccer data through the StatsBomb Python package, which offers free public data! We will create a range of datasets, from competition level, to the matches within a competition, down to the more granular event-level data and even shot freeze frames! The hope is to get more people into coding by providing some starting material.
You may remember some previous posts on scraping shot data from Understat for visualisation, which you can find here. Since then, StatsBomb have made it really easy to access similar open soccer data through a Python package.
To keep things concise, today’s tutorial will only look at code, but future tutorials will look to create some visualisations in Tableau! I have lots of exciting examples in the pipeline for how we can recreate some soccer-related charts. This blog is specifically aimed at beginners.
Where to begin?
Let’s cover some resources I came across and found useful, and why.
I would consider this page by StatsBomb required reading.
–Statsbombpy – essentially this is the package we will use.
–Statsbomb Open Data – definitely worth reading the getting started section, as our theory will come from this.
You might also want to dig through StatsBomb’s terms and conditions here, as a reference for future data usage. A final thing worth raising is to use Stack Overflow when you get stuck. Hopefully today’s walkthrough gets you most of the datasets you need, but this website comes in handy more often than not!
If you’d like to download my code, you can find it at the link under the title, in the GitHub repo. We will explain the code as we go along.
You may have previously read viz legend Alexander Varlamov’s tutorial on extracting pass data. This was my starting point too. In addition, this Towards Data Science blog has a run-through of using the repo to extract datasets.
Luckily, StatsBomb have made this a lot easier to do now, but we will want to carry the theory from these blogs forward.
What do I mean by that?
Here is the repo for the open data. We’ll actually be using a package that retrieves this data for us, but the repo will be useful as a reference point as we dive in, so we will explore its folder structure. As we will be using a prebuilt package, there is no need to download the repo itself.
I’ve chosen the FA Women’s Super League data for this example. You will note that each competition has an ID associated with it (37 in this case) and each season of that competition has an ID too (90 in this case).
Load up your Python IDE; I currently use PyCharm.
You will want to pip install statsbombpy, pip install pandas and pip install numpy for this run-through.
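As a rough sketch of this first step (not necessarily the exact script from the repo), the competition export could look like the block below. The function name `export_competitions` and the file name `competitions.csv` are placeholders I have made up for illustration; `sb.competitions()` is the statsbombpy call that returns the competition table as a pandas DataFrame.

```python
def export_competitions(path="competitions.csv"):
    """Fetch the open-data competition list and save it as a CSV."""
    # Imported inside the function so that statsbombpy is only needed
    # at the moment the function is actually called.
    from statsbombpy import sb

    # One row per competition/season pair in the open data.
    competitions = sb.competitions()

    # index=False leaves the pandas row numbers out of the file.
    competitions.to_csv(path, index=False)
    return competitions
```

Calling `export_competitions()` needs an internet connection, as the package fetches the data for us.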
Wow. So this gets all the competition details and exports them to a CSV for us in a nice, clean, readable format. The dataset lists a whole range of competitions, along with the season details and time frame of each.
Small side note: you may see StatsBomb print “credentials were not supplied. open data access only” in the run console.
Fear not, this just means that we haven’t provided credentials for the private, paid-for data, hence the reference to the open-data source at the start of the tutorial! If all goes as expected, your code will still finish with exit code 0.
Next, let’s take competition 37 and season 90, which is the 2020/21 WSL, and find all the match details for that league season.
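A minimal sketch of the match pull, assuming the same setup as before, might look like this. `sb.matches()` is the statsbombpy call that takes a competition ID and a season ID; the function name `export_matches` and the file name `matches.csv` are placeholders of mine.

```python
def export_matches(competition_id=37, season_id=90, path="matches.csv"):
    """Fetch every match for one competition season and save it as a CSV."""
    # Imported here so statsbombpy is only needed when the function runs.
    from statsbombpy import sb

    # 37 / 90 are the FA Women's Super League and its 2020/21 season.
    matches = sb.matches(competition_id=competition_id, season_id=season_id)

    matches.to_csv(path, index=False)
    return matches
```

Running `len(export_matches())` afterwards is a quick way to see how many matches came back.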
The WSL has 12 teams, so in our dataset we would expect to see (12*12)-12 matches. That’s 144-12=132, i.e. each team plays every other team twice (home and away) but can’t play themselves.
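If you would rather let Python do the counting, here is a small check of that arithmetic. The team names are just placeholders; only the number of teams matters.

```python
from itertools import permutations

# Placeholder names: only the number of teams matters for the count.
teams = [f"Team {i}" for i in range(1, 13)]

# Every ordered (home, away) pair of two different teams is one fixture.
fixtures = list(permutations(teams, 2))

expected_matches = len(fixtures)
print(expected_matches)  # 132, the same as 12 * 12 - 12
```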
Interesting, our data only shows 131 records. Why is this? You will see the Tottenham – Birmingham game does not show as it was a walkover.
Therefore, I am happy with the dataset. Again, we have exported it into a nice readable format. This dataset includes when the matches were played, the teams and score amongst other useful metrics.
Next we want to get the event details for a specific match. The Chelsea – Bristol game looks like it was action-packed at 9-0. Its match ID is 3764235. We follow the documentation to extract the event-level data.
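As a sketch, pulling the events for that one match could look like the block below. `sb.events()` is the statsbombpy call that takes a match ID and returns one row per event; the wrapper name `load_match_events` is a placeholder of mine.

```python
def load_match_events(match_id=3764235):
    """Return every event for one match as a pandas DataFrame."""
    # Imported here so statsbombpy is only needed when the function runs.
    from statsbombpy import sb

    # One row per event: passes, shots, carries and so on.
    return sb.events(match_id=match_id)
```

Something like `match_events = load_match_events()` then gives us the table we reshape in the next steps.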
Next is a chunk of code that may look a little more confusing. Firstly, the location details in the event table need a little reformatting. In summary, we split the values out into new columns holding coordinate integers.
We do the same type of transformation for the shot location column. The shot column doesn’t always hold three values, so we take care when assigning the x, y and z of the shot. Here is a before and after of the transformation we apply to the location field.
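To make the idea concrete, here is one way the split could be done. This is my own sketch on invented sample rows, not the exact code from the repo: the column names `location` and `shot_end_location` are assumptions about how the events table is laid out, the new column names and the helper `split_coordinates` are made up, and I leave the values as decimals rather than casting them.

```python
import numpy as np
import pandas as pd

# Invented sample rows that mimic the shape of the events table.
match_events = pd.DataFrame(
    {
        "id": ["e1", "e2", "e3"],
        "type": ["Pass", "Shot", "Half Start"],
        "location": [[60.0, 40.0], [108.5, 36.2], np.nan],
        "shot_end_location": [np.nan, [120.0, 38.1, 1.4], np.nan],
    }
)


def split_coordinates(series, names):
    """Turn a column of coordinate lists into one numeric column per name."""
    rows = []
    for value in series:
        if isinstance(value, (list, tuple)):
            # Pad short lists (a shot end point without z) with NaN.
            padded = list(value) + [np.nan] * len(names)
            rows.append(padded[: len(names)])
        else:
            # No location recorded for this event.
            rows.append([np.nan] * len(names))
    return pd.DataFrame(rows, columns=names, index=series.index)


# Add x/y columns for every event location.
match_events = match_events.join(
    split_coordinates(match_events["location"], ["location_x", "location_y"])
)

# Add x/y/z columns for the shot end point; z stays NaN when it is absent.
match_events = match_events.join(
    split_coordinates(
        match_events["shot_end_location"],
        ["shot_end_x", "shot_end_y", "shot_end_z"],
    )
)
```

Padding with NaN means every row ends up with the same set of columns, whether or not the event had a location.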
The final transformation is simply choosing the columns I want to keep in my final table. If you want the full list of columns, you could remove this step and export the match events DataFrame to a CSV.
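Picking columns is just a case of passing a list of names. The sketch below uses invented rows, and the list of columns to keep is only an example, so swap in whichever ones you need.

```python
import pandas as pd

# Invented sample rows standing in for the reshaped events table.
match_events = pd.DataFrame(
    {
        "id": ["e1", "e2"],
        "minute": [3, 17],
        "team": ["Team A", "Team A"],
        "player": ["Player A", "Player B"],
        "type": ["Pass", "Shot"],
        "possession": [5, 31],
        "location_x": [60.0, 108.5],
        "location_y": [40.0, 36.2],
    }
)

# Example selection only: list the columns you want in the final table.
keep_columns = ["id", "minute", "team", "player", "type", "location_x", "location_y"]

# Indexing with a list returns just those columns, in that order.
match_events_final = match_events[keep_columns].copy()
```

From there, `match_events_final.to_csv(...)` with a file name of your choice writes the table out.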
The last thing we do is create a final dataset that takes all the shot events for the chosen match.
The shot freeze frame data is stored in a column that we have to unpack. Here is a before and after to show what the code is doing.
What you will see is that the raw data previously held in the “Shot freeze frame” column is now split out. The shot freeze frame column is only populated where “type” is equal to “shot”.
This new data is saved to a separate CSV (shot_freeze_frame).
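Here is a sketch of how the unpacking can work, using pandas’ `explode` and `json_normalize` on invented sample rows. I am assuming the column is called `shot_freeze_frame` and that each freeze frame is a list of small dictionaries (one per player in the frame); the player names and coordinates below are made up.

```python
import pandas as pd

# Invented sample rows: one pass (no freeze frame) and two shots.
match_events = pd.DataFrame(
    {
        "id": ["e1", "e2", "e3"],
        "type": ["Pass", "Shot", "Shot"],
        "shot_freeze_frame": [
            None,
            [
                {"location": [118.0, 40.0], "player": {"id": 1, "name": "Keeper A"},
                 "position": {"id": 1, "name": "Goalkeeper"}, "teammate": False},
                {"location": [105.0, 38.0], "player": {"id": 2, "name": "Forward B"},
                 "position": {"id": 23, "name": "Center Forward"}, "teammate": True},
            ],
            [
                {"location": [119.0, 41.0], "player": {"id": 1, "name": "Keeper A"},
                 "position": {"id": 1, "name": "Goalkeeper"}, "teammate": False},
            ],
        ],
    }
)

# Keep only the shots, then give every player in a freeze frame its own row.
shots = match_events[match_events["type"] == "Shot"]
exploded = (
    shots[["id", "shot_freeze_frame"]]
    .explode("shot_freeze_frame")
    .dropna(subset=["shot_freeze_frame"])  # guards against shots with no frame
    .reset_index(drop=True)
)

# Flatten each dictionary into columns such as player.name and position.name.
players = pd.json_normalize(exploded["shot_freeze_frame"].tolist())

# Put the shot id back next to each player so we can join on it later.
shot_freeze_frame = pd.concat([exploded[["id"]], players], axis=1)

# Split the [x, y] location list into two plain columns.
shot_freeze_frame["x"] = [loc[0] for loc in shot_freeze_frame["location"]]
shot_freeze_frame["y"] = [loc[1] for loc in shot_freeze_frame["location"]]
```

Each shot id now appears once per player in its freeze frame, which is why the id repeats in the output.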
How can we double-check this is correct? Well, we can take our 37 shots from the events data and compare their IDs to those in the freeze frame data after deduplicating it. It gives 37 unique IDs.
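That check can be sketched in a few lines. The rows below are invented (two shots rather than 37), but the comparison is the same idea.

```python
import pandas as pd

# Invented sample tables: three events, two of which are shots.
match_events = pd.DataFrame(
    {"id": ["e1", "e2", "e3"], "type": ["Pass", "Shot", "Shot"]}
)
shot_freeze_frame = pd.DataFrame(
    {"id": ["e2", "e2", "e3", "e3", "e3"], "teammate": [True, False, False, True, False]}
)

# IDs of every shot in the events table.
shot_ids = set(match_events.loc[match_events["type"] == "Shot", "id"])

# Unique IDs in the freeze frame table (each shot repeats once per player).
frame_ids = set(shot_freeze_frame["id"].drop_duplicates())

print(len(shot_ids), len(frame_ids))  # 2 2

# Any shot that has no freeze frame rows would show up here.
missing = shot_ids - frame_ids
```

An empty `missing` set means every shot made it into the freeze frame table.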
And there we have it, a short run through of how to go through multiple levels of StatsBomb data from competition level to individual event level.
One last thing to note: if we wanted to start visualising the shot data with the freeze frame added in, we could left join our match_shot_freeze_frame dataset to match_events on id = id. This is the id of that specific event! (Do not confuse it with match_id!)
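In pandas that join is a `merge`. The sketch below uses invented rows, and I have assumed the freeze frame table sits on the left so that every player row is kept and picks up the details of its shot.

```python
import pandas as pd

# Invented sample tables sharing the event id.
match_events = pd.DataFrame(
    {
        "id": ["e1", "e2", "e3"],
        "type": ["Pass", "Shot", "Shot"],
        "minute": [3, 17, 52],
        "player": ["Player A", "Player B", "Player C"],
    }
)
match_shot_freeze_frame = pd.DataFrame(
    {
        "id": ["e2", "e2", "e3"],
        "teammate": [True, False, False],
        "x": [105.0, 118.0, 119.0],
        "y": [38.0, 40.0, 41.0],
    }
)

# Left join on the event id: every freeze frame row is kept and gains the
# columns of the shot it belongs to. validate raises an error if an id
# appears more than once in match_events.
shots_with_context = match_shot_freeze_frame.merge(
    match_events, on="id", how="left", validate="many_to_one"
)
```

The result has one row per player in each freeze frame, with the minute and shooter of the shot alongside.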
As mentioned before we will look to build some visualisations off the back of the datasets created in future blogs.
Why not try applying this logic to a different competition?
Why not try capturing all events within a competition using a loop?
Try using the dataset to build a visualisation in Tableau.
Why not use some of the mplsoccer tutorials?
Consider what further data preparations would be needed to transform the dataset for building out specific chart types, shot maps, freeze frames etc.
Here are some starting resources for when we start to visualise the data: