This post covers an optional introduction to Python for getting hold of soccer data through Understat, as well as how to load the data into Tableau and map the shots on a pitch. Some example datasets have been created for those who just want to build the Tableau elements. A fun one for #SportsVizSunday folk!
Unlike previous blog posts, where I cover the thought processes behind ideas, I am hoping this tutorial will be a bit more substantial so that you can follow along and create your own.
It has been one of my goals for the year to get more familiar with Python and its usability. I am grateful that I am surrounded by encouragement from our team at work, those doing the #100DaysofCode course in the wider data community, as well as following some stars in the sports community. You may recently have seen Mckay Johns, on the most recent “What’s Good?” blog, or the headway Alexander Varlamov is making in his various sports tutorials.
If you’d rather not use Python, skip straight to the datasets I’ve created and jump forward to the Tableau part in the bottom half of the blog!
The datasets can be found on my GitHub repo, linked from the top of the page, in the Datasets Season 2019:20 folder. The League Player data csv export includes an aggregated view of every player’s stats in the chosen league for the 2019/20 season, covering details such as number of games played, total cards and number of goals scored. The code then takes these players and finds every single shot they took within the league in that season, saved in the Player shot data csv. These files are split by league, and are also for the season 2019/20.
I have chosen to use PyCharm to run the code, but it’s up to you. You can download PyCharm Community from the link, here.
The code can be found in the GitHub repo in the main.py file.
It will look like the below. I tried to make it both user friendly and easy to understand for anyone just starting out.
1. Create a new project and copy the code from the repo.
2. Paste the code into the new project area.
3. You will initially see that understatapi in line 6 above has a red line under it. This is because it is a package we do not have installed yet. Hover over it and click install package understatapi.
If this doesn’t work, locate the terminal at the bottom of the page and write “pip install understatapi”.
You will see the red line disappear once the package has installed.
(Update 27/04: Mo Wootten kindly offered to do some testing prior to release and pointed out that the understatapi package depends on numpy 1.20 or newer, which meant that initially he couldn’t successfully install it. It comes up with an error about import numpy.typing, a module that was only added in numpy 1.20.
To get round this you will need to update your numpy package or install the ‘typing-extensions’ package before installing understatapi.)
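If you want to check whether you are affected before installing anything, a small sketch like the one below simply tries the same import that the error message complains about. The function name is made up for illustration and is not part of understatapi.

```python
import importlib


def numpy_typing_available() -> bool:
    """Return True if numpy.typing can be imported in this environment."""
    try:
        importlib.import_module("numpy.typing")
    except ImportError:
        # Raised both when numpy is missing and when it is too old
        # to ship the numpy.typing module.
        return False
    return True


if __name__ == "__main__":
    if numpy_typing_available():
        print("numpy.typing was found")
    else:
        print("numpy.typing is missing, try: pip install --upgrade numpy")
```

Run it in the same interpreter that your PyCharm project uses, otherwise it may report on a different numpy installation.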
Now all we have to do is….
CLICK RUN (and follow the terminal steps, see point 5)
What does the code do?
We are tapping into understatapi. If you hover over the word, it gives you some more details as to what is included in the package.
The first two are print statements. You will see them pop up in the terminal.
5. You will want to click into the terminal and write your response choosing from the options. Why not try “EPL” and “2019” to start.
The code takes your text input and saves it as the choice made.
We later refer to these choices, so it is vital you spell the league name and year correctly, otherwise the code fails. To get a better idea of which seasons are available, make sure to check out the understat.com site! For example, here are the EPL’s season choices. We type in 2019 to get season 2019/2020.
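Because a typo in either answer makes the script fail later on, one option is to check the typed values straight away. This is only a sketch and not what main.py does: the function names are hypothetical, and the list of league codes is an assumption based on how the leagues appear on understat.com, so do check it against the options the script prints.

```python
from typing import Tuple

# Assumed league codes, taken from how the leagues appear on understat.com.
KNOWN_LEAGUES = {"EPL", "La_Liga", "Bundesliga", "Serie_A", "Ligue_1", "RFPL"}


def validate_choices(league: str, season: str) -> Tuple[str, str]:
    """Tidy up the two typed answers and fail early if they look wrong."""
    league = league.strip()
    season = season.strip()
    if league not in KNOWN_LEAGUES:
        raise ValueError(
            f"Unknown league {league!r}, expected one of {sorted(KNOWN_LEAGUES)}"
        )
    if not (season.isdigit() and len(season) == 4):
        raise ValueError(
            f"Season should be a four-digit start year such as '2019', got {season!r}"
        )
    return league, season


def season_label(season: str) -> str:
    """Turn the start year into the usual label, e.g. '2019' becomes '2019/2020'."""
    return f"{season}/{int(season) + 1}"


if __name__ == "__main__":
    league_choice, season_choice = validate_choices(" EPL ", "2019")
    print(league_choice, season_label(season_choice))
```

Stripping the spaces first means an accidental trailing space after “EPL” no longer breaks the run.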
6. What happens next?
We find all the league player data. Here it refers to our chosen league and season.
Next we export this data into a CSV. Setting the index to False removes an unneeded row-counter column.
We rename the ID column within the dataset to player_id.
After this I take all the team names from league_player_data and create a set (a distinct version of the list), because I will want to refer to it later within the loop.
We then print a message for simplicity stating how many players are in the chosen league, in the count statement.
(At this point we have reached the league_player_data.csv stage.)
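To make the rename and the set of teams a little more concrete, here is a minimal sketch using plain Python dictionaries instead of the data frame that main.py works with. The column names id and team_title, the function name, and the sample rows are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]


def prepare_players(rows: List[Row]) -> Tuple[List[Row], Set[str]]:
    """Rename 'id' to 'player_id' and collect the distinct team names."""
    renamed = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is left untouched
        row["player_id"] = row.pop("id")
        renamed.append(row)
    # A set keeps each team name once, however many players share it.
    teams = {row["team_title"] for row in renamed}
    return renamed, teams


if __name__ == "__main__":
    # Made-up rows standing in for the real league player data.
    sample = [
        {"id": "1", "player_name": "Player A", "team_title": "Leicester"},
        {"id": "2", "player_name": "Player B", "team_title": "Leicester"},
        {"id": "3", "player_name": "Player C", "team_title": "Southampton"},
    ]
    players, teams = prepare_players(sample)
    print(f"{len(players)} players across {len(teams)} teams")
```

The set is what lets the later loop ask “is this home team one of the league’s teams?” with a single membership check.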
We will next look at the for loop statement. The code will automatically go on to start printing numbers up to the total count, 515 in this instance. I have done this because, if you’re anything like me, you’re super impatient and want to check the code is running. Once the number hits the total count it will stop and say “Process finished with exit code”.
I create a counter starting on 0 for the number of players. We will loop through each of the players (515) in the EPL to retrieve their shots with the get_shot_data() function and write these to a new csv called player_shot_data.csv.
The code opens the file in write mode if it exists; if it does not exist, it creates it. I have written it this way so the file overwrites itself each time.
For every single player within the list of 515 IDs, do the following:
Find the player count number in the list (which will be increasing by 1 each time) and find all their shot location data.
Filter the chosen player’s shot data to the chosen season_choice. Without this we would record every single shot that player has made across all seasons! We then add 1 to the player count so that the next loop takes the next player and finds all the shot data for that player.
The line ‘df3 = …’ filters the shot data to games where the h_team is in the set of teams we created from the previous dataset. If we didn’t have this filter on we would end up with shots from players who may have transferred out of or into the Premier League mid-season! Therefore we filter to shots where the players were in games from the 20 teams in the EPL.
We print the number of times we have done the loop (because I am being impatient), then append the data we collect to the player_shot_data csv (mode=a).
If it is the first time running, include the header for the column names.
Finally, the last thing to cover is the elif statement. I added this in because the code would fail if the player actually had no shot data for the chosen season. If there is no data to retrieve, this statement prints that the data frame is empty for that person and continues on to the next player. Not everyone can score goals after all!
*Do note, this may not be the most efficient code, but I am hoping it is somewhat understandable for those that are new to Python, or want to give it a go with no prior coding knowledge.
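The loop described above can be sketched roughly as follows. To keep it runnable without calling Understat, this version uses the standard library csv module rather than the data frame calls in main.py, and a stand-in fetch_shots function takes the place of get_shot_data(). The function names, the column names (season, h_team) and the sample shots are assumptions for illustration only.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, Iterable, List, Set

Shot = Dict[str, str]


def keep_shot(shot: Shot, season: str, teams: Set[str]) -> bool:
    """Keep shots from the chosen season whose home team is in the league."""
    return shot["season"] == season and shot["h_team"] in teams


def export_shots(
    player_ids: Iterable[str],
    fetch_shots: Callable[[str], List[Shot]],
    season: str,
    teams: Set[str],
    path: str,
) -> int:
    """Append each player's filtered shots to one CSV, return rows written."""
    # Opening in write mode first empties any file left from an earlier run.
    with open(path, "w", newline="", encoding="utf-8"):
        pass
    header_written = False
    rows_written = 0
    for count, player_id in enumerate(player_ids, start=1):
        shots = [s for s in fetch_shots(player_id) if keep_shot(s, season, teams)]
        print(count)  # progress indicator for the impatient
        if not shots:
            print(f"No shot data for player {player_id}, moving on")
            continue
        with open(path, "a", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(shots[0].keys()))
            if not header_written:
                writer.writeheader()  # column names only on the first write
                header_written = True
            writer.writerows(shots)
        rows_written += len(shots)
    return rows_written


if __name__ == "__main__":
    # Made-up shots standing in for what the API would return.
    fake = {
        "1": [
            {"player_id": "1", "season": "2019", "h_team": "Southampton"},
            {"player_id": "1", "season": "2018", "h_team": "Southampton"},
        ],
        "2": [],
    }
    out = os.path.join(tempfile.mkdtemp(), "player_shot_data.csv")
    total = export_shots(["1", "2"], lambda pid: fake[pid], "2019",
                         {"Leicester", "Southampton"}, out)
    print(f"{total} shot rows written to {out}")
```

Writing the header only once matters because the file is opened in append mode on every pass. Without that flag the column names would be repeated in the middle of the data.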
Once you have looped through all the players you will see your csvs appear in the top left! They will be in the same folder as your main.py, wherever you created your project. For me this is under my MacBook home and then the PyCharmProjects folder. You can do a search in Finder if you are struggling to locate the folder. I’d recommend opening them in Excel rather than in PyCharm…. it looks prettier.
If this was your very first time giving it a go, let me know how you got on. We are all on this journey together. For those unfamiliar with Python, the most important thing to note is… correct indentation matters, so make sure your copy and paste is on top form.
For any columns where you are unsure what the abbreviations mean, check the understat.com website.
If in doubt, print() it out.
Hoping all went well with the Python, we can now take our dataset and put it into Tableau!
If you didn’t fancy giving the Python a go, take a sample dataset from my GitHub repo. I have tried to check the data quality of these documents, so please message me if you spot errors.
I want to shine a light on James Smith’s blog, “How to create Football Pitches/Goals as Backgrounds in Tableau”. This is super useful pre-reading.
We will be replicating a lot of what we see here before taking it one step further. He helpfully also shares access to the pitch template we will use. You can find a copy in the repo, or save the one from below. Feel free to edit the colours of the pitch to whatever you like.
We have the X and Y co-ordinates from the player_shot_data.csv.
James Smith rightly states that the most important things to take note of are the co-ordinates of the pitch template, and making sure they sync with the background image. The other important thing will be to make sure our co-ordinates are the right way round, so that the player’s shot is on the correct side of the pitch!
We shall go through an example.
1. Connect to your player_shot_data.csv dataset. I will be using the EPL 2019 example from the Python tutorial above.
2. We will need to scale the points from between 0 and 1 to the new scale of 120 and 80, ready for the background image.
Create two new calculated fields. One with X values multiplied by 120 and another calculated field Y where you multiply your Y values by 80.
3. Plot these new values as Dimensions. We can then add the background pitch image in.
4. Go to Map, Background Images and click player_shot_data (The data source).
5. Locate your pitch background image and set the co-ordinates as below. I also like to lock the aspect ratio and always show entire image.
6. Fix the axis as below.
7. The last thing I like to do is turn off the ability to pan and zoom the map. Go to Map – Map Options and untick Allow Pan and Zoom. You will want to do this both on your dashboard and sheet.
8. Currently we have our dataset shots all on one side. But what if we want to take a look at a specific match and split the shots by team?
Let’s take a use case: Leicester’s 9-0 win over Southampton from the EPL 2019 season.
If we go on Understat this is what we will see.
If we find match 11740 and drag it onto the filter, we will filter down to all the shots within this specific game. We now want to update our previous X and Y calculations to split the teams’ shots between the two halves of the pitch, while making sure each shot comes from the correct part of the pitch. (Quick note: The match ID can be seen in the original website URL https://understat.com/match/11740 )
IF [H A] = 'a' THEN
ELSEIF [H A] = 'h' THEN
Explanation: We want the away shots on the left, the direction that they are shooting. We want our home team’s shots on the right, again the direction they are shooting. (If you want these shots split by the half of the match they were taken in, i.e. by time, you will need to change these calculations.) For now we are replicating the Understat website.
IF [H A] = 'a' THEN
ELSEIF [H A] = 'h' THEN
Similar to Understat’s website, the sizing is based on expected goals, the stars represent goals and the circles represent other chances. The colouring is based on the team.
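The scaling and the home/away split can also be sanity-checked outside Tableau. The sketch below is written in Python rather than as Tableau calculations, and the mirroring rule (away shots use 1 - X and 1 - Y before scaling) is an assumption about one way to get the away shots onto the left. It is not necessarily the exact calculation used in the workbook, and the function and constant names are made up for illustration.

```python
from typing import Tuple

PITCH_LENGTH = 120  # scaled X runs from 0 to 120
PITCH_WIDTH = 80    # scaled Y runs from 0 to 80


def scale(x: float, y: float) -> Tuple[float, float]:
    """Scale Understat's 0 to 1 co-ordinates to the pitch template."""
    return x * PITCH_LENGTH, y * PITCH_WIDTH


def place_shot(x: float, y: float, h_a: str) -> Tuple[float, float]:
    """Home shots stay where they are, away shots are mirrored (assumed rule)."""
    if h_a == "h":
        return scale(x, y)
    if h_a == "a":
        # Rotate the shot through 180 degrees so it lands on the left side.
        return scale(1 - x, 1 - y)
    raise ValueError(f"Expected 'h' or 'a', got {h_a!r}")


if __name__ == "__main__":
    print("home:", place_shot(0.9, 0.3, "h"))
    print("away:", place_shot(0.9, 0.3, "a"))
```

A quick check like this is handy for confirming that a shot taken close to goal ends up near the correct goal line before you start formatting the view.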
Can we take this one step further with map layers? Absolutely.
If you have read previous blogs of mine you may have come across a few visualisations where I have utilised layers.
A Map ranges from -90 to 90 for latitude and -180 to 180 for longitude. This means we can still use our previous X and Y co-ordinates from above.
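To see why this works, note that the scaled pitch values (0 to 120 and 0 to 80) sit comfortably inside the longitude and latitude ranges. The small sketch below just spells that out in Python. The function name is hypothetical, and treating X as longitude and Y as latitude is an assumption about how the points are built. Tableau’s MAKEPOINT function takes latitude first and longitude second, which is why the tuple is returned in that order.

```python
from typing import Tuple

PITCH_LENGTH = 120  # maximum scaled X value
PITCH_WIDTH = 80    # maximum scaled Y value


def pitch_to_lat_lon(pitch_x: float, pitch_y: float) -> Tuple[float, float]:
    """Return (latitude, longitude) for a scaled pitch co-ordinate."""
    if not 0 <= pitch_x <= PITCH_LENGTH:
        raise ValueError(f"X should be between 0 and {PITCH_LENGTH}, got {pitch_x}")
    if not 0 <= pitch_y <= PITCH_WIDTH:
        raise ValueError(f"Y should be between 0 and {PITCH_WIDTH}, got {pitch_y}")
    # Y (0 to 80) fits inside the latitude range of -90 to 90,
    # X (0 to 120) fits inside the longitude range of -180 to 180.
    return pitch_y, pitch_x


if __name__ == "__main__":
    print(pitch_to_lat_lon(108, 40))
```

If your pitch template used a larger scale, for example a Y axis running past 90, the points would fall outside the valid latitude range, which is a good reason to stick with 120 by 80.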
To create this we need the following calculations:
Double Click MP Pitch, to set the longitude and latitude. Drag ID onto detail.
Explanation: The LOD for each shot requires these dragged onto detail. At the moment you will see the shots plotted on an actual map and it may be a bit confusing but stick with it!
We then need to recreate our background image similar to before.
Make sure to note that our X Field and Y field are now referring to the Longitude and Latitude generated instead of our previous X&Y mappings! This is super important to get the background to show.
If you turn the background map to None you will see the pitch reappear. We need the background map turned on, however, to add more map layers! So if you do turn it off to check it’s working, make sure to turn it back on, for example to Light.
Next I will add the filters we had previously for the Leicester – Southampton game. Theoretically we are now at the same stage as before, but are now in a position to add new layers.
Next I want to make some calculations for the titles and the score and add them as new map layers.
Explanation: Making a point at the respective co-ordinate, which we will fill with a text value.
Explanation: Same as above. My tip here would be to drag the team name onto detail, then edit the text either with your own text or using the field name as shown below. Amend sizing and colour to your liking.
MP Leicester Goals
Explanation: Making a point at the respective co-ordinate. A tip: when you edit the colour of the text, remove the halo to get rid of the grey background.
MP Southampton Goals
Explanation: Making a point at the respective co-ordinate, which we will fill with a text value.
Once you have added them all you can switch your map layer back off and it will show the final results.
And there we have it. There’s obviously a lot more that can be done with this in terms of making it aesthetically pleasing but hopefully this gives the basis to build your own shot maps.
If you fancy doing something different, why not take a single player’s shot data for the entire season?
Or create small multiples of each of the games for a team throughout the season? That would be quite nifty given the amendments in calculations needed to split the pitches out.
Here’s a dashboard I made from a match this season (2020/21) before writing the blog, that feels a little more ‘me’. Oddly enough, also a 9-0 result against Southampton. It can be viewed on my Tableau Public. I’ve appended the above tutorial to the workbook at the top of the page “Scraping Football Data Development”.
As always, if you have any questions please reach out to me on Twitter or LinkedIn. Super enjoyed the learning process with this one, so I am chuffed to be able to share it with you. Shout out to Mckay Johns, Sagnik Das, and James Smith, who all have awesome Python/Tableau content that inspired this blog.
Lastly, thanks to those that offered to do some testing on my tutorial, especially Mo Wootten and Alberto Oraá.