Hi all,
Today we are going to be revisiting some of the content I wrote last year based on expected goals in football, or soccer depending where you are in the world.
In this blog, I specifically look at a python package that helped me retrieve all the shots taken in a league for a specific year. To build a catalog of data for the whole season really was a dream.
What I’ve noticed over the years are alot of analysts will tend to make all their shot charts and summary dashboards for football using python or R. Something I have no problem with. However I want to give users the ability to chart as many of the charts also in Tableau.
One of these charts is showcasing the cumulative xG (Expected Goals) over the duration of a match. In short, a line graph! (lol)
But it may not be as simple as you first suspect.
This tutorial will talk through collecting the stats needed for a game and then how best to emulate the visuals seen on the Understat website.
Where to begin?
Well I had the pain of watching my team lose 3-1 to Arsenal. We can take a look at how that game played out here.
We can see over the course of the game West Ham had an early chance that they capitalised on. Since that goal, they really didn’t have many chances to score more.
Arsenal had a few chances in the first half, but started having more dominant goal scoring opportunities in the second half.
So how can we get this data?
Well, the code is attached at the top of the page. Usually I would use the python package, but i’ve amended what was originally Mckay Johns code to pull out the xG components we need using BeautifulSoup.
Once again, I can’t take any credit for Mckays code but would recommend it for those wanting to access some of the key components from the site. You’ll notice I added a few extra columns to the original code from here.
A few things I’d say is
- Make sure you install all the packages you need. A common error is not installing the lxml parser.
- If you add more attributes, make sure you clarify if its home or away and what list you want to append the value to.
- You can amend the match event to whatever match you are after, why not even try to loop through a teams for the whole season?
- Each game has its own unique match_id, the one we are looking at is https://understat.com/match/18362
A copy of the Arsenal – West Ham game can be found in the repository.
On to the visual.
So, you will see that the xG is cumulative across the match. Currently our dataset contains the xG of each event of the game, it will also have the minute of that event.
With a normal line graph it will connect all the points. We can add in a second axis with our marks and drag result onto shape to highlight the goals. When we calculate along the axis we have to make sure we don’t have the team box ticked.
But is this a reflection of what we see on Understat?
Not in it’s entirety no. This is because we are joining the points up of the marks we do have. On understat, the increase in xG is based on specific minute events.
This means we will need to create a minute per minute analysis. One method we could do is join in a new dataset of every single minute of the game.
That, or we can create bins!
Right click on minute and create a bin. Make the bin size 1 (i.e a bin for every minute of the game)
We can now drag our discrete minute bin onto columns instead of our minutes field.
This way, we only see spikes in the sum(xG) where the event happened.
Much better!
One final thing I’ve noticed is that when you go to create bins, it will do it based on your data. So in this case, it creates the bins from a minimum of 12, as that was when the first minute was.
Obviously, though the game started earlier than this, so to be a true representation of the data we want the 1 minute mark in the data, (and perhaps a 90 minute, or longer if added time)
One way to get around this is to add in two extra rows into the dataset.
Now when you go to create your bins, the minimum and maximum value will change.
This allows you to be able to plot the full game.
BUT, there is a way to avoid having to do this if we use a continuous minute bin!
After asking around a little, Val helped me with a solution that doesn’t need the additional rows of data. Below is the better option.
You can keep the minute bin as continuous but change the path to the second option on the path list.
What you will notice in the difference of having the minutes bin continuous is that it allows for more than one event to happen in the same minute. It also allows us the flexibility in being able to set our axis to zero without the need for amending the data.
I’ve left both sheets in to compare!
Feel free to download the workbook off Tableau Public from the top of the page if you get stuck.
Thanks for reading.
LOGGING OFF,
CJ