This blog aims to teach the basics of Python packages, inspecting HTML, and a few of the functions seen within the code. It also includes a bunch of resources for those just starting out. In fact, those who have followed me long enough on Twitter may remember that a few of us from the Tableau community started #100DaysOfCode together. This is the updated course by Angela if you fancy taking a look.
Python has been a grey area for me: one that I’ve started to learn by doing over the last year or so, and one that I know will serve me well if I keep developing these skills. So this tutorial is aimed at beginners, vaguely like me: those who have a basic understanding of Python functions but want to start their own passion project.
WHAT IS WEB SCRAPING?
Think of web scraping as scanning a site for information based on the pre-existing structure of the code that sits behind the website. Site structures can vary and change, so web scraping is not the most efficient long-term method of retrieving data. It’s worth checking whether an API already exists first. Due to copyright, some websites do not allow web scraping, so be sure to check beforehand!
Warning: Always check whether the website allows access to the data you are trying to retrieve. One way of checking is to look at the site’s robots.txt file. For example, look at Facebook’s robots.txt file here. In any case, I would be against individuals scraping data to then use commercially, and I would only use data that is in the public domain, for example where you don’t need login credentials.
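If you would rather check robots.txt from Python than read it by eye, the standard library has a parser for it. The sketch below is only an illustration: the robots.txt content and the example.com URLs are made up so that it runs offline, and the is_allowed helper is my own name, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URLs: one outside the disallowed folder, one inside it.
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/league-table"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

For a live site you would call set_url() with the address of the site's robots.txt and then read() on the parser instead of parse(). Bear in mind robots.txt is only one signal, so it is still worth reading the site's terms of use.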
SO WHERE TO START?
Step 1 is to find a passion project. I used to love playing field hockey in my younger years, so where better to start than the league table for the top division in England?
You can find the link here.
This is the table we want to scrape.
Positive things to note about the data:
- The headers and rows are organised structurally in a way that will be easy to interpret.
- There are no gaps in the data.
- The ‘form’ column (WWWLW) is written as text, not images, which helps with interpreting the result.
I won’t go into the finer details, but when you click around on a site, your browser sends an HTTP request to the server, which returns a response message. This message is in HTML format, which the browser then renders and displays on screen. What we want to do is retrieve this response message and save it down. We then go through the response with our Python code to find the information we need.
We will cover some of these concepts shortly. For now, right-click on the table and click Inspect.
WHAT IS A BEAUTIFUL SOUP?
“Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.” – Quote from Analytics Mag.
Okay, that kind of makes sense… right? We take the web page code, dig through the different levels, and pull out the data we need. For example, we will want to look within the overall table on the page, find the column headers of the table, then look within each row and find each associated value. The Beautiful Soup package takes the HTML code and gives us functions that allow us to dig through it.
HOW DO WE KNOW WHAT INFORMATION WE NEED?
HTML is broken into various nested sections. Try to familiarise yourself with the inspected page. This hierarchy becomes increasingly important when we write the Python code.
You will see a bunch of code pop up when we inspect the page. Try hovering over the different elements to see which part refers to which section of the website.
As a general overview, you will see the table within the HTML code here is listed as #ehLeagueTable, where the # means ehLeagueTable is the id of the table element.
Within the header section, under <thead>, are the different column names we will want to refer to.
In the screenshot above, you will see two collapsed arrows next to <tr> tags.
These are collapsed trees, one for each table row, or in this case rows of data, which refer to Wimbledon’s and Surbiton’s results.
Within the body are the various values, in the same order as the columns. For example, following these two trees is:
<td>Old Georgians M1</td>
If you expand the tree out to find the td value, you can see that the ‘1’ refers to the one loss Old Georgians M1 have had.
Within each table tag, there are TR, TH and TD tags.
TR is the table row tag, which wraps each row of the table.
TH is the table heading tag, which holds the headers, or the column names.
TD is the table data tag, which holds the individual cell values, the granular detail.
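To make the TR, TH and TD tags a bit more concrete, here is a small sketch of how they map onto Beautiful Soup calls. The HTML is a cut-down table I have made up for illustration: only the ehLeagueTable id and the Old Georgians M1 name come from the real page, while the columns and numbers are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down league table; the real page has more columns,
# and the numbers here are invented for the example.
SAMPLE_HTML = """
<table id="ehLeagueTable">
  <thead>
    <tr><th>Team</th><th>Won</th><th>Lost</th></tr>
  </thead>
  <tbody>
    <tr><td>Old Georgians M1</td><td>9</td><td>1</td></tr>
    <tr><td>Wimbledon M1</td><td>8</td><td>2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The id seen as #ehLeagueTable in the inspector identifies the table.
table = soup.find("table", id="ehLeagueTable")

# TH tags inside <thead> hold the column names.
headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# Each TR inside <tbody> is one row; each TD inside it is one cell value.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

print(headers)
print(rows)
```

Note that every value comes back as text, including the numbers, which is worth remembering when you later want to sort or sum them.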
If you’d like to learn more about HTML, a great starting point is a coding tutorial website by W3Schools. I would recommend reading about HTML Tables here.
HOW DOES THE CODE WORK?
So, the fun bit. Feel free to download the code, here.
The important thing to note with Beautiful Soup is that you need the URL of the page. You then send a request to that page and parse the HTML that comes back. The data is in the text content of the response, which is req.text, and is the HTML. We can pass this to BeautifulSoup along with Python’s built-in html.parser to parse it, saving us a lot of time when web scraping in Python.
We then save that parsed response down as the table body.
Within the table body we search for the specific element that holds the table, the same one you expanded in the inspector.
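The request-then-parse steps above can be sketched roughly as follows. This is not the exact code from the repo. I am assuming the requests package for the HTTP call (the req.text mention suggests it, and it would need its own pip install requests), and fetch_html and parse_html are helper names of my own. The demo at the bottom parses a tiny made-up page so it runs without hitting a real website.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Request the page and return the HTML held in the response body."""
    req = requests.get(url, timeout=10)
    req.raise_for_status()  # stop early if the server returned a 4xx/5xx error
    return req.text


def parse_html(html: str) -> BeautifulSoup:
    """Turn raw HTML into a searchable Beautiful Soup parse tree."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration with a tiny hypothetical page instead of a live request.
soup = parse_html("<html><body><table id='ehLeagueTable'></table></body></html>")

# Search the parsed page for the table we care about, using its id.
table = soup.find("table", id="ehLeagueTable")
print(table)
```

With a real page you would pass the league table URL to fetch_html and hand the result to parse_html. Keeping the two steps separate makes it easy to save the HTML once and experiment on it without requesting the page over and over.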
If we want to extract a single tag, we can use the find method, which returns a single Tag object (the first match), or None if nothing matches.
If we want every instance of a tag on the page, we can instead use the find_all method, which returns a list of all the matches.
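Here is a quick side-by-side of find and find_all on a made-up two-row table (the HTML is illustrative only), so you can see what each one hands back.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table used only to compare the two methods.
html = (
    "<table>"
    "<tr><td>Old Georgians M1</td></tr>"
    "<tr><td>Wimbledon M1</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag.
first_cell = soup.find("td")
first_text = first_cell.get_text()

# find_all returns every matching tag as a list-like result.
all_cells = soup.find_all("td")
all_text = [cell.get_text() for cell in all_cells]

# When nothing matches, find gives None and find_all gives an empty list.
missing = soup.find("th")
missing_all = soup.find_all("th")

print(first_text)
print(all_text)
print(missing, len(missing_all))
```

The None case is the one that tends to catch beginners out: calling a method on the result of a find that matched nothing raises an AttributeError, so it is worth printing the result before digging further.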
We then loop through each row, take the value of each cell in that row, and save it to a pandas DataFrame, which becomes our table.
We then export this information to a CSV file.
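As a rough sketch of the DataFrame and export steps, assuming you already have the headers and rows as Python lists (the values below are invented, not the real league table). I write the CSV to an in-memory buffer here so the example leaves no file behind.

```python
import io

import pandas as pd

# Hypothetical scraped values; scraping gives us everything as strings.
headers = ["Team", "Won", "Lost"]
rows = [
    ["Old Georgians M1", "9", "1"],
    ["Wimbledon M1", "8", "2"],
]

# Build the table: one list per row, with the headers as column names.
df = pd.DataFrame(rows, columns=headers)

# Convert the numeric columns from text to numbers so they sort and sum properly.
df[["Won", "Lost"]] = df[["Won", "Lost"]].apply(pd.to_numeric)

# Export as CSV. index=False leaves out the DataFrame's row numbers.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(df)
print(csv_text)
```

To save to disk instead, pass a file name to to_csv, for example df.to_csv("league_table.csv", index=False), where league_table.csv is just a placeholder name.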
WHAT SHOULD I BE AWARE OF?
It can be fairly difficult to navigate through the hierarchy of trees. As a beginner, I found printing everything as I went and watching the errors was the best way for me to learn. I hope this rings true with others. You will find Priyanka gives a wonderful explanation of the tags and attributes in a recent HerData guest blog.
Depending on your table structure you may have to lean on your python experience a little more. Check out this previous guest collaboration with Anmol where we looked to create a football event timeline.
WHAT IF I WANT TO RUN THE CODE MYSELF?
As another reminder, the code can be downloaded from the repo here.
I tend to run my code in PyCharm but understand others may have their own preference. If you’re fairly new to Python, fear not. The code should run when you open a new project, but you will first have to install the packages used from the terminal:
pip install beautifulsoup4
pip install pandas
TIPS FOR BEGINNERS
- Google Google Google. It is your best friend. In most cases someone will have done something similar.
- Use prettify.
This command will help print the HTML in the console in a readable format, which you can then navigate through.
- Don’t be disheartened if your code is creating errors. If in doubt, print it out. I still write my code fairly sequentially compared to the correct way of building functions and classes that are referenced. Everyone is on their own journey. Don’t be afraid to break stuff!
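For the prettify tip above, this is roughly what it looks like in practice. The one-line HTML string is made up for the example. On a real page you would call prettify on the soup, or on any tag you have found.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML squashed onto one line, which is hard to read as-is.
html = "<table><tr><td>Old Georgians M1</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# prettify returns the HTML as a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```

Printing the prettified version of just the tag you are working on, rather than the whole page, keeps the console output short enough to read.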
I found the following sites useful as a starting basis:
- Try web-scraping a different table, such as the women’s league.
- Try scraping a table from a different sport.
- Try saving the file in a different format.
- Try building the table within Tableau.