Hi all,
I hope everyone has settled into the year well, as we come up to being 1/12 of the way through 2023! Today it is an absolute pleasure to have Jessica Moon join the site to teach us some tricks for web scraping with Selenium, and then how to visually display the Twitter Circles output in Tableau.
Like with all tutorials, the resources can be found in the GitHub repository under the title – and the Tableau Public link points to Jessica’s dashboard. You can follow Jessica for more updates on Twitter here.
Over to Jessica for the run through.
When TwitterCircle.com hit the scene, Twitter was ablaze with Circle photos. I enjoyed looking at the #datafam circle posts and would instinctively check whether a circle with a dark background and a woman with dark hair was among those featured. I ran my own to see who I was interacting with the most. But then I wondered: was the engagement mutual? The website let you search for yourself or any other user without a login. This was the opportunity to write another Python program!
While I prefer to use Beautiful Soup for web scraping, it has limitations: it can’t click elements of a webpage, and if the page uses a lot of JavaScript, the soup isn’t going to return much of use. These are the situations where I use Selenium. Selenium can not only scrape the rendered HTML of a page, but it can also enter text, click elements, and so on, just like a user interacting with the site.
Before we discuss specifics, web scraping can be a real grey area. Make sure to check out each site’s terms of use and /robots.txt file. In the program I wrote, I made sure that, between waits and sleeps, I would not be bogging down the site: each user I ran (45 in total) took approximately 35 seconds before the next user was run.
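As a rough illustration of those two courtesies, here is a minimal sketch using only Python’s standard library. It is not the code from the GitHub repository: the robots.txt rules are made-up sample text (not TwitterCircle’s actual file), and the names `allowed_by_robots` and `scrape_politely` are hypothetical helpers invented for this example.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def scrape_politely(usernames, scrape_one, sleep_time=5, sleep=time.sleep):
    """Run scrape_one for each username, pausing between requests.

    sleep_time is an assumed value; raise it if the site or your
    connection is slow. The sleep argument exists so the pause can be
    swapped out when trying the function without waiting.
    """
    results = {}
    for index, name in enumerate(usernames):
        if index > 0:
            sleep(sleep_time)  # wait before every request after the first
        results[name] = scrape_one(name)
    return results
```

To check a live site you would download its /robots.txt first and pass the text in; the terms of use still need a human read.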
If you’re new to Selenium, I recommend checking out these 3 videos (video1, video2, video3) to learn how to install the package + driver, and understand what all Selenium can do in regards to web scraping. I honestly found these while preparing this blog, and I discovered more elegant ways to slow down the program so elements could render than the ones I had coded. But, y’all, bad code can work too. If you’re just writing a program to get the data you need and not to put food on the table, don’t be afraid of some bad, ugly, not optimally efficient code. It can get the job done. You might notice that my code uses a Firefox driver whereas these videos use Chrome. I’m pretty sure I found out about Selenium on Stack Overflow, and the example code I started from used Firefox as the driver, so that’s what I’ve been rolling with.
The program I wrote does the following:
- Opens TwitterCircle.com
- Enters my username
- Clicks generate
- Scrapes the ranks and usernames of circle 1+circle 2+circle 3
- Then it does the same cycle for the list of usernames to get their circles
- Looks to see if the circle username is my username or in my circle list.
- Two dataframes are created: one that combines the ranks, usernames, and circle numbers of my Twitter circle (45 rows) and one that gets similar data for each person in my circle for their respective circles (2025 rows, assuming everyone is engaging with at least 45 accounts).
- The dataframes are merged (like a SQL join) on My Circle and user, so the resulting set is 2025 rows.
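The last three steps can be sketched with pandas on a tiny made-up dataset. This is an illustration under assumptions, not the code from the repository: the usernames and the Their Rank column are invented, and the other column names are borrowed from the fields used in the Tableau calcs further down.

```python
import pandas as pd

my_username = "me"  # placeholder; the real program uses your own handle

# My circle: one row per person I interact with (45 rows in the real run).
my_circle = pd.DataFrame({
    "My Circle": ["alice", "bob"],
    "My Rank": [1, 2],
    "Circle Ring": [1, 1],
})

# Their circles: one row per (person in my circle, person in their circle).
their_circles = pd.DataFrame({
    "user": ["alice", "alice", "bob", "bob"],
    "Circle Person": ["me", "carol", "alice", "dave"],
    "Their Rank": [1, 2, 1, 2],
})

# Flag whether each person in their circle is me or someone in my circle.
known = set(my_circle["My Circle"]) | {my_username}
their_circles["Common"] = (
    their_circles["Circle Person"].isin(known).map({True: "yes", False: "no"})
)

# Like a SQL left join: every row of their_circles picks up my rank and ring.
merged = my_circle.merge(
    their_circles, left_on="My Circle", right_on="user", how="left"
)
```

Because each person in my circle matches all of their own rows, the merged frame has as many rows as `their_circles`, which is why 45 people with 45 connections each give 2025 rows.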
I chose to find elements by XPath to enter the text and click the button because an XPath is very specific to one element, whereas something like a class name will find the first instance, which might not be the element you intended. How do you find an element’s XPath?
Open the page of interest and put your cursor on the item you want your program to interact with. Right-click and choose “Inspect”. In the code that appears to the right, right-click the highlighted HTML for the element and choose Copy -> XPath.
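Once you have the XPaths, the enter-text-and-click part might look roughly like the sketch below. Treat it as a sketch under assumptions rather than the code from the repository: the two XPaths are placeholders you would replace with the ones you copied, the https:// address is assumed, and `generate_circle` is a hypothetical function name. It needs Selenium, Firefox, and the Firefox driver installed.

```python
import time

CIRCLE_URL = "https://twittercircle.com"  # assumed address of the site

# Placeholder XPaths: swap in the ones copied from the Inspect panel.
USERNAME_XPATH = "/html/body/div/main/form/input"
GENERATE_XPATH = "/html/body/div/main/form/button"


def generate_circle(username, sleep_time=10):
    """Open the site, enter a username, click generate, return the page HTML."""
    # Imported here so this file can be loaded even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(CIRCLE_URL)
        # Wait up to 20 seconds for the username box to exist, then type in it.
        box = WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.XPATH, USERNAME_XPATH))
        )
        box.send_keys(username)
        driver.find_element(By.XPATH, GENERATE_XPATH).click()
        time.sleep(sleep_time)  # crude pause while the circle renders
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even after an error


# Example call (opens a real browser window):
# html = generate_circle("my_username")
```

The explicit wait is one of the more elegant ways to give elements time to render; the plain sleep afterwards is the bad-but-working kind.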
The code can be found on GitHub. (Circle Scrape)
Things to be aware of:
- You can copy and paste it into your preferred IDE after you’ve installed Selenium + the driver.
- Make the proper adjustments, like updating my_username and possibly increasing sleep_time if your internet lags.
- The CSV file will be saved to your IDE’s working directory; mine is C:\Users\jessi.
- The code may take about half an hour to complete, since it has to loop through every person in your circle (45 users at roughly 35 seconds each is about 26 minutes).
An example of Jessica’s output can be found on GitHub.
Tableau
On the Tableau side, I decided to take a self-centered approach in my viz, attempting to mimic the spacing in the circle image generated. I’ll run through the main calcs that you can use with the data generated from the program.
With radials you need to:
- Figure out the angle interval that evenly divides a circle (or semi-circle)
- Figure out the angle that each item will be placed at.
- Calculate the X coordinate using the angle and radius
- Calculate the Y coordinate using the angle and radius
If you’re ever wanting to do semi-circles—I recommend copying down the Nick Saban viz. That’s usually what I’m looking at each time I do radials anyways.
Parameters:
Inner Circle (Float, default .7)
–this parameter controls the space in the middle for your profile picture or whatever.
Coordinate zhuzh (Float, default 1.1)
–this parameter boosts or minimizes the radius and controls how tightly the pies surround the inner circle.
Calcs you’ll need:
Origin
//center point for you to be in the middle
MAKEPOINT(0,0)
Starting Angle
/* I usually hard-code or use a single parameter, but the three values kind of mimic TwitterCircle’s placement and help it look a little uneven, which I like here */
CASE MIN([Circle Ring])
WHEN 1 THEN -50
WHEN 2 THEN -100
ELSE -180
END
Adjusted Rank
/*take the actual rank and subtract the minimum rank for the ring (1, 2, or 3) so lowest ranked user is adjusted to 0 in each ring and other ranks are adjusted accordingly*/
MIN([My Rank])-MIN({ FIXED [Circle Ring]:MIN([My Rank])})
Angle Part 1
//determine what angle evenly spaces points in each circle
360/MIN({FIXED [Circle Ring]:COUNTD([My Circle])})
Angle Part 2
/*Adjust the angle to start at custom degrees (see Starting Angle) and calculate the exact angle for the point observed*/
[Starting Angle]-([Adjusted Rank]*[Angle Part 1])
X
/*Determine x coordinate using Angle Part 2, adjusting for the inner circle and coordinate zhuzh.*/
COS(RADIANS([Angle Part 2]))*([Inner Circle]+(MIN([Circle Ring])*[Coordinate zhuzh]))
Y
/*Determine y coordinate using Angle Part 2, adjusting for the inner circle and coordinate zhuzh.*/
SIN(RADIANS([Angle Part 2]))*([Inner Circle]+(MIN([Circle Ring])*[Coordinate zhuzh]))
Circles
MAKEPOINT([X],[Y])
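If you want to sanity-check the numbers outside Tableau, the calcs above can be mirrored in a few lines of Python. This is only a rough sketch: the function names are made up for the example, and the ring of 8 people is a hypothetical size, not the real count in any TwitterCircle ring.

```python
import math


def starting_angle(ring):
    """Mirror of Starting Angle: -50 for ring 1, -100 for ring 2, else -180."""
    return {1: -50, 2: -100}.get(ring, -180)


def radial_xy(ring, adjusted_rank, ring_size, inner_circle=0.7, zhuzh=1.1):
    """Mirror of Angle Part 1, Angle Part 2, X and Y for one person."""
    step = 360 / ring_size                              # Angle Part 1
    angle = starting_angle(ring) - adjusted_rank * step  # Angle Part 2
    radius = inner_circle + ring * zhuzh                 # distance from Origin
    theta = math.radians(angle)
    return radius * math.cos(theta), radius * math.sin(theta)


# Hypothetical ring 1 with 8 people, adjusted ranks 0 through 7.
points = [radial_xy(1, rank, 8) for rank in range(8)]
```

Every point in a ring sits at the same distance from the origin (Inner Circle plus ring number times Coordinate zhuzh), and the subtraction in Angle Part 2 means higher ranks step clockwise from the starting angle.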
Here’s the build in action:
So remember:
- My Circle has to be on Detail for layers with people in your circle so the points will disperse radially.
- Top Circles mark is Pie
- Common goes on color with yes sorted to the top.
- COUNTD(Circle Person) goes on Angle.
Check out the finished product here! If you have any questions, send me a message on Twitter!
Round-Up:
I totally resonate with how eloquently Jessica put it: “Bad code can work too.” If you’re just writing a program to get the data you need, and not to put food on the table, then if it runs, it runs. Unless you’re bringing down the server and costing your business a fortune, ha.
Anyway, I want to thank Jessica once again for putting together such a fantastic piece of code, blog and accompanying visual. It’s not often we get to see a full E2E, and I hope this may prompt others to visualise their own Twitter Circles down the line or apply the logic of the code to their own web scraping projects.
LOGGING OFF,
CJ