A few months ago, I was trying to decide what units (classes) to pick for my economics major. Having satisfied all my core units, I had to pick four units from a list of about a dozen to complete my degree. As any conscientious student would do, I went on the university website, read through the description for each unit and tried to deduce which ones would be the most useful or the most interesting. But then I had a better idea.
Everyone knows that a good unit can be ruined by a terrible teacher or an awful curriculum. I didn’t want to get stuck in that trap; there had to be a better way. As it just so happens, at the end of each semester, every student gets an email from the university imploring them to complete a survey and give feedback on each unit they did. They give you a series of statements that you respond to on a scale from strongly disagree to strongly agree. These statements include:
All this data is then published in a dark, dusty corner of the university website, accessible to all students, for every unit all the way back to 2005. There aren’t many things that I trust more than my own intuition, but cold, hard data is one of them.
So I got my spreadsheet out and labored away for around 30 minutes. I went through each unit’s survey data, plugged in some numbers and got an average score for each unit based on its results for the past couple of years. And I had my answers. Turns out that all the units that I’d decided were interesting based on the unit descriptions were the most poorly rated overall — I was attracted to the units that sounded a little bit different, a little eccentric — but the best rated ones were the more traditional and monotonous sounding ones: Monetary Economics, Labor Economics, Public Finance. Urgh. Goodbye Economics of Developing Countries and Game Theory. I went back on my university’s portal and changed my units to the more unimaginative ones… but data is never wrong.
This is when the gears in my head started turning. All this data is sitting there on the university’s website, a goldmine of information on the performance of every unit taught at the university, left there untouched, unused, unfettered. No one even knows it exists. Most interesting of all was that the survey results for these units are far from uniform! “Overall satisfaction for this unit” has scores that range from 2 / 5 to 4.5 / 5 depending on the class. Certain units have consistently low ratings. The university uses these surveys to make decisions, but they are buried so deep in a far corner of the university’s website, on a page that looks like it’s from the dark ages, that no student except maybe some nerd (me) would ever get any useful information from it.
The survey also asks you to rate your teachers’ and lecturers’ performance, but those answers are only available to select faculty members… bummer. But I started thinking. If only I could find which lecturer was teaching each unit in each specific semester, I could cross-reference that with the survey results and deduce the lecturer’s performance. The unit guide! Every unit at the university has a ‘unit guide’: it gives you the unit synopsis, the learning outcomes, the assessment tasks AND… the lecturer teaching the unit. On top of that, the unit handbook —our university’s online directory that indexes every unit— links directly to the unit guides. The more I thought about it, the more I was convinced it could work.
The idea was to design a simple website where you could search for any unit or lecturer. It would show a score for each unit based on the survey data, and a score for each lecturer based on the scores of the units they’d taught in the past. I have a fair bit of front-end development experience, so building something quick and dirty in React shouldn’t be too hard. I’d then need an API to query my database and get all the information I need for the front-end. Back-end work is not my forte, so that would require a bit of upskilling, but I could get it done. I’d then need to deploy all of that on a server, come up with a catchy domain name, post it on the private university students’ Facebook group and let it worm its way onto every student’s screen (yes, I was getting quite wrapped up in the whole idea at that point).
So I got to work. First, building the data-set. I need a couple of things. Each survey represents a unit taught during a semester at a campus (some units are taught at multiple campuses simultaneously), so my dataset needs to include unit, semester and campus. Then, I need to use the unit guide to find which lecturer was teaching that unit in that semester on that campus. Let’s get started.
I need to figure out how I’m going to structure the data so that I can start thinking about how I’m going to gather it. I get out Excel and start drawing up my schema.
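Roughly, each row needs to pin down one offering of a unit and, eventually, the person teaching it. As a sketch of what I had in mind (the field names here are my own, reconstructed from the description above, not the exact spreadsheet columns):

```python
from dataclasses import dataclass

# One row per offering: a unit taught in a given semester, on a given campus.
@dataclass
class UnitOffering:
    unit_code: str             # e.g. "ECC2010" (hypothetical code)
    unit_name: str
    year: int
    semester: str              # "Semester 1" / "Semester 2"
    campus: str
    unit_guide_url: str        # link from the handbook, used later to find the lecturer
    lecturer: str = ""         # filled in from the unit guide
    satisfaction: float = 0.0  # mean "overall satisfaction" score from the surveys
```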
Unfortunately, none of these university websites have an “export to csv” button, so I’m going to have to make my own. All of this stuff will have to be scraped out of the websites’ HTML; Python will be my best friend. I’ve never really done any web scraping before, but I’ve heard the term Beautiful Soup thrown around, so I’ll check that library out. I read through the documentation; it’s written concisely and it’s got all the functionality I need. Perfect.
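If you haven’t used it, Beautiful Soup takes raw HTML and gives you a tree you can search by tag and attribute. A minimal smoke test of the sort of thing I’d be doing (the URL and tags here are placeholders, not the real handbook):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- the real pages live on the university's handbook site.
html = requests.get("https://example.com/handbook/units/ABC1234").text
soup = BeautifulSoup(html, "html.parser")

# Grab whatever the page uses as its title, plus every link on the page.
title = soup.find("h1").get_text(strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]
print(title, len(links))
```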
First I need to scrape the unit handbook page for each unit (that index I was talking about). Each year a new handbook is released and it has a page for each unit outlining its name, the campuses where it is taught and in which semester; but most importantly it links to all the unit guides for that unit (there's one unit guide per semester). We have access to the archives for all the handbooks going all the way back to 1998, so getting past data is not going to be a problem.
So I start putting together my first web scraper. I grab all the data points I need: unit code, unit name, campus, semester, and the unit guide link for that semester. I test it out on a few pages… it works. I can loop through all the pages in the handbook, put all the data together and dump it into a csv. There’s only one problem…
requesting page 1…
(1 second wait)
parsing..
requesting page 2…
(1 second wait)
parsing…
Each request to the handbook takes about one second, and since I’m going back three years, there’s around 15,000 pages I have to go through. That’s over four hours of requests. I could just leave it on overnight, but nobody’s got that kind of time. Plus, I don’t want to run it for four hours only to realize there’s an edge case I didn’t think about that stuffs up my whole data-set (and there are always edge cases you don’t think about). I’ve got to make multiple requests at once.
Thankfully, asyncio, an event-loop and coroutine based concurrency framework for Python, makes this pretty easy. I set up an event loop with 100 workers and await my requests in asynchronous coroutines. It turns out to be a lot easier than I’d imagined. I wrap it all up in a little batch_requests() utility function and test it out.
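A sketch of what batch_requests() looked like. I’m using aiohttp here as the async HTTP client and a semaphore to cap the number of in-flight requests at 100; treat the details as approximate rather than my exact code.

```python
import asyncio
import aiohttp  # assumption: some async-friendly HTTP client alongside asyncio

async def fetch(session, url, sem):
    # The semaphore caps how many requests are in flight at once (the "100 workers").
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def batch_requests(urls, workers=100):
    sem = asyncio.Semaphore(workers)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url, sem) for url in urls))

# pages = asyncio.run(batch_requests(handbook_urls))  # handbook_urls built elsewhere
```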
requesting pages 1…100
parsing
(10 second wait)
requesting pages 101…200
parsing
(10 second wait)
Much better! Now it’s the parsing that takes up most of the time, since it still has to go through each batch of 100 pages one by one. The whole thing now takes 25 minutes, around a tenfold decrease, but it still kind of makes me cringe. Making more requests at a time wouldn’t change much at this point since the parsing is what really slows the program down, so if I want to speed things up even more I’ll have to find a way to accelerate the parsing.
It turns out the multiprocessing library in Python is super user-friendly. You create a multiprocessing.Pool(processes) object, telling it how many processes you want, and then you can call pool.map(function, iterable) to run whatever function you want across the process pool. It works just like the built-in map(), except it does it in parallel. I gave it my parsing function and the list of all the responses I got from batch_requests() and let it do its magic.
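Wired into the earlier sketch, it looks something like this (parse_page is a cut-down stand-in for my real parsing function, and handbook_urls for the list of pages):

```python
import asyncio
from multiprocessing import Pool
from bs4 import BeautifulSoup

def parse_page(html):
    # Stand-in for the real parser, which pulls out unit code, name, campus,
    # semester and the unit guide link from one handbook page.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("h1").get_text(strip=True)

if __name__ == "__main__":
    pages = asyncio.run(batch_requests(handbook_urls))  # from the earlier sketch
    with Pool(processes=4) as pool:  # pool.map works like map(), but in parallel
        rows = pool.map(parse_page, pages)
```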
requesting pages 1…100
parsing
(5 second wait)
requesting pages 101…200
parsing
(5 second wait)
Half the time! But wait… I’m pretty sure my CPU has four cores? Shouldn’t I get a four-factor performance increase? Well, unbeknownst to me, it turns out CPUs use something called hyper-threading: each physical core presents itself as two logical (think virtual) cores. My Mac reports four logical cores, but it’s only got two physical ones. I didn’t look too deep into the details, but hyper-threading apparently helps with some workloads; for CPU-bound parsing like mine, it’s the two physical cores that count. That’s why I’m only getting a two-factor performance increase. Oh well, good enough.
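You can check this yourself: os.cpu_count() reports logical cores, while a third-party library like psutil will tell you how many physical cores sit underneath (not something the scraper needed, just a sanity check).

```python
import os
import psutil  # third-party: pip install psutil

print(os.cpu_count())                   # 4 logical cores on my Mac
print(psutil.cpu_count(logical=False))  # 2 physical cores
```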
So now the data I have looks like this:
Now all I’ve got to do is scrape the lecturers from the unit guides and scrape the survey data.
So immediately I jump onto one of the unit guides to see what I’m dealing with. Ahh okay, so to get to the unit guide I need to authenticate with my university account. I thought about automating the whole authentication flow in my program — which admittedly would have been really cool — but the more I looked into it, the more I realized that there was a lot of JavaScript rendering involved in the whole process and I didn’t need that kind of functionality. I type in a URL for a unit guide, open up my browser’s dev tools, log in a couple of times and figure out the two cookies used for authentication. So I jabber away at the keyboard and write a new script to go through the unit guide links from the original data, asking me for new cookies whenever authentication fails.
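The gist of the script, give or take. The cookie names come straight from the dev tools; how the site signals an expired session is a guess on my part (I just checked whether I’d been bounced away from the guide), and unit_guide_urls stands in for the links scraped earlier.

```python
import requests

def ask_for_cookies():
    return {
        "JSESSIONID": input("JSESSIONID: ").strip(),
        "AWSELB": input("AWSELB: ").strip(),
    }

def get_unit_guide(url, cookies):
    resp = requests.get(url, cookies=cookies)
    # Guess: an expired session redirects to the login page instead of the guide.
    if resp.status_code in (401, 403) or "login" in resp.url.lower():
        raise PermissionError("authentication failed")
    return resp.text

cookies = ask_for_cookies()
for url in unit_guide_urls:          # links collected from the handbook earlier
    try:
        html = get_unit_guide(url, cookies)
    except PermissionError:
        cookies = ask_for_cookies()  # prompt for fresh cookies and retry
        html = get_unit_guide(url, cookies)
    print(html[:200])
```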
$ python unit_guide_scraper.py
JSESSIONID: nijrNPEKQWWtmgFoOdUQAPjGMlbZq7U1
AWSELB: FD9E75345F81579D
It works! A tsunami of HTML floods my console, serotonin levels are peaking. Now that I know the requests are working it’s time to scrape the data. The unit guide page looks roughly like this:
Each semester has one unit guide, so if a unit is being taught on multiple campuses, the unit guide lists each one and then gives you the lecturer for that campus. I’ve already got all the unit guide links, so my script can just go through each of them and find the spot in the unit guide that says Campus Lecturer(s). If the unit is being taught on multiple campuses, I can look for a header that indicates the campus I want; then it’s a matter of appending the lecturer to the existing entry and dumping it all into a new csv. Easy peasy. Right?
So I pull up a unit guide, inspect the HTML and start writing my scraping function. Lecturer header is a <h2>, yep cool, campus header, yep that’s a <h4>. Looks good. Alright. Let’s run it.

Okay, I’m getting a lot of errors, interesting. I modify my code slightly so that any time I get an error in my scraping function, it writes the link for the unit guide to an errors.log file along with the exception that occurred. I start going through them one by one to figure out what’s going on.
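The error-logging tweak itself was nothing fancy, something along these lines (parse_unit_guide and unit_guide_urls stand in for my real scraping function and link list; get_unit_guide is from the earlier sketch):

```python
import traceback

with open("errors.log", "a") as log:
    for url in unit_guide_urls:
        try:
            row = parse_unit_guide(get_unit_guide(url, cookies))
        except Exception:
            # Record which unit guide failed and what went wrong, then move on.
            log.write(f"{url}\n{traceback.format_exc()}\n\n")
```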
Ahhh, I see what’s happening in this one. They used a <h3> instead of a <h2> for the Campus Lecturer(s) header! My scraper was looking for a <h2>, so that’s probably why all of these are failing. Oh okay, this unit guide is a little bit different. The unit is being taught at three different campuses, but they’ve omitted the campus header; instead they’ve just got a little subtitle below the lecturer saying which campus they belong to. Wait, this unit is being taught at two different campuses but the unit guide has no indication of campus at all! No way to know which lecturer is responsible for which campus. And this one is only being taught on one campus but there’s multiple lecturers listed!

I realise there are some complexities that I hadn’t factored in. I got out my notebook and started jotting down all the different types of unit guides I might encounter, all the while thinking about how I would handle each case: one campus and one lecturer, one campus and multiple lecturers, uses a <h2>, uses a <h3>, uses a little subtitle, doesn’t indicate campus at all, etc. After going through each unit guide with a fine-tooth comb I run through my notes. Thirty-two different possible scenarios. How disheartening 😳😩😭

For each of these scenarios, my beloved Python script would have to detect which case it belongs to and apply a different parsing function based on that unit guide’s format. That’s a lot of work to do for something that might just get me sued by the university.
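Concretely, that would have meant something like a dispatch table: a detector and a dedicated parser for every layout. A sketch covering just two of the thirty-two (the real parsers would each need to handle campuses and multiple lecturers too):

```python
def parse_h2_layout(soup):
    # Layout where the header is an <h2>; grab the text that follows it (simplified).
    return soup.find("h2", string="Campus Lecturer(s)").find_next("p").get_text(strip=True)

def parse_h3_layout(soup):
    # Same idea, but the header is an <h3>.
    return soup.find("h3", string="Campus Lecturer(s)").find_next("p").get_text(strip=True)

# One (detector, parser) pair per layout... thirty more of these to write.
LAYOUTS = [
    (lambda s: s.find("h2", string="Campus Lecturer(s)"), parse_h2_layout),
    (lambda s: s.find("h3", string="Campus Lecturer(s)"), parse_h3_layout),
]

def parse_lecturers(soup):
    for detect, parse in LAYOUTS:
        if detect(soup):
            return parse(soup)
    raise ValueError("unrecognised unit guide layout")
```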
So I kept thinking about it for a couple of days, but eventually, I gave up. The university, by using different formatting for all of their unit guides, had inadvertently crushed my plans of holding teachers accountable for the education they provide. I was a bit frustrated at not reaching my goal, but with any project carrying this much uncertainty, running into a roadblock like this one was always on the cards. After all, when you design a system from the ground up, you control a lot of the variables; my project depended on a lot of external factors that were out of my control. That said, I should have tried scraping the unit guides before anything else. Since they’re essentially written by the teachers themselves, then submitted to the university and converted to HTML (I imagine), they were always likely to be inconsistent and hard to scrape. But having read many unit guides in the past, I’d assumed “they all look kind of the same, they’ll be easy to parse.”
Nonetheless, I learnt many useful and/or interesting things working on this project:
Programming is 10% coding and 90% fixing bugs. That’s how we learn. Every time we scratch our head trying to figure out why the !&$# this isn’t working, we add an entry to the long list of mistakes that we hopefully won’t make again. I wanted to fix this bug in the education system, and I was close. Next time, I’ll read the unit guides first.
If you’re still reading, there are a couple of things I’d like to clear up. I did consider the ethical implications of individually rating lecturers and the effect it might have on them were this website to really take off. But I also think that education is not something to be messed with. When my hairdresser gives me a bad haircut, it kind of sucks, but hey, I’ll live. And yet I can probably find his star rating somewhere on Yelp. When you have a bad teacher, though, that’s your career that’s being messed with. Your future. I think that when students choose their units, they have a right to know if a unit or its lecturer has been poorly rated in the past.