The shutdowns brought an opportunity for my daughter to participate in virtual scouting events all over the United States. When the event registration form changed, I took the chance to try out some new web scraping skills while inspiring my daughter about the power of code for everyday tasks.
My daughter is typically very involved in her local Girl Scout council. She has been a camp counselor and leader of events for younger girls. She goes the extra mile to earn badges and patches that expose her to new skills and activities. She is currently working on the highest award a Girl Scout can earn, the Gold Award.
With the COVID shutdown, many of her local opportunities dried up, so she took the chance to explore activities at other councils offering virtual sessions. As just a sample: She learned to compose electronic music using JavaScript with a Florida council. During election season, she did a Rock the Vote workshop with a council in Washington State. And she earned her last two cybersecurity badges from some delightful leaders in Iowa. There's a lot of talent being shared out there in the councils.
Recently, our council changed the registration form, so I can no longer search for activities outside a 90-mile radius. When I asked the help desk, they advised entering different ZIP codes. In the Girl Scout spirit of "using resources wisely" -- where the resource is my time -- I knew there was a code-based solution that would work better than manually entering a lot of ZIP codes.
I also had it in the back of my mind to show my daughter an example of web scraping, since she has thought about using these techniques to create some community resources for her Gold Award... taking the chance to build up that STEM pillar.
First, I inspected the form fields using the browser DOM inspector and found that most of our favorite councils used the same object names.
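Here's a minimal sketch of that same check done programmatically, using the Selenium and Beautiful Soup calls described below. It assumes chromedriver is on your PATH, and the id shown in the comment is the pattern the DOM inspector revealed on our councils' forms.
# sketch: list the form's checkbox ids to confirm the shared field names
# (assumes chromedriver is installed; the id in the comment is one the inspector showed)
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.nccoastalpines.org/en/activities/activity-list.advanced.html")
soup = BeautifulSoup(driver.page_source, "lxml")
for checkbox in soup.find_all("input", attrs={"type": "checkbox"}):
    print(checkbox.get("id"))  # e.g. "sf-activities:program-level/seniors"
driver.quit()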
This solution uses Selenium, Beautiful Soup, pandas, and pangres, plus a handful of other small libraries.
# import libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import sqlite3
from urllib.parse import urlparse
import hashlib
from datetime import datetime
from pangres import upsert
from sqlalchemy import create_engine
I also used a webdriver to automate the form submission.
# retrieve and parse html
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
# webdriver for Chrome
DRIVER_PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
I gathered a list of some of our favorite council form URLs.
# add urls to list here as needed; some of our favorite councils for virtual events
urls = [
    "https://www.nccoastalpines.org/en/activities/activity-list.advanced.html",
    "https://www.girlscoutstoday.org/en/events/event-list.advanced.html",
    "https://www.girlscoutsww.org/en/events/event-list.advanced.html",
    "https://www.citrus-gs.org/en/events/event-list.advanced.html",
]
Information would be stored in lists, so I set those up first.
# initiate list buckets
links = []
titles = []
startdates = []
enddates = []
councils = []
uniquekeys = []
Looping over each URL, I used the webdriver to submit the form and collect the page source.
for url in urls:

    # click the Seniors program-level checkbox and submit the form
    driver.get(url)
    # earlier attempts, kept for reference:
    # driver.find_element_by_id("sf-activities:program-level/seniors").click()
    # driver.find_element_by_xpath("//input[contains(@id, 'program-level/senior')]").click()
    driver.find_element_by_xpath(
        "//input[contains(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'program-level/senior')]"
    ).click()
    driver.find_element_by_id("sub").submit()

    # here is the page, make the soup
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')

    # get council identifier from url
    council = urlparse(url).netloc.split(".")[1]
    primarylink = urlparse(url).netloc
Then, still in the loop, I parsed the event listings out of the resulting HTML with Beautiful Soup. I also added a hashed unique key in place of the default ID, and reused the council identifier parsed from the URL above.
    # now get the data on each page
    for x in soup.find_all(class_='eventsList eventSection'):
        link = primarylink + x.h6.a["href"]
        links.append(link)
        title = x.h6.a.text
        titles.append(title)
        startdate = x.find_next("span", itemprop="startDate")['content']
        startdates.append(startdate)
        enddate = x.find_next("span", itemprop="stopDate")['content']
        enddates.append(enddate)
        councils.append(council)
        uniquekey = hashlib.sha224(str(link + startdate).encode('utf-8')).hexdigest()
        uniquekeys.append(uniquekey)
driver.quit()
I didn't strictly need to put the data in a dataframe, but pangres works from one, and I might extend the code to do some transformation later.
# put data in a dataframe
# a df not strictly needed here, except pangres uses it
df = pd.DataFrame({
'uniquekey':uniquekeys,
'link':links,
'title':titles,
'startdate': startdates,
'enddate': enddates,
'council': councils,
})
# set the unique key as the index; pangres uses the dataframe index as the primary key, and this also replaces the default numeric index
df.set_index('uniquekey', inplace=True)
Here's the table definition for reference.
# CREATE TABLE "events" (
# "uniquekey" TEXT primary key,
# "link" TEXT,
# "title" TEXT,
# "startdate" TEXT,
# "enddate" TEXT,
# "council" TEXT,
# "updateddate" DATETIME DEFAULT CURRENT_TIMESTAMP
# );
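If the table needs to be created ahead of time, here's a minimal one-time setup sketch using the sqlite3 module from the imports above and the same /tmp/gs.sqlite path used in the upsert step below.
# one-time setup sketch: create the events table in the SQLite file used below
conn = sqlite3.connect("/tmp/gs.sqlite")
conn.execute("""
    CREATE TABLE IF NOT EXISTS "events" (
        "uniquekey" TEXT PRIMARY KEY,
        "link" TEXT,
        "title" TEXT,
        "startdate" TEXT,
        "enddate" TEXT,
        "council" TEXT,
        "updateddate" DATETIME DEFAULT CURRENT_TIMESTAMP
    );
""")
conn.commit()
conn.close()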
Because the events didn't have a created date, I gave the table an updateddate column that defaults to the current timestamp, and then used the pangres library to do upserts into SQLite. This allows me to identify new entries and potentially send a notification email with that listing.
engine = create_engine('sqlite:////tmp/gs.sqlite')

upsert(engine=engine,
       df=df,
       table_name='events',
       if_row_exists='update',
       dtype=None)
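To act on the new-entry idea, here's a hedged sketch that pulls rows whose updateddate was just set; the one-hour window is an arbitrary choice, and the result could feed that notification email.
# sketch: list events inserted within roughly the last hour (window is arbitrary);
# updateddate defaults at insert time, so recent values should flag new entries
newEventsSQL = """
    select title, startdate, link, council
    from events
    where updateddate >= datetime('now', '-1 hour')
    order by startdate
"""
new_events = pd.read_sql(newEventsSQL, engine)
print(new_events)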
I can easily view the data stored in SQLite via TablePlus.
Finally, I ran a quick delete on events that had expired.
from sqlalchemy import text

deleteExpiredSQL = f"delete from events where enddate<'{str(datetime.now())}'"

with engine.connect() as connection:
    result = connection.execute(text(deleteExpiredSQL))
I can now run this periodically to find new events to share with my daughter. I haven't put it on cron yet, but that's an option for the future.
(base) vparham@Valeries-Air girlscoutevents % python gs-events.py
The complete code is here: https://github.com/dataindataout/girlscoutevents