The shutdowns brought an opportunity for my daughter to participate in virtual scouting events all over the United States. When the event registration form changed, I took the chance to try out some new web scraping skills while showing my daughter the power of code for everyday tasks.

My daughter is typically very involved in her local Girl Scout council. She has been a camp counselor and a leader of events for younger girls. She goes the extra mile to earn badges and patches that expose her to new skills and activities. She is currently working on the highest award a Girl Scout can earn, the Gold Award. With the COVID shutdown, many of her local opportunities dried up, so she took the chance to explore activities at other councils offering virtual sessions.

As just a sample: She learned to compose electronic music using JavaScript with a Florida council. During election season, she did a Rock the Vote workshop with a council in Washington State. And she earned her last two cybersecurity badges from some delightful leaders in Iowa. There's a lot of talent being shared out there in the councils.

Recently, our council changed the registration form, so I can't search for activities outside a 90-mile radius. When I asked the help desk, they advised entering different ZIP codes. In the Girl Scout spirit of "using resources wisely" -- where the resource is my time -- I knew there was a code-based solution that would work better than manually entering a lot of ZIP codes. I also had it in the back of my mind to show my daughter an example of web scraping, since she has thought about using these techniques in creating some community resources for her Gold Award... taking the chance to build up that STEM pillar.

First, I inspected the form fields using the browser DOM inspector and found that most of our favorite councils used the same object names. This solution uses Selenium, Beautiful Soup, pandas, and pangres, plus a handful of other small libraries.

# libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import sqlite3
from urllib.parse import urlparse
import hashlib
from datetime import datetime
from pangres import upsert
from sqlalchemy import create_engine

I also used a webdriver to automate the form submission.

# retrieve and parse html
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")

# webdriver for Chrome
DRIVER_PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

I gathered a list of some of our favorite council form URLs.

# add urls to list here as needed; some of our favorite councils for virtual events
urls = [
    "https://www.nccoastalpines.org/en/activities/activity-list.advanced.html",
    "https://www.girlscoutstoday.org/en/events/event-list.advanced.html",
    "https://www.girlscoutsww.org/en/events/event-list.advanced.html",
    "https://www.citrus-gs.org/en/events/event-list.advanced.html",
]

Information would be stored in lists, so I set those up first.

# initiate list buckets
links = []
titles = []
startdates = []
enddates = []
councils = []
uniquekeys = []

Looping through each URL, I used the webdriver to submit the forms and collect the page sources.
for url in urls:

    # click and submit
    driver.get(url)
    # driver.find_element_by_id("sf-activities:program-level/seniors").click()
    # driver.find_element_by_xpath("//input[contains(@id, 'program-level/senior')]").click()
    driver.find_element_by_xpath("//input[contains(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'program-level/senior')]").click()
    driver.find_element_by_id("sub").submit()

    # here is the page, make the soup
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')

    # get council identifier from url
    council = urlparse(url).netloc.split(".")[1]
    primarylink = urlparse(url).netloc

Then, still in the loop, I could parse the resulting HTML with Beautiful Soup. I also added a hashed unique id in place of the default ID, and parsed the council identifier out of the url.

    # now get the data on each page
    for x in soup.find_all(class_='eventsList eventSection'):
        link = ""
        title = ""
        startdate = ""
        enddate = ""
        link = primarylink + x.h6.a["href"]
        links.append(link)
        title = x.h6.a.text
        titles.append(title)
        startdate = x.find_next("span", itemprop="startDate")['content']
        startdates.append(startdate)
        enddate = x.find_next("span", itemprop="stopDate")['content']
        enddates.append(enddate)
        councils.append(council)
        uniquekey = hashlib.sha224(str(link + startdate).encode('utf-8')).hexdigest()
        uniquekeys.append(uniquekey)

driver.quit()

I didn't strictly need to put this in a dataframe, but it worked better with pangres, and I might extend the code to do some transformation.

# put data in a dataframe
# a df is not strictly needed here, except pangres uses it
df = pd.DataFrame({
    'uniquekey': uniquekeys,
    'link': links,
    'title': titles,
    'startdate': startdates,
    'enddate': enddates,
    'council': councils,
})

# set an index, as code below will need it; this also removes the default ID
df.set_index('uniquekey', inplace=True)

Here's the table definition for reference.

# CREATE TABLE "events" (
#     "uniquekey" TEXT primary key,
#     "link" TEXT,
#     "title" TEXT,
#     "startdate" TEXT,
#     "enddate" TEXT,
#     "council" TEXT,
#     "updateddate" DATETIME DEFAULT CURRENT_TIMESTAMP
# );

Because the events didn't have a created date, I added an automatic create date, and I used the pangres library to do upserts to sqlite. This lets me identify new entries and potentially send a notification email with that listing.

engine = create_engine('sqlite:////tmp/gs.sqlite')

upsert(engine=engine, df=df, table_name='events', if_row_exists='update', dtype=None)

I can easily view the data stored in sqlite via TablePlus.

Finally, I ran a quick delete on events that had expired.

from sqlalchemy import text

deleteExpiredSQL = f"delete from events where enddate<'{str(datetime.now())}'"

with engine.connect() as connection:
    result = connection.execute(text(deleteExpiredSQL))

I can run this periodically now to find new events to share with my daughter. Right now, I don't want to put it on cron, but that's an option in the future.

(base) vparham@Valeries-Air girlscoutevents % python gs-events.py

The complete code is here: https://github.com/dataindataout/girlscoutevents
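If I ever wire up that notification email, finding the new entries should be straightforward, since updateddate defaults to the insert timestamp. Here's a minimal, untested sketch; it assumes the script runs roughly daily, so "new" means inserted within the last day, and the printed digest is just a placeholder for whatever mail step I bolt on.

# sketch: list events inserted since the last (roughly daily) run
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:////tmp/gs.sqlite')

newEventsSQL = "select startdate, title, link from events where updateddate > datetime('now', '-1 day')"

with engine.connect() as connection:
    new_events = connection.execute(text(newEventsSQL)).fetchall()

# build a simple digest; this string could be handed to smtplib or any mail service
digest = "\n".join(f"{startdate}  {title}  {link}" for startdate, title, link in new_events)
print(digest)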
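And if I do eventually put it on cron, the entry would be a one-liner along these lines; the schedule, paths, and log location here are placeholders, not my actual setup.

# run the scraper every morning at 7:00
0 7 * * * cd /path/to/girlscoutevents && python gs-events.py >> /tmp/gs-events.log 2>&1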