Legislation is the source code of our society, but it’s often written in a way that’s inaccessible to the very people it governs. Bills can be dozens of pages long, filled with dense legalese and cross-references that make them nearly impossible to read casually. As a developer, I believe in using technology to make government more transparent and accessible. So when I looked at the Texas Legislature’s website, I saw a challenge.
What if I could use data to train an AI to read any bill and write its own summary? This is the story of how I built an end-to-end pipeline to do just that: from scraping the data, to cleaning the mess, to fine-tuning a powerful language model.
Part 1: Scraping a 20th-Century Government Website
The first step was to get the URLs for every bill that had passed, along with its full text and official summary.
The Basic Toolkit: BeautifulSoup and urllib
For simple HTML pages, BeautifulSoup and Python's urllib are the perfect tools. I started by writing a script (all_house_texas.py) to navigate the Bills Filed pages, find all the links to individual bill histories, and extract basic information like the author and caption.
# From: leg_module_test.py
# A simple function to turn a URL into a BeautifulSoup object
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "lxml")
    return soupdata

website = 'https://capitol.texas.gov/BillLookup/History.aspx?LegSess=89R&Bill=SB1'
soup = make_soup(website)

# Find the author's name by its specific HTML ID
for data in soup.findAll('td', id='cellAuthors'):
    print(data.text)
This worked well for the initial data, but I quickly hit a wall.
Leveling Up with Selenium for JavaScript
The most valuable piece of data, the link to each bill's official summary, wasn't a standard link. It was hidden behind a JavaScript onclick event that opened a new popup window. BeautifulSoup can't execute JavaScript, so it couldn't see the link.
This is a common challenge in modern web scraping. The solution? Selenium, a tool that automates a real web browser. Using Selenium, my script could load the page, wait for the JavaScript to render, and then interact with the button just like a human would.
My script texas_leg_txt_from_list.py used this approach to finally get the summary URL.
# From: texas_leg_txt_from_list.py
# Using Selenium to find an element and extract its JavaScript-based URL
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# ... (driver setup code) ...
driver.get('https://capitol.texas.gov/BillLookup/Text.aspx?LegSess=75R&Bill=HB1')
time.sleep(1.5) # Wait for the page to load
# Find the summary link by its ID
bill_summary_link = driver.find_element(By.ID, 'lnkBillSumm')
# Get the 'onclick' attribute, which contains the JavaScript function call
onclick_attribute = bill_summary_link.get_attribute('onclick')
# A bit of string manipulation to extract the URL from the JavaScript
start_index = onclick_attribute.find("('") + 2
end_index = onclick_attribute.find("'", start_index)
bill_summary_url = 'https://capitol.texas.gov/BillLookup/' + onclick_attribute[start_index:end_index]
print(f"Successfully extracted summary URL: {bill_summary_url}")
driver.quit()
With this, I could reliably gather my three key pieces of data for thousands of bills: the main page URL, the full enrolled text, and the official summary text.
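For each bill, the result was a simple record appended to one growing list. Here is a hypothetical example of the layout (the source_text and target_text field names reappear in the training code later; the output filename is my own placeholder, not the project's):
# Hypothetical record layout for one scraped bill; the filename below is a
# placeholder, not the project's actual output file.
import json

record = {
    "bill_page_url": "https://capitol.texas.gov/BillLookup/History.aspx?LegSess=75R&Bill=HB1",
    "source_text": "...full enrolled bill text...",
    "target_text": "...official summary text...",
}

with open("texas_bills_raw.json", "w") as f:
    json.dump([record], f)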
Part 2: The Cleanup Crew - Turning HTML Chaos into Clean Data
Web data is messy. The scraped text was littered with extra whitespace, non-breaking space characters (\xa0), and other HTML artifacts. Before I could feed this to a model, it needed a serious cleanup.
My start_analysis_0.py script was dedicated to this crucial step.
First, I wrote a simple cleaning function using regex to standardize whitespace and remove junk characters.
# From: start_analysis_0.py
# A function to clean raw text scraped from the web
import re

def clean_text(text):
    if not isinstance(text, str):
        return ""  # Handle cases where data might not be a string
    text = text.replace('\xa0', ' ')   # Replace non-breaking spaces with regular spaces
    text = re.sub(r'\s+', ' ', text)   # Collapse all whitespace into single spaces
    text = text.replace('__', '')      # Strip leftover underscore artifacts
    # ... other replacements ...
    return text.strip()
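A quick sanity check on a typical scraped fragment shows what the cleaner does:
raw = "AN ACT\xa0 relating   to\n\n public schools__"
print(clean_text(raw))
# -> "AN ACT relating to public schools"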
Next, I implemented a critical quality-control check. A good summary should be significantly shorter than the original text, but not so short that it's useless. I decided to keep only pairs where the summary's length was between 10% and 100% of the full bill's length. This filtered out bad scrapes and irrelevant data. After processing thousands of bills, I plotted a histogram of this ratio.
This visualization confirmed that most of my data fell into a healthy distribution, validating my filtering logic. The final, cleaned dataset was saved as a single JSON file, ready for the main event.
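In rough outline, that filtering-and-saving step looks like this (a sketch rather than the exact start_analysis_0.py code; the DataFrame handling and filenames are my own assumptions):
# Sketch of the length-ratio quality filter; the thresholds come from the
# text above, while the filenames and DataFrame columns are illustrative.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json("texas_bills_raw.json")
df["ratio"] = df["target_text"].str.len() / df["source_text"].str.len()

# Histogram of summary length / bill length across all scraped pairs
df["ratio"].hist(bins=50)
plt.xlabel("summary length / bill length")
plt.show()

# Keep only pairs where the summary is 10%-100% of the bill's length
clean = df[(df["ratio"] >= 0.10) & (df["ratio"] <= 1.00)]
clean.to_json("texas_bills_clean.json", orient="records", lines=True)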
Part 3: Fine-Tuning a T5 Model
Now for the fun part. I chose the t5-small model, a compact but versatile Text-to-Text Transfer Transformer. T5 is well suited to instruction-style tasks: by prepending each bill's text with the prefix "summarize: ", I could tell the model that its job was to produce a summary. My training pipeline was built in start_analysis_1.py and start_analysis_2.py using the Hugging Face transformers and TensorFlow libraries.
The core process looks like this:
- Tokenization: This step converts the text into numerical IDs that the model can understand. I created a pre-processing function to handle this for both the bill text (the input) and the summary (the target label).
# From: start_analysis_1.py
# This function tokenizes the text and prepares it for the model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')
prefix = "summarize: "

def preprocess_function(examples):
    # Prepare the input text with the "summarize: " prefix
    inputs = [prefix + doc for doc in examples["source_text"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    # Tokenize the target summaries
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["target_text"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
- Data Loading: I loaded the cleaned JSON into a Hugging Face Dataset object and applied the tokenization.
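Roughly, that loading step looks like this (a sketch; the filename and split size are assumptions carried over from the cleaning sketch above):
# Load the cleaned records into a Hugging Face Dataset and tokenize them.
from datasets import load_dataset

raw_dataset = load_dataset("json", data_files="texas_bills_clean.json", split="train")
splits = raw_dataset.train_test_split(test_size=0.1)   # hold out a validation slice
tokenized = splits.map(preprocess_function, batched=True)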
- Training: Finally, I configured the model, set up an optimizer, and kicked off the training process with model.fit(). This is where the magic happens. The model iterates through the data, making predictions, comparing them to the official summaries, and adjusting its internal weights to get better and better.
# From: start_analysis_1.py
# The final training call in TensorFlow
import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM, AdamWeightDecay

# ... (dataset and data_collator setup) ...

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.compile(optimizer=optimizer)

# Let's train!
model.fit(
    x=tf_train_set,
    validation_data=tf_test_set,
    epochs=3,
)

# Save the fine-tuned model for later use
model.save_pretrained("./t5_small_texas_bills")
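With the model saved, generating a summary for an unseen bill can look roughly like this (a minimal sketch, not the project's exact inference code; generation settings such as beam search are illustrative):
# Minimal inference sketch; the generation parameters are illustrative choices.
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = TFAutoModelForSeq2SeqLM.from_pretrained("./t5_small_texas_bills")

bill_text = "AN ACT relating to providing for a reduction of the limitation on ..."
inputs = tokenizer("summarize: " + bill_text, return_tensors="tf",
                   max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))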
Part 4: The Verdict - Did It Work?
After hours of training, the moment of truth arrived. I fed the model a bill it had never seen before to see what it would produce.
Here's an example:
**Original Bill Text (Snippet):** "...AN ACT relating to providing for a reduction of the limitation on the total amount of ad valorem taxes that may be imposed by a school district on the residence homesteads of the elderly or disabled to reflect any reduction in the school district's tax rate and protecting a school district against any resulting loss in local revenue..."
**Official Summary (Ground Truth):** "This Act reduces the limitation on school district ad valorem taxes for elderly or disabled homeowners to reflect tax rate reductions. It ensures that school districts are compensated for any resulting revenue loss. The change applies starting with the 2007 tax year, contingent on voter approval of a related constitutional amendment."
**AI-Generated Summary:** "The bill relates to a reduction of the limitation on the total amount of ad valorem taxes that may be imposed by a school district on the residence homesteads of the elderly or disabled. The bill would also protect a school district against any resulting loss in local revenue."
**Analysis:** The model correctly identified the main subjects (ad valorem taxes, school districts, elderly/disabled homesteads) and the core action (reducing the tax limitation). While not as polished as the human-written summary, it's accurate and captures the essence of the bill. It is a functional and useful TL;DR.
Conclusion
This project was a powerful reminder that the most impactful AI applications often rely on a foundation of gritty data engineering. Building the model was the final 20% of the work; the first 80% was the painstaking process of acquiring and cleaning the data.
The final model isn't perfect, but it's a powerful proof-of-concept. It demonstrates that we can use modern AI to make complex civic information more accessible to everyone.
What's next?
- Deploying as an API: Turn this into a public service where anyone can submit a bill's text and get a summary.
- Creating a bot: Automatically post summaries of newly filed bills as they appear.
- Expanding to other states: Every state legislature has the same problem, and this entire pipeline could be adapted.
If you’re a developer, I encourage you to look at your own local government's data. You might find a similar opportunity to build something that not only sharpens your skills but also serves your community.
