While building ScrapingBee, I check different forums every day to answer web scraping questions and engage with the community.
This is very common for an early-stage startup: there are many benefits to engaging with potential customers by answering their questions.
First, you get to know them better, and it can give you ideas for product development.
Second, you provide value, which makes them trust you.
Some forums can send you alerts about keywords or tags; others don't.
Today we are going to see how you can quickly create a CRON job with AWS Lambda and Python to check for keywords in a forum topic.
In order to scaffold and deploy our project to AWS Lambda, we will use the Serverless framework.
It's a great project that makes building/configuring your cloud functions really easy with a simple configuration file.
It handles many different clouds (AWS, Google Cloud, Azure...) and different languages.
Here are the instructions to install it: https://serverless.com/framework/docs/providers/aws/guide/quick-start/
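For reference, on a machine with Node.js installed, the setup usually boils down to something like this (the service path below is just an example, not a required name):

npm install -g serverless
serverless create --template aws-python3 --path forum-keyword-alerts
cd forum-keyword-alerts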
We will use the very popular Python packages Requests and BeautifulSoup to parse the HTML code:
pip install requests
pip install beautifulsoup4
pip freeze > requirements.txt
Without the Serverless framework, we would need to package the dependencies into a zip file and upload it to AWS ourselves.
Thanks to Serverless, we can use a plugin that will parse the requirements.txt file and automatically take care of packaging the dependencies into a Lambda Layer.
To do so:
npm init
After accepting all the defaults, add this to your serverless.yml:
# serverless.yml
plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: non-linux
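Note that the plugin also needs to be installed in the project itself, for example as an npm dependency:

npm install --save serverless-python-requirements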
You can get more information about this here: https://serverless.com/blog/serverless-python-packaging/
We are going to monitor keywords on IndieHackers.com, a popular forum for bootstrapped founders.
Here is some simple code that will check all post titles on the home page for the keyword "design":
import json

import requests
from bs4 import BeautifulSoup


def hello(event, context):
    base_url = "https://www.indiehackers.com/"

    # Fetch the Indie Hackers home page and parse it
    r = requests.get(base_url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Every post title in the feed is a link with this class
    matches = soup.select('a.feed-item__title-link')

    keyword = 'design'
    matching_links = []
    for i in matches:
        if keyword in i.text:
            matching_links.append(base_url + i.get('href'))

    response = {
        "statusCode": 200,
        "body": json.dumps(matching_links)
    }

    return response
Now all we need to do is send a Slack notification (or an email) when something matches our keyword.
It's really easy with Slack: you just have to create an app to get a webhook URL, as explained here.
json = {"text": f"Found a topic matching the keyword on Indie Hackers: {matching_links}"}
slack_request = requests.post(
WEBHOOK_URL, json=json, headers={"Content-Type": "application/json"}
)
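If you're wondering where WEBHOOK_URL comes from, here is a minimal sketch of how the notification could be wired up, assuming the webhook URL is stored in an environment variable named SLACK_WEBHOOK_URL (that name and the notify_slack helper are just examples):

import os

import requests

# Assumption: the Slack webhook URL is exposed to the Lambda through an
# environment variable (it can be declared under provider.environment in serverless.yml)
WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]


def notify_slack(matching_links):
    # Post a simple text message to the Slack incoming webhook
    payload = {"text": f"Found a topic matching the keyword on Indie Hackers: {matching_links}"}
    requests.post(WEBHOOK_URL, json=payload, headers={"Content-Type": "application/json"})

You would then call notify_slack(matching_links) from hello() whenever matching_links is not empty.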
And "voilà" for the code.
In order to invoke your function:
serverless invoke -f hello --log
To automate the function invocation with a CRON job, add this to your serverless.yml:
functions:
  hello:
    handler: handler.hello
    events:
      - schedule: rate(1 day)
There are different ways to write schedule expressions with AWS; you can find a detailed article about them here.
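For example, if you would rather run the check at a fixed time every day (here, 9:00 AM UTC) instead of a simple rate, the event could use an AWS cron expression instead:

      - schedule: cron(0 9 * * ? *)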
And now the deployment command:
serverless deploy
And that's it. Easy, right?
I hope you liked this article; it was a little introduction to the Serverless framework and how easy it is to build simple utility scripts like this one.
If you like web scraping, I just wrote an article about the different web scraping tools available, so don't hesitate to take a look.
Stay tuned for other blog posts about web scraping :)