All scrapers can be split into 2 categories that require different infrastructure and techniques:
Now let’s try to design a system that can execute targeted scrapers regularly. The system only has one functional requirement: it must be able to execute arbitrary scrapers on schedule.
What about non-functional requirements?
Here’s an example of what such a system could look like.
As a result, we have the following components:
Scraper configs storage stores each scraper’s schedule and the parameters that are passed to the scraper logic implementation. For example, this could be a static configuration file (if you don’t need to reconfigure scrapers on the fly) or a standalone database.
Scraper jobs storage stores the results and status of each scraper job.
Scheduler enqueues scraper jobs based on their configs. It needs to make sure there is no other active job for the same scraper before enqueuing a new one (see the sketch after this list).
Worker consumes jobs from the queue and executes the scraper logic. It also keeps the job status up to date; for example, it should mark a job as completed or failed after the logic finishes executing.
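To make the Scheduler’s job a bit more concrete, here is a minimal sketch of the check-and-enqueue step. The ScraperJob record, the JobStatus values, and the configs_storage / jobs_storage / queue interfaces are hypothetical names introduced only for illustration:

from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Any


class JobStatus(Enum):
    QUEUED = "queued"
    STARTED = "started"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class ScraperJob:
    scraper_name: str
    scraper_config: Any
    status: JobStatus
    created_at: datetime


class Scheduler:
    def __init__(self, configs_storage, jobs_storage, queue):
        self.configs_storage = configs_storage  # Scraper configs storage
        self.jobs_storage = jobs_storage        # Scraper jobs storage
        self.queue = queue                      # Jobs queue

    def schedule_due_scrapers(self) -> None:
        for config in self.configs_storage.load_all():
            # Don't enqueue a new job while one is still queued or running
            if self.jobs_storage.has_active_job(config.scraper_name):
                continue
            job = ScraperJob(
                scraper_name=config.scraper_name,
                scraper_config=config.parameters,
                status=JobStatus.QUEUED,
                created_at=datetime.now(timezone.utc),
            )
            self.jobs_storage.save(job)  # register the job as Queued
            self.queue.enqueue(job)      # hand it over to a Worker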
Now let’s see how everything works together:
Scheduler loads all scraper configs and starts scheduling jobs.
Scheduler registers a job in the Scraper jobs storage with the status Queued
Scheduler enqueues the registered job in the Jobs queue
Worker dequeues a scraper job from the Jobs queue and marks it as Started
Worker spawns a Scraper runner that executes the scraper logic. It also starts a thread that continuously reports that the job is still active, so the Scheduler doesn’t enqueue another job (a sketch of this loop follows these steps)
Scraper runner executes scraper logic.
All requests made by the scraper logic are intercepted by the Requests middleware, which can be used for:
The scraper logic may save its results somewhere or call some APIs based on the data it has scraped
When the scraper logic has finished executing, the Scraper runner marks the job as Succeeded or Failed.
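And here is a rough sketch of the Worker side of this flow, reusing the hypothetical jobs_storage and queue interfaces from the Scheduler sketch above; the heartbeat interval and method names are likewise made up for illustration:

import threading


def heartbeat(jobs_storage, job, stop_event: threading.Event, interval: float = 30.0):
    # Periodically report that the job is still active so the Scheduler
    # doesn't enqueue a duplicate while this one is running.
    while not stop_event.is_set():
        jobs_storage.ping(job)
        stop_event.wait(interval)


def consume_jobs(queue, jobs_storage, runner):
    while True:
        job = queue.dequeue()
        jobs_storage.set_status(job, "started")

        stop_event = threading.Event()
        threading.Thread(
            target=heartbeat, args=(jobs_storage, job, stop_event), daemon=True
        ).start()
        try:
            runner.execute(job.scraper_name, job.scraper_config)
            jobs_storage.set_status(job, "succeeded")
        except Exception:
            jobs_storage.set_status(job, "failed")
        finally:
            stop_event.set()  # stop the heartbeat once the job is finished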
How does the Scraper runner execute the scraper logic? There are multiple options. Let’s discuss how we can solve this problem in Python, since most scrapers are written in it.
One of the approaches that is used in
In this case scraper developers and platform owners are decoupled.
Another approach can be used when the scraper developers and the platform owners don’t need to be decoupled, e.g. when a single team both develops the scrapers and maintains the scraping platform.
In this case, the scraper runner can simply import all the scraper implementations and execute them based on what it has dequeued from the queue. Here’s an example implementation:
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, override  # "override" requires Python 3.12+ (or typing_extensions)

import requests


# Scraper implementation interface
class ScraperABC(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def execute(self, config: Any) -> Any:
        ...


# Scraper config that is passed to the Scraper implementation
@dataclass
class ScraperConfig:
    ...


# Sample Scraper implementation
class SampleScraper(ScraperABC):
    @property
    @override
    def name(self) -> str:
        return "sample_scraper"

    @override
    def execute(self, config: ScraperConfig) -> Any:
        # <do the actual scraping>
        text = requests.get("https://example.com").text
        return text[:100]


class Worker:
    def __init__(self, scrapers: list[ScraperABC]):
        self.queue = ...
        self.scrapers = {
            scraper.name: scraper
            for scraper in scrapers
        }

    # Scraper runner implementation
    def execute_scraper(self, name: str, config: Any) -> Any:
        if name not in self.scrapers:
            raise Exception(f"Unknown scraper {name}")
        return self.scrapers[name].execute(config)

    # Consume the queue
    def start(self):
        while True:
            task = self.queue.dequeue()
            self.execute_scraper(
                task.scraper_name,
                task.scraper_config,
            )


if __name__ == "__main__":
    Worker(
        scrapers=[
            SampleScraper()
        ]
    ).start()
This approach is simpler to maintain and has fewer caveats, although it has some limitations:
The availability and scalability of the platform come down to its components.
To get a clearer understanding of which components to choose, let’s do some back-of-the-envelope calculations. Imagine we have:
As a result, the final setup can look something like this:
There are two parts that need to be monitored: the platform itself and the scrapers it executes.
Let’s start with platform monitoring. We want to have observability over the following:
Resource utilisation: CPU, Memory, Disk usage of all the components
Queue metrics:
Scheduler metrics:
Worker metrics:
All these metrics should tell us how well the platform performs overall.
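As a sketch of how the platform could expose such metrics, here is a minimal example using the prometheus_client library; the metric names and labels are assumptions made for illustration, and resource utilisation (CPU, memory, disk) would typically be collected at the infrastructure level instead:

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Queue metrics
QUEUE_DEPTH = Gauge("scraper_queue_depth", "Jobs currently waiting in the queue")

# Scheduler metrics
JOBS_SCHEDULED = Counter("scraper_jobs_scheduled_total", "Jobs enqueued by the Scheduler")

# Worker metrics
JOBS_COMPLETED = Counter(
    "scraper_jobs_completed_total", "Jobs finished by Workers", ["status"]
)
JOB_DURATION = Histogram("scraper_job_duration_seconds", "Scraper job execution time")


def execute_with_metrics(worker, job):
    # Wrap job execution so every run is reflected in the metrics
    with JOB_DURATION.time():
        try:
            worker.execute_scraper(job.scraper_name, job.scraper_config)
            JOBS_COMPLETED.labels(status="succeeded").inc()
        except Exception:
            JOBS_COMPLETED.labels(status="failed").inc()
            raise


def start_metrics_endpoint(port: int = 8000) -> None:
    # Expose a /metrics endpoint for Prometheus to scrape
    start_http_server(port)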
As a platform, we could also expose some metrics for scraper developers so they can easily keep their scrapers up to date. Here are some metrics that come to mind:
There are actually not that many existing solutions that provide a full-fledged scraping experience.
The most popular is Scrapy, an open-source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. There are multiple ways to host it:
Crawlee is another popular solution for creating and running scrapers that drive headless browsers through tools such as Playwright or Selenium. However, there are no built-in scheduling or monitoring capabilities, so you’ll either need to implement them yourself or use Apify to host the scrapers.
Another notable option is to use no-code solutions like ParseHub or Octoparse, which allow you to train scrapers by clicking on what you want to scrape.
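For a sense of what the Scrapy route looks like compared to the hand-rolled ScraperABC interface above, here is a minimal spider sketch; it targets quotes.toscrape.com, Scrapy’s demo site, and only shows the framework’s basic usage:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract structured items from the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination links, if any
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Such a spider can be run standalone with scrapy runspider, while scheduling and hosting are then handled by one of the hosting options mentioned above.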