A growing number of companies are turning to alternative data sources to gain a competitive advantage in an increasingly fierce global race. The big data and analytics market is expected to surpass
As the numbers show, utilizing data is a costly endeavor. It requires powerful data harvesting software and large teams of analysts to clean, process, and make sense of the collected information. In order to save a pretty penny, businesses are looking for ways to automate the data cycle and eliminate the dependency on data professionals.
So, it is no surprise that the recent hype around generative artificial intelligence (AI) solutions raised the question of whether modern AI systems could fully automate web data collection and analysis. To answer it, one must look into two distinct subjects – the limitations of today’s AI systems and the peculiarities of web scraping.
Web scraping involves many repetitive tasks and thus seems like a perfect candidate for combining AI and robotic process automation. However, efforts to develop a generic AI data extractor haven’t been all that successful so far.
When it comes to data collection at scale, developing a generic AI is an extremely challenging task. A few no-code tools designed for web scraping already exist, but they have limited scalability and usually serve a specific purpose – for example, training scraping robots. Part of the challenge lies in the mechanics of AI itself, and part in the peculiarities of web scraping.
Almost every AI system today is built on some iteration of machine learning (ML) technology, whose performance depends on the quality, accuracy, and scope of the training data. An ML model can partially process real-time data, as Microsoft Bing does, or learn from static datasets that must be constantly updated.
Getting the right amount of high-quality data and cleaning it of errors and biases is the most pressing task for ML developers. The training process also involves a lot of manual work, such as labeling. By one estimate, a supervised deep-learning algorithm needs a dataset with around
At Oxylabs, we faced this challenge directly when training our
Supervised transfer learning, reinforcement learning, and unsupervised techniques, such as generative modeling, offer some hope of solving the manual-work challenge. However, these techniques have their own drawbacks: limited accuracy, difficulties when shifting between source and target domains, poor categorization of new data, and high computational costs. Moreover, they require highly specific expertise, which puts them out of reach for many businesses.
The second part of the challenge lies in the peculiarities of the web scraping process itself. The most critical aspect of collecting public web data at scale is quality. Data must be accurate and fit for purpose – obsolete, erroneous, or redundant data can corrupt entire datasets and produce misleading analytical insights.
However, ensuring data quality is a tedious task. Web data is scattered across multiple sources and formats, and it is often granular and sensitive to the surrounding context and use case. Additionally, the internet is in a perpetual state of change. Almost every website is different, and its structure changes constantly, which means scraping scripts must be continually updated to keep the data flowing. Due to the volatile nature of the web, breaks in the scraping and parsing pipelines are inevitable over time.
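To make that maintenance burden concrete, here is a minimal sketch in Python (the URL and CSS selectors are hypothetical, not taken from any real pipeline) of how a typical scraping script hard-codes assumptions about page structure and has to be patched every time that structure shifts:

```python
# Minimal sketch: a parser hard-wired to one page layout (hypothetical
# selectors and URL). A single markup change on the target site breaks it,
# which is why scraping scripts need constant maintenance.
import requests
from bs4 import BeautifulSoup

PRICE_SELECTORS = [
    "span.product-price",      # current layout
    "div.price > span.value",  # previous layout, kept as a fallback
]

def extract_price(html: str) -> str | None:
    """Try known selectors in order; return None when none of them match."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node:
            return node.get_text(strip=True)
    return None  # layout changed again -- the script needs another update

if __name__ == "__main__":
    response = requests.get("https://example.com/product/123", timeout=10)
    price = extract_price(response.text)
    if price is None:
        print("Parsing broke: the page structure no longer matches our selectors.")
    else:
        print(f"Extracted price: {price}")
```

Every new layout means another selector appended by hand – exactly the kind of repetitive maintenance an AI-driven scraper would need to take over.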
Therefore, a web scraping AI would have to determine which data is accurate and relevant, bypass anti-scraping measures, adapt to changes in page structure (maintaining and debugging the code if necessary), and, at the same time, navigate different legal regulations around data privacy, sites’ Terms of Service, and so on. It would also have to make cloud storage decisions, which could turn out strange and expensive if the AI runs unsupervised. Even for small-scale data extraction, this is a lot for a machine to handle unless it reaches something close to superintelligence.
Let’s say the AI has to decide whether publicly available information on a person's social media account should be considered private data and, if so, left unscraped. Private data regulation is highly complex – one has to comply with existing laws, legal precedents, the site’s Terms of Service, and more. The decision is often interpretative in nature and cannot be reduced to a set of ML predictions.
So, a single optimal way to scrape any page does not exist since web content is noisy, inconsistent, and follows no strict rules. If data collection is of small scale and quality or accuracy issues are not critical for decision-making, generic AI solutions might perform decently well. If, on the contrary, web scraping is performed on a large scale, from multiple sources or sensitive targets, and data quality is critical, constant human supervision of the AI systems will be necessary, resulting in higher costs and process bottlenecks.
Even though it is too early to talk about a general AI solution for data collection, automating specific parts of the data extraction process is a natural step in improving the technology. The usual suspects for AI automation are parsing, browser fingerprinting, and product mapping.
As much as AI can improve specific data extraction processes, web scraping can aid ML technology by providing diverse and high-quality data for model training. Therefore, advancements in web scraping capabilities lead to more sophisticated ML models and vice versa.
At Oxylabs, we have already developed several patented ML-powered solutions for automating repetitive web scraping tasks. The most advanced is Web Unblocker, which uses a simple back-connect proxy to extract public data from sites with rigid anti-scraping measures.
The product runs on a couple of ML models. The first is based on reinforcement learning and dynamically serves unique fingerprints, testing which set of parameters is most effective for unlocking a specific page. The second performs automated response recognition, checking responses and deciding whether they carry relevant data or hide a block or CAPTCHA. Together, they form a feedback loop between the scraping results and the experimentation engine that determines outcome quality.
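As an illustration of that feedback loop (not Oxylabs' actual implementation – the fingerprint parameters, block markers, and the `fetch` stub below are all hypothetical), a simple epsilon-greedy bandit over fingerprint configurations combined with a crude response check could look like this:

```python
# Minimal sketch of the feedback loop described above: an epsilon-greedy
# bandit picks a fingerprint configuration, a response check decides whether
# the result is usable data or a block/CAPTCHA, and that verdict updates the
# bandit's statistics. All names and parameters are illustrative.
import random
from collections import defaultdict

FINGERPRINTS = [
    {"user_agent": "Chrome/124", "tls": "modern", "headers": "desktop"},
    {"user_agent": "Firefox/126", "tls": "modern", "headers": "desktop"},
    {"user_agent": "Safari/17", "tls": "legacy", "headers": "mobile"},
]

def looks_blocked(html: str) -> bool:
    """Crude response recognition: flag pages that look like blocks or CAPTCHAs."""
    markers = ("captcha", "access denied", "unusual traffic")
    return len(html) < 500 or any(m in html.lower() for m in markers)

class FingerprintBandit:
    """Epsilon-greedy selection over fingerprint configurations."""

    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.successes = defaultdict(int)
        self.attempts = defaultdict(int)

    def choose(self) -> int:
        if random.random() < self.epsilon or not any(self.attempts.values()):
            return random.randrange(len(FINGERPRINTS))  # explore
        # exploit: pick the fingerprint with the highest observed success rate
        return max(
            range(len(FINGERPRINTS)),
            key=lambda i: self.successes[i] / max(self.attempts[i], 1),
        )

    def update(self, index: int, success: bool) -> None:
        self.attempts[index] += 1
        if success:
            self.successes[index] += 1

def scrape(url: str, bandit: FingerprintBandit, fetch) -> str | None:
    """One pass of the loop: pick, fetch, check, and feed the result back.
    `fetch(url, fingerprint)` is a hypothetical stub for the actual request layer."""
    index = bandit.choose()
    html = fetch(url, FINGERPRINTS[index])
    success = not looks_blocked(html)
    bandit.update(index, success)
    return html if success else None
```

In a production system, the response-recognition step would itself be an ML classifier rather than keyword matching, and the experimentation engine would track far more parameters, but the pick–check–update loop is the core idea.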
Yet another successful solution is the Adaptive Parser, which covers the parsing part of our Scraper API routines. The model structures specific data elements by locating their HTML attributes on a given website. Essentially, it assigns parts of HTML strings to specific categories and fields, such as price, title, and description.
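The general idea can be sketched as a standard classification problem (the features, toy examples, and scikit-learn pipeline below are illustrative assumptions, not the Adaptive Parser's actual architecture): describe each HTML element by its tag, class, position, and simple text cues, then train a model to map those descriptions to fields.

```python
# Minimal sketch of field classification: each HTML element is described by
# attributes rather than exact markup, and a model learns to map those
# descriptions to fields such as "price" or "title". Toy data only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def element_features(tag: str, css_class: str, depth: int, text: str) -> dict:
    """Describe one HTML element by its attributes and simple text cues."""
    return {
        "tag": tag,
        "class": css_class,
        "depth": depth,
        "has_currency_symbol": any(s in text for s in "$€£"),
        "text_length": len(text),
    }

# Toy labeled examples: (element description, target field)
training = [
    (element_features("span", "price", 4, "$19.99"), "price"),
    (element_features("h1", "product-title", 2, "Wireless Mouse"), "title"),
    (element_features("div", "description", 3, "A compact mouse..."), "description"),
    (element_features("span", "amount", 5, "€24.00"), "price"),
]

X, y = zip(*training)
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(list(X), list(y))

# A previously unseen element from a page with a different layout:
unseen = element_features("span", "value", 6, "£9.50")
print(model.predict([unseen])[0])  # likely "price" given the toy data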
Developing these solutions has taught us some lessons. First, constant re-training of ML models is a must to keep decent success rates. Circumstances surrounding web data are dynamic – the structures of web pages change, making it difficult for the algorithm to recognize where the data is located; moreover, new ML-driven IP blocking and anti-scraping measures emerge constantly.
Second, finding enough good-quality data for ML training is the most challenging issue. Data gathering wasn’t a problem for us, but data labeling definitely was, as it required a lot of manual work. Sometimes the data is simply rare and hard to come by. For example, to train our Adaptive Parser, we had to find examples of e-commerce sites in almost every human language, as the model must perform several critical tasks: determine whether the data is in English; if not, identify which language it is dealing with; and then translate it into English.
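As a rough sketch of those language-handling steps (the `langdetect` library and the `translate` stub are illustrative choices, not the Adaptive Parser's internals), the pipeline boils down to detect-then-translate:

```python
# Minimal sketch: check whether the text is English, detect the language if
# not, and pass it to a translation step. The translate callable is a
# hypothetical stand-in for whatever translation model or API is used.
from langdetect import detect  # lightweight language identification

def normalize_to_english(text: str, translate) -> str:
    """Return English text; `translate(text, source_lang)` is a hypothetical callable."""
    language = detect(text)
    if language == "en":
        return text
    return translate(text, source_lang=language)

if __name__ == "__main__":
    # Placeholder translator for demonstration purposes only.
    fake_translate = lambda text, source_lang: f"[translated from {source_lang}] {text}"
    print(normalize_to_english("Precio: 19,99 € – envío gratis", fake_translate))
```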
Despite the initial effort, automating data parsing saved us considerable time later. With AI, there is no need to readjust a dedicated parser for each individual web page – the trained model recognizes specific data points even after layout changes, as it holds a body of knowledge about HTML tag usage, relative element positions, and other attributes. The more datasets it absorbs during training, the better it becomes at structuring data extracted from similar websites.
Developments in AI and web scraping technologies strongly reinforce each other – one technology can offer extensive breakthroughs in the other. At Oxylabs, we believe that AI will be the key catalyst of web scraping advancements.
However, it is unlikely that generic AI solutions will replace specialized web scraping software and human work in the near future due to the complex and dynamic nature of websites, variability in data sources, and legal and ethical considerations. AI might help an organization improve its data structuring or speed up some data collection processes, but it can hardly deliver the seismic changes businesses are looking for today.