Writing a Web Scraper With ChatGPT. How Good is It?

by Pierluigi Vinciguerra, April 17th, 2023

Too Long; Didn't Read

In November, OpenAI released ChatGPT, based on GPT-3.5, and the news was literally everywhere. It’s time to have a closer look at AI for web scraping. Can AI write scrapers for us? At the moment, we cannot expect ChatGPT to write a fully working scraper for us.

In November, after OpenAI released ChatGPT, based on GPT-3.5, news about it spread like wildfire. On that occasion, I wrote about AI and web scraping, and followed it up by asking people across the industry for their point of view on the state of AI for web scraping.


Five months later, we have GPT-4, and tons of applications have been built on top of GPT models, so it’s time to take a closer look at AI for web scraping.



Can AI write scrapers for us?

At the moment, we cannot expect ChatGPT to write a fully working scraper for an arbitrary website. It will return a syntactically correct scraper, but with generic selectors that are not useful for our case. If we ask it to scrape a well-known website, it might return some correct mappings, but only if the answer was already available somewhere like Stack Overflow.
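
To make the point concrete, here is roughly the kind of generic skeleton you get back without site-specific instructions. This is an illustrative sketch of my own, not real ChatGPT output, and the selectors are placeholders:

import scrapy

class GenericProductSpider(scrapy.Spider):
    name = 'generic_products'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Placeholder selectors: syntactically valid Scrapy code,
        # but they won't match anything on a real target site
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }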


Given that, let’s try to build a scraper from scratch with ChatGPT for a niche website such as https://www.gianvitorossi.com/it_it/


I’ll go straight to the final prompt I wrote to get a correct Scrapy spider, where I asked it to map four fields and explained to ChatGPT how to navigate the website.


Create a Python web scraper using the Scrapy framework and XPath selectors.


The start_urls parameter should contain the URL [https://www.gianvitorossi.com/it_it](https://www.gianvitorossi.com/it_it) to be scraped.


After scraping the first page, the scraper should enter a function called get_categories.


In this function the scraper extracts all the relative links to different product types, using the following XPath selector: product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()

For each product page, it should enter a function called get_product_list.


Extract the list of products on the page using the XPath selector '//a[contains(@class, "b-product_tile-image_link")]/@href'.


If there are more products to be shown, crawl to the following product page using the XPath ('//a[data-event-click.prevent="loadMore"]/@href').extract()[0], which contains the absolute path to the next page, and include this selector in a try/except clause. If there is no match for the selector, a pass instruction should be used in the except clause.


For each product extracted, enter a function called get_product_details.


Write the logic to scrape the following fields:

- product code

- full_price

- price

- product_url


The product detail page contains a JSON with most of the data we need. It can be found by looking for the first occurrence of the script type="application/ld+json" and extracting the text from there.


It follows the standard [Schema.org](http://schema.org/) Product type, and you can map the product_code field to the JSON's mpn field.


The field price can be mapped by entering the offers dict in the JSON and selecting the price field.


The product_url field can instead be mapped using the response.url value.


Since there are no discounts on this website, the full_price field can be mapped to the same value as the price field.
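
To make that mapping concrete, here is a minimal sketch of the idea using a made-up JSON-LD snippet shaped like the standard Schema.org Product type (the real values on the site will differ):

import json

# Illustrative JSON-LD, shaped like a Schema.org Product (values are made up)
sample_ld_json = '''
{
  "@context": "https://schema.org/",
  "@type": "Product",
  "mpn": "G12345",
  "offers": {"@type": "Offer", "price": "650.00", "priceCurrency": "EUR"}
}
'''

data = json.loads(sample_ld_json)
item = {
    'product_code': data['mpn'],            # product_code <- mpn
    'price': data['offers']['price'],       # price <- offers.price
    'full_price': data['offers']['price'],  # no discounts, so same as price
}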


And here’s the code given in the response:

import scrapy
import json

class GianvitoRossiSpider(scrapy.Spider):
    name = 'gianvitorossi'
    start_urls = ['https://www.gianvitorossi.com/it_it']

    def parse(self, response):
        # Extract product categories
        product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()
        for product_page in product_pages:
            yield scrapy.Request(response.urljoin(product_page), callback=self.get_product_list)

    def get_product_list(self, response):
        # Extract product links
        product_links = response.xpath('//a[contains(@class, "b-product_tile-image_link")]/@href')
        for link in product_links:
            yield scrapy.Request(response.urljoin(link.extract()), callback=self.get_product_details)

        # Check for more product pages
        try:
            next_page = response.xpath('//a[data-event-click.prevent="loadMore"]/@href').extract()[0]
            yield scrapy.Request(response.urljoin(next_page), callback=self.get_product_list)
        except:
            pass

    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }
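
To try the generated spider, a minimal sketch (assuming the code above is saved as gianvitorossi.py) is to drive it with Scrapy's CrawlerProcess; the equivalent command line would be scrapy runspider gianvitorossi.py -o products.json:

from scrapy.crawler import CrawlerProcess
from gianvitorossi import GianvitoRossiSpider  # assumes the spider above lives in gianvitorossi.py

# Write the scraped items to a JSON file via Scrapy's feed exports
process = CrawlerProcess(settings={'FEEDS': {'products.json': {'format': 'json'}}})
process.crawl(GianvitoRossiSpider)
process.start()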


The scraper runs correctly and retrieves all the items. But did writing the prompt needed to create the spider take less time than writing the spider myself? It's difficult to say. In both cases, I had to write down the XPath selectors, so the AI didn't save me time there. Assuming we have a properly engineered prompt and a Scrapy spider template to reuse, the time spent modifying the prompt or the scraper is roughly the same. The only improvement in the process is that ChatGPT wrote the scraper without any bugs.
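
As a sketch of what that reusable prompt could look like, the site-specific parts can be isolated as parameters; the template and variable names below are hypothetical, not something I actually used:

PROMPT_TEMPLATE = '''Create a Python web scraper using the Scrapy framework and XPath selectors.
The start_urls parameter should contain the URL {start_url}.
Extract the category links with the XPath selector {category_xpath}.
Extract the product links with the XPath selector {product_xpath}.
For each product, map the fields {fields} from the page's application/ld+json data.'''

# Only the start URL and the selectors change from one site to the next
prompt = PROMPT_TEMPLATE.format(
    start_url='https://www.gianvitorossi.com/it_it',
    category_xpath='//a[@role="menuitem"]/@href',
    product_xpath='//a[contains(@class, "b-product_tile-image_link")]/@href',
    fields='product_code, full_price, price, product_url',
)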

Bug fixing with AI

Let’s test how useful ChatGPT is when we’re facing a bug in our scrapers. I’ll modify the previous spider by deleting a quote from an XPath and renaming a variable to one that was never initialized.

    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()).extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code3,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }

Will ChatGPT be able to fix it?


Yes. Not only did it return the corrected code, but it also explained what it fixed:


The error in the original code was a syntax error in the XPath selector used to extract the JSON data. The closing quotation mark was placed incorrectly, resulting in an invalid selector. The corrected code places the closing quotation mark in the correct location, allowing the selector to extract the JSON data correctly. Additionally, there was a typo in the name of the product_code field in the yield statement, which has been fixed in the corrected code.
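
In other words, the fixed method is the same get_product_details we saw in the original spider, with the closing quote back in the XPath and product_code3 renamed to product_code:

    def get_product_details(self, response):
        # Extract JSON data (closing quote restored in the XPath)
        json_data = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item (the typo product_code3 is corrected)
        yield {
            'product_code': product_code,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }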


This is a helpful way to use ChatGPT, since we sometimes spend a lot of time hunting down these silly mistakes.

Porting to another language

Instead of creating a scraper with Python, I used the same prompt to create a Node.js scraper. I don’t know anything about this language, so let’s see if ChatGPT can help me create my first scraper in Node.js.


Here’s the result:

const axios = require('axios');
const cheerio = require('cheerio');

const startUrl = 'https://www.gianvitorossi.com/it_it';

async function scrape(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract product categories
    const productPages = $('a[role="menuitem"]').map(function() {
      return $(this).attr('href');
    }).get();

    for (const productPage of productPages) {
      await getProducts(`https://www.gianvitorossi.com${productPage}`);
    }
  } catch (error) {
    console.error(error);
  }
}

async function getProducts(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract product links
    const productLinks = $('a.b-product_tile-image_link').map(function() {
      return $(this).attr('href');
    }).get();

    for (const link of productLinks) {
      await getProductDetails(`https://www.gianvitorossi.com${link}`);
    }

    // Check for more product pages
    try {
      const nextPage = $('a[data-event-click.prevent="loadMore"]').attr('href');
      await getProducts(`https://www.gianvitorossi.com${nextPage}`);
    } catch (error) {
      // No more product pages
    }
  } catch (error) {
    console.error(error);
  }
}

async function getProductDetails(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract JSON data
    const jsonText = $('script[type="application/ld+json"]')[0].children[0].data;
    const data = JSON.parse(jsonText);

    // Extract product details
    const productCode = data.mpn;
    const fullPrice = data.offers.price;
    const price = fullPrice;
    const productUrl = response.config.url;

    // Output data
    console.log({
      productCode: productCode,
      fullPrice: fullPrice,
      price: price,
      productUrl: productUrl
    });
  } catch (error) {
    console.error(error);
  }
}

scrape(startUrl);

This is great. The scraper is correct and works perfectly.


Final remarks

I have to admit I’m impressed by the results of these tests. While it’s true it didn’t save me much time when I wrote the Python scraper, it enabled me to write the Node.js one.


AI can lower the barrier to adopting a new language, helping you develop scrapers while you’re still studying it and debug the code you write. It doesn’t substitute good old hands-on practice, but it can help you learn faster.


In the end, AI at the moment is more an aid than a threat that could replace humans in the near future.



Also published here.