How to Scrape Large Datasets at Scale

by Teri (@terieyenike), July 5th, 2023


Large organizations invest heavily in building their applications and often frown on developers extracting data from their websites. That is why many of them publish rules, grouped by user agent, that tell automated clients what is permitted.


On most sites, you can find these rules in the robots.txt file served from the root of the domain, as in this example:


https://www.amazon.com/robots.txt
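To make the idea of per-user-agent rules concrete, here is a minimal sketch of how a robots.txt file could be read in Node. The function name parseRobotsTxt and the sample rules are made up for illustration; the sketch ignores wildcards, Crawl-delay, and Sitemap lines, so treat it as a reading aid rather than a complete parser.

```javascript
// Groups Allow/Disallow rules by user agent (names are lowercased).
// Sketch only: no wildcard matching, no Crawl-delay or Sitemap handling.
function parseRobotsTxt(text) {
  const rules = {};    // e.g. { "*": { allow: [], disallow: [] } }
  let agents = [];     // user agents the current group applies to
  let inRules = false; // true once the current group has listed a rule

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after some rules starts a new group.
      if (inRules) {
        agents = [];
        inRules = false;
      }
      const agent = value.toLowerCase();
      agents.push(agent);
      if (!rules[agent]) rules[agent] = { allow: [], disallow: [] };
    } else if (field === "allow" || field === "disallow") {
      inRules = true;
      for (const agent of agents) rules[agent][field].push(value);
    }
  }
  return rules;
}

// Made-up sample; a live file could be loaded with fetch(url).then((r) => r.text()).
const sampleRobots = [
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /public/",
  "",
  "User-agent: ExampleBot",
  "Disallow: /",
].join("\n");

console.log(parseRobotsTxt(sampleRobots));
```

Checking the rules for your own user agent (or for "*") before scraping tells you which paths the site owner has asked automated clients to avoid.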





In this article, you’ll learn how to use Bright Data’s Web Scraper integrated development environment (IDE), with its ready-made functions and code templates, to scrape datasets at scale.


Benefits of the Web Scraper IDE

These are some of the benefits of using Bright Data’s Web Scraper IDE:

  • The IDE is accessible from within the platform
  • As the leader in proxy infrastructure, Bright Data offers scalability and accuracy in web scraping
  • Its code templates help to speed up development
  • It incorporates the Web Unlocker capability to avoid CAPTCHAs and blocking


What is Bright Data?

Bright Data is a proxy network that helps you turn websites into structured data. To get started with the platform, create an account.


Check out this resource to learn more about Bright Data.


Working with Web Scraper IDE

On your account dashboard, click the Datasets & Web Scraper IDE icon, then select the Get started button to open the template window.


The new window opens a dialog box where you can select an existing dataset template to work with or create one from scratch.


Select the eBay discovery and PDP options. The page then opens with the collector code for that template already filled in.


Now scroll down the page, and under the Input tab, enter the name of the product whose data you want to extract and analyze. Once done, click the Preview button to run the preview and start the extraction.


Note: You can also enter your own scripts in the Interaction code section.

After the preview runs, the Output tab shows the results from the eBay website structured into fields such as product_url, title, image, price, and so on.


Saving the collector

To save the collector, click the Finish editing button, which opens the configuration page.


Initiate the collector by API

Under the My Scrapers tab, click the Initiate by API button to start this project and get the scripts it provides.


Creating authorization token

An API token authenticates your requests, identifying you as the account's rightful owner, and authorizes them to access your collectors.


Click on the Account settings menu at the bottom left of the window to create an API token.




After you add the API token, you will receive a verification code; enter it to confirm.


Once that is done, copy your API token right away. You won’t be able to retrieve it later and would have to create a new one.


Return to the New collector page, copy the script that matches your operating system (OS), and paste it into your terminal. Make sure to replace API_TOKEN, which follows the word Bearer, with the token you copied in the previous step.


In your command line interface or terminal, the trigger request should look something like this:


curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -d '[{"keyword":"ralph lauren polo shirt","count":10,"location":"","condition":"New unused"}]' "https://api.brightdata.com/dca/trigger?collector=c_liopmjh61f3o3lz7dz&queue_next=1"
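If you would rather trigger the collector from Node than from curl, the same request can be sketched with the built-in fetch (Node 18 or later). The helper name buildTriggerRequest and the environment variable BRIGHTDATA_API_TOKEN are assumptions made for this example, not part of Bright Data's tooling; the URL, query parameters, and input fields are copied from the curl command above.

```javascript
// Builds the same request as the curl trigger command.
function buildTriggerRequest(apiToken, collectorId, inputs) {
  const url = new URL("https://api.brightdata.com/dca/trigger");
  url.searchParams.set("collector", collectorId);
  url.searchParams.set("queue_next", "1");

  return {
    url: url.toString(),
    options: {
      method: "POST", // curl switches to POST when -d is given
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(inputs),
    },
  };
}

// Hypothetical variable name; export it in your shell before running.
const apiToken = process.env.BRIGHTDATA_API_TOKEN;

// The collector ID is the one from this article; replace it with your own.
const triggerRequest = buildTriggerRequest(apiToken, "c_liopmjh61f3o3lz7dz", [
  { keyword: "ralph lauren polo shirt", count: 10, location: "", condition: "New unused" },
]);

// Only send the request when a token is available.
if (apiToken) {
  fetch(triggerRequest.url, triggerRequest.options)
    .then((res) => res.text())
    .then((body) => console.log(body))
    .catch((err) => console.error("trigger failed", err));
}
```

Reading the token from the environment keeps it out of your source files, which matters if the project is ever committed to a repository.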




Sending the trigger request activates the code in the Result API section of the New collector page. Copy that code and paste it into your terminal as well.


Note: Remember to replace API_TOKEN with your own API token.


curl "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod" -H "Authorization: Bearer API_TOKEN"


Run the command in the CLI. While the dataset is still being collected, the response is an object whose status reads building, together with a message.


If that response keeps appearing, wait a moment and send the request again. When the collection has finished, the response contains the scraped results instead.
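Retrying by hand gets tedious, so the wait can be sketched as a small polling loop in Node (18 or later, for the built-in fetch). The function name waitForDataset, the delay, and the attempt limit are illustrative choices; the sketch also assumes that an unfinished collection is reported as an object with status set to building, as described above, and that anything else is the finished result.

```javascript
// Polls the Result API until the response no longer reports "building".
// fetchFn is injectable so the loop can be exercised without network access.
async function waitForDataset(
  datasetUrl,
  apiToken,
  { fetchFn = fetch, delayMs = 5000, maxAttempts = 20 } = {}
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetchFn(datasetUrl, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const data = await res.json();

    // Assumption: an in-progress response is an object with status "building".
    const stillBuilding =
      !Array.isArray(data) && data !== null && data.status === "building";
    if (!stillBuilding) return data;

    // Wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Dataset not ready after ${maxAttempts} attempts`);
}

// Usage (dataset ID from this article; the token variable name is hypothetical):
// waitForDataset(
//   "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod",
//   process.env.BRIGHTDATA_API_TOKEN
// ).then((rows) => console.log(rows));
```

Capping the number of attempts means a collection that never finishes surfaces as an error instead of a loop that runs forever.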


Using Postman

Let’s use Postman to get the same response from the Result API.


If you do not have Postman, download it here. Postman is an API platform for building, publishing, monitoring, testing, and documenting APIs. Check this resource article to learn more about Postman and its use.


Open the Postman app and input these values:

  • In the request section, select the GET method and enter the Result API URL
  • Click the Authorization tab, select Bearer Token from the Type dropdown, and enter your token value
  • Click the Send button to send the request
  • If the request is successful, you should see a 200 status in the response section and an array of objects containing the scraped eBay data




Creating a Node Server

Node is a JavaScript runtime environment that executes JavaScript code outside the web browser, enabling developers to build server-side applications and command-line tools.


Let’s create a web server. Initializing the project requires npm, the package manager that is installed automatically with Node.js. First, confirm that Node is installed on your local machine with this command:


node --version


It displays the current version of Node.

  1. Create a new directory. For this project, it is named datasets.
  2. Change into the directory and initialize the project with the commands:


cd datasets

npm init -y


The -y flag accepts the defaults that look like this:


package.json

{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  ...
}


  3. Install the following packages:


npm install -D nodemon


Nodemon watches your source code for changes and automatically restarts the server when it detects one.


npm install csv-parse


The csv-parse package is a parser that converts CSV text input into arrays or objects.


Now, update the scripts section in the package.json file to this:


{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "start": "node index.mjs",
    "start:dev": "nodemon index.mjs"
  },
  ...
}


  4. Next, create a new file in the root directory with the command:


touch index.mjs


To test this file, write a basic JavaScript script and run the server with the following command:


npm run start:dev


Social Media Data from Bright Data

Scraping large datasets yourself with technologies like Node or Python takes a lot of effort. One way around this is to use a platform like Bright Data to obtain the data you need and get to your results sooner.


Let’s get an Instagram dataset from Bright Data with these steps:


  1. Sign up for a Bright Data account.
  2. Go to https://brightdata.com/cp/datasets/ or select the Dataset Marketplace on the Datasets & Web Scraper IDE.


  3. Open the Dataset Marketplace, and under Categories, select Instagram.com from the Social media dropdown.
  4. Click on View dataset and download the sample dataset in CSV format.


Make sure to save the dataset as instagram.csv in the root directory of the Node project.


Your folder structure should look something like this:


.
└── datasets
    ├── node_modules
    ├── instagram.csv
    ├── package-lock.json
    ├── package.json
    └── index.mjs



Reading CSV datasets in Node.js

In this section, Node reads the comma-separated values (CSV) dataset downloaded from Bright Data.


Update the index.mjs file with the code:


import { parse } from "csv-parse";
import { createReadStream } from "node:fs";

// Rows that pass the filter below
const instagramAccount = [];

// Keep accounts with more than 300 posts, more than 6000 followers,
// and non-empty biography and posts fields
const isInstagramAccount = (info) => {
  return (
    info["posts_count"] > 300 &&
    info["followers"] > 6000 &&
    info["biography"] !== "" &&
    info["posts"] !== ""
  );
};

// Stream the file through the parser instead of loading it all at once
createReadStream("instagram.csv")
  .pipe(
    parse({
      columns: true, // use the header row as object keys
    })
  )
  .on("data", (data) => {
    if (isInstagramAccount(data)) {
      instagramAccount.push(data);
    }
  })
  .on("error", (err) => {
    console.error("error", err);
  })
  .on("end", () => {
    console.log(`${instagramAccount.length} accounts are live`);
    console.log("done");
  });



The code above does the following:

  • createReadStream(): Opens the file as a readable stream so its data is read in chunks
  • isInstagramAccount: A predicate function that decides which rows of the CSV file to keep
  • pipe(): Connects two streams, here the readable file stream to the writable destination, parse()
  • columns: true: Returns each row of the CSV file as a JavaScript object with key-value pairs rather than just an array of values
  • .on(): Chained event handlers. The data handler pushes each matching row into the instagramAccount array, the error handler logs any error, and the end handler prints the number of matching Instagram accounts followed by done when the script finishes
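One detail worth knowing about the filter: csv-parse returns every field as a string unless its cast option is enabled, so comparisons such as info["followers"] > 6000 work only because JavaScript converts the string to a number on the fly. A minimal sketch of a stricter predicate with explicit conversion follows; the name isActiveAccount and the sample rows are made up for illustration, while the column names come from the code above.

```javascript
// Same thresholds as isInstagramAccount, with explicit number conversion.
function isActiveAccount(row) {
  const posts = Number(row.posts_count);
  const followers = Number(row.followers);

  return (
    Number.isFinite(posts) && posts > 300 &&
    Number.isFinite(followers) && followers > 6000 &&
    row.biography !== "" &&
    row.posts !== ""
  );
}

// Illustrative rows shaped like parsed CSV records (all values are strings).
const sampleRows = [
  { posts_count: "450", followers: "12000", biography: "Travel photos", posts: "[...]" },
  { posts_count: "450", followers: "n/a", biography: "Travel photos", posts: "[...]" },
  { posts_count: "120", followers: "90000", biography: "", posts: "" },
];

console.log(sampleRows.filter(isActiveAccount).length);
```

Rows whose numeric columns contain non-numeric text are rejected outright here instead of relying on an implicit comparison with NaN.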


Running the script with the command npm run start:dev should display a result like this in the terminal:


643 accounts are live
done


Conclusion

Web scraping is an integral part of data extraction used in data science. The Web Scraper IDE by Bright Data does all the heavy lifting in the background, presenting only the relevant data for your use.


This article walked you through using the Web Scraper IDE and building a custom script to query large datasets about companies without fear of being blocked by the bot-protection measures those companies use to guard their data.

