Large organizations invest heavily in building their applications and often frown on developers extracting data from their websites. That is why many of them publish rules, keyed by user agent, that state what crawlers are permitted to access. On most sites, you can find these rules in the robots.txt file at the root of the live URL, as in this link:

https://www.amazon.com/robots.txt

In this article, you'll learn how to use Bright Data's Web Scraper integrated development environment (IDE) to scrape datasets at scale with its ready-made functions and coding templates.

Benefits of the Web Scraper IDE

These are some of the benefits of using Bright Data's Web Scraper IDE:

The IDE is accessible from within the platform
It is backed by Bright Data's proxy infrastructure, which offers scalability and accuracy in web scraping
Its code templates help to speed up development
It incorporates the Web Unlocker capability to avoid captchas and blocking

What is Bright Data?

Bright Data is a proxy network that helps you turn websites into structured data. To get started with the platform, create an account. Check out this resource to learn more about Bright Data:

https://www.youtube.com/watch?v=YzoLTalL6Uo&embedable=true

Working with Web Scraper IDE

On your account dashboard, click the Datasets and Web Scraper IDE icon, then select the Get started button to open the template window.

The new window opens a dialog box where you can select a ready-made template or create a collector from scratch. Select the eBay discovery and PDP option, and the page opens with the collector code.

Now scroll down the page, and under the input tab, pass in the name of a product you want to analyze and extract data for. Once done, click the Preview button to run the preview and start the extraction.

PS: Note that you can also enter your own scripts in the Interaction code section.

Looking at the output tab after running the preview, the result from the eBay website is organized into fields such as the product_url, title, image, and price of the product, and so on.

Saving the collector

To save the collector, click on the Finish editing button to open the configuration page.

Initiate the collector by API

Under the My Scrapers tab, let's initiate this project and work with the scripts provided by clicking the Initiate by API button.

Creating authorization token

An API token authorizes your requests and identifies you as the account's rightful owner. Click on the Account settings menu at the bottom left of the window to create an API token.

Upon adding the API token, you will receive a verification code; enter that secret code to confirm. Once that is done, copy your API token key right away, as you won't be able to retrieve it later unless you create a new one.

Return to the New collector page, and copy the script for your operating system (OS) into your terminal. Make sure to replace API_TOKEN, after the word Bearer, with the key you copied in the previous step.

In your command line interface or terminal, the command should look something like this:

curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -d '[{"keyword":"ralph lauren polo shirt","count":10,"location":"","condition":"New unused"}]' "https://api.brightdata.com/dca/trigger?collector=c_liopmjh61f3o3lz7dz&queue_next=1"

Sending this request makes the code in the Result API section of the New collector page active. Once again, copy and paste that code into the CLI tool.

PS: Remember to put your API token key in place of the value API_TOKEN.

curl "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod" -H "Authorization: Bearer API_TOKEN"

Run the script in the CLI. While the dataset is being collected, the response is an object whose status reads building, together with a message.

If the building response continues to show, retry sending the request. When it succeeds, you should see the result object with the scraped data.

Using Postman

Let's also use Postman to get the response for the Result API. Postman is an API platform for building, publishing, monitoring, testing, and documenting APIs. If you do not have Postman, download it before continuing.

Open the Postman app and input these values:

In the request URL field, pass in the Result API URL and select the GET method
Click the Authorization tab, select Bearer Token from the Type dropdown, and pass in your token value
Click on the Send button to send the request

If the request is successful, you should see a status of 200 in the response section and an array of objects for the scraped eBay data.

Creating a Node Server

Node is a JavaScript runtime environment that allows the execution of JavaScript code outside the web browser, enabling developers to build server-side applications and command-line tools.

Let's create a web server.

Initializing the project in the terminal requires the package manager, npm, which is installed automatically with Node.js on your local machine. Check that Node is installed using this command:

node --version

It displays the current version of Node.

Create a new directory for the project (it is named datasets here) and change into it:

cd datasets
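Before continuing with the project setup, note that the Result API retry step described earlier does not have to be repeated by hand. The following is a minimal sketch, assuming Node 18 or later (for the global fetch). The function name waitForDataset, the delay, and the attempt limit are illustrative choices and not part of Bright Data's API; the only behavior taken from the article is that an unfinished collection returns an object whose status reads building.

```javascript
// Hypothetical helper (not part of Bright Data's tooling): polls a Result API
// URL until the response is no longer the "building" status object.
// `fetchImpl` is injectable so the function can be tried without network access.
async function waitForDataset(url, token, options = {}) {
  const {
    fetchImpl = fetch, // global fetch, available in Node 18+
    delayMs = 5000, // pause between attempts (assumed value)
    maxAttempts = 12, // give up after this many tries (assumed value)
  } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchImpl(url, {
      headers: { Authorization: `Bearer ${token}` },
    });
    if (!response.ok) {
      throw new Error(`Result API responded with HTTP ${response.status}`);
    }
    const body = await response.json();

    // An object with status "building" means the collector is still running;
    // anything else is treated as the finished result.
    const stillBuilding =
      !Array.isArray(body) && body !== null && body.status === "building";
    if (!stillBuilding) {
      return body;
    }

    // Wait before the next attempt, but not after the final one.
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Dataset still building after ${maxAttempts} attempts`);
}

// Usage, inside an .mjs file (the id is the one from the curl example;
// reading the token from an API_TOKEN environment variable is an assumption):
// const rows = await waitForDataset(
//   "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod",
//   process.env.API_TOKEN
// );
```

Capping the number of attempts keeps a stalled collection from looping forever, and passing the token in as an argument keeps it out of the source file.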

Then initialize the project with the command:

npm init -y

The -y flag accepts the defaults, which look like this in package.json:

{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  ...
}

Install the following packages:

npm install -D nodemon

Nodemon monitors your source code for changes and automatically restarts the server.

npm install csv-parse

The csv-parse package is a parser that converts CSV text input into arrays or objects.

Now, update the scripts section in the package.json file to this:

{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "start": "node index.mjs",
    "start:dev": "nodemon index.mjs"
  },
  ...
}

Next, create a new file in the root directory with the command:

touch index.mjs

To test this file, write a basic JavaScript script and run the server with the following command:

npm run start:dev

Social Media Data from Bright Data

Scraping large datasets yourself takes a lot of effort and work with technologies like Node or Python. One way around this is to use a platform like Bright Data to obtain the information you need and get to your results sooner.

Let's get a dataset for the social media platform Instagram from Bright Data, with these steps:

Sign up for a Bright Data account.
Go to https://brightdata.com/cp/datasets/ or select the Dataset Marketplace on the Datasets & Web Scraper IDE.
Open the Dataset Marketplace, and under Categories, select Instagram.com from the Social media dropdown.
Click on View dataset and download the sample dataset in CSV format.

Make sure to save the dataset as instagram.csv in the root directory of the Node web server. Your folder structure should look something like this:

.
└── datasets
    ├── node_modules
    ├── instagram.csv
    ├── package-lock.json
    ├── package.json
    └── index.mjs

Reading CSV datasets in Node.js

In this section, Node will read the comma-separated values (CSV) dataset from Bright Data.

Update the index.mjs file with the code:

import { parse } from "csv-parse";
import { createReadStream } from "node:fs";

const instagramAccount = [];

const isInstagramAccount = (info) => {
  return (
    info["posts_count"] > 300 &&
    info["followers"] > 6000 &&
    info["biography"] !== "" &&
    info["posts"] !== ""
  );
};

createReadStream("instagram.csv")
  .pipe(
    parse({
      columns: true,
    })
  )
  .on("data", (data) => {
    if (isInstagramAccount(data)) {
      instagramAccount.push(data);
    }
  })
  .on("error", (err) => {
    console.log("error", err);
  })
  .on("end", () => {
    console.log(`${instagramAccount.length} accounts are live`);
    console.log("done");
  });

The code above does the following:

createReadStream(): Opens the file as a stream and reads the data in it
isInstagramAccount: Callback function used to filter the data from the actual CSV file
pipe: Connects two streams, passing the readable stream source into a writable destination, which here is parse()
columns: true: Returns each row in the CSV file as a JavaScript object with key-value pairs rather than just an array of values
.on: Chained event handlers that push each matching row into the empty instagramAccount array, display an error if one occurs, show the number of Instagram accounts present, and finally print done when the script finishes

Running the script with the command npm run start:dev should display a result like this in the terminal:

643 accounts are live
done

Conclusion

Web scraping is an integral part of the data extraction used in data science. The Web Scraper IDE by Bright Data does the heavy lifting in the background and presents only the relevant data for your use.

This article walked you through using the Web Scraper IDE and building a custom script to query large datasets of companies without fear of getting blocked by the bots designed to protect a company's data.

Resources

CSV Parser for Node.js

Using Scraping Browser and GPT for Actionable Product Insights

Scraping the unscrapable in Python using Playwright
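As a closing note on the Reading CSV datasets section: with columns: true and no casting option, csv-parse passes every field to the handler as a string, so the numeric comparisons in isInstagramAccount rely on JavaScript's implicit string-to-number coercion. The sketch below is a variant of that filter, not the article's original code, that converts the values explicitly. The helper name toCount and the sample rows are made up for illustration.

```javascript
// Hypothetical helper: turn a CSV field into a number, or NaN if it is blank
// or not numeric. Any comparison against NaN is false, so such rows are
// filtered out instead of being coerced silently.
const toCount = (value) => {
  const text = String(value ?? "").trim();
  return text === "" ? NaN : Number(text);
};

// Same thresholds as the article's filter, with explicit conversion.
const isInstagramAccount = (info) =>
  toCount(info.posts_count) > 300 &&
  toCount(info.followers) > 6000 &&
  info.biography !== "" &&
  info.posts !== "";

// Made-up rows shaped like the objects csv-parse emits with `columns: true`.
const sampleRows = [
  { posts_count: "512", followers: "10400", biography: "Travel", posts: "[...]" },
  { posts_count: "120", followers: "90000", biography: "Food", posts: "[...]" },
  { posts_count: "", followers: "7000", biography: "Art", posts: "[...]" },
];

const matches = sampleRows.filter(isInstagramAccount);
console.log(`${matches.length} of ${sampleRows.length} sample rows pass the filter`);
```

Only the first sample row clears both thresholds; the second has too few posts and the third has a blank posts_count.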
