Large organizations invest heavily in building their applications and often frown on developers extracting data from their websites. That is why many of them publish rules, addressed to specific user agents, that tell crawlers what is permitted.
On most sites, you can find these rules in the robots.txt file at the root of the live URL, as in the link below:
https://www.amazon.com/robots.txt
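To make the idea concrete, here is a minimal sketch of how the rules in a robots.txt file might be grouped by user agent. The parseRobots helper and the sample text are hypothetical (the sample is not Amazon's actual file), and the sketch handles only the User-agent, Allow, and Disallow directives; a real crawler should follow the full Robots Exclusion Protocol.

```javascript
// Minimal sketch: group Allow/Disallow rules by user agent.
// parseRobots is a hypothetical helper, not part of any library.
function parseRobots(text) {
  const groups = {};   // user agent -> { allow: [], disallow: [] }
  let agents = [];     // user agents the current rule group applies to
  let inRules = false; // true once a rule has followed the User-agent lines

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (inRules) {
        // A User-agent line after rules starts a new group.
        agents = [];
        inRules = false;
      }
      agents.push(value);
      if (!groups[value]) groups[value] = { allow: [], disallow: [] };
    } else if (field === "allow" || field === "disallow") {
      inRules = true;
      for (const agent of agents) groups[agent][field].push(value);
    }
  }
  return groups;
}

// Made-up sample for illustration only.
const sample = [
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /public/",
  "",
  "User-agent: ExampleBot",
  "Disallow: /",
].join("\n");

const rules = parseRobots(sample);
console.log(rules["*"]);
console.log(rules["ExampleBot"]);
```

In Node 18 or later, the text for a live site could come from the global fetch function, for example by calling fetch on the robots.txt URL and passing the response text to parseRobots.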
In this article, you’ll learn about using Bright Data’s Web Scraper integrated development environment (IDE) to scrape datasets at scale using its ready-made functions and coding templates.
These are some of the benefits of using Bright Data’s Web Scraper IDE:
Bright Data is a proxy network that helps you turn websites into structured data. To get started with the platform, create an account.
Check out this resource to learn more about Bright Data.
On your account dashboard, click the Datasets and Web Scraper IDE icon, and afterward, select the Get started button to open the template window.
The new window opens a dialog box where you can select a pre-existing dataset template to work with or create one from scratch.
Select the eBay discovery and PDP options, and the page should look something like this with the collector code.
Now scroll down the page, and under the input tab, enter the name of the product whose data you want to extract. Once done, click the Preview button to run the preview and start the extraction.
PS: You can also enter your own scripts in the Interaction code section.
After the preview runs, the output tab shows the result from the eBay website, formatted into fields such as product_url, title, image, and price.
Saving the collector
To save the collector, click on the Finish editing button to open the configuration page as seen below:
Initiate the collector by API
Under the My Scrapers tab, let’s initiate this project and work with the scripts provided by clicking the Initiate by API button.
Creating authorization token
An API token authorizes your requests and identifies you as the account's rightful owner.
Click on the Account settings menu at the bottom left of the window to create an API token.
After you add the API token, you will be asked to verify the action; enter the secret code you receive.
Once that is done, copy your API token right away, as you won't be able to retrieve it later; if you lose it, you will need to create a new one.
Return to the New collector page, and copy the script for your operating system (OS) into your terminal. Make sure to replace API_TOKEN, which follows the word Bearer, with the key you copied in the previous section.
In your command line interface or terminal, the command should look something like this:
curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -d '[{"keyword":"ralph lauren polo shirt","count":10,"location":"","condition":"New unused"}]' "https://api.brightdata.com/dca/trigger?collector=c_liopmjh61f3o3lz7dz&queue_next=1"
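If you would rather trigger the collector from Node than from curl, the request could be assembled along these lines. This is a sketch, not Bright Data's official client: buildTriggerRequest is a hypothetical helper, the API_TOKEN environment variable name is an assumption, and the URL, headers, and body fields are copied from the curl command above.

```javascript
// Sketch: build the same request the curl command sends.
// buildTriggerRequest is a hypothetical helper, not a Bright Data API.
function buildTriggerRequest(apiToken, collectorId, inputs) {
  const url = new URL("https://api.brightdata.com/dca/trigger");
  url.searchParams.set("collector", collectorId);
  url.searchParams.set("queue_next", "1");
  return {
    url: url.toString(),
    options: {
      method: "POST", // curl switches to POST when -d is present
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(inputs),
    },
  };
}

const request = buildTriggerRequest(
  process.env.API_TOKEN ?? "API_TOKEN", // assumed environment variable name
  "c_liopmjh61f3o3lz7dz",
  [{ keyword: "ralph lauren polo shirt", count: 10, location: "", condition: "New unused" }]
);
console.log(request.url);

// To actually send it (Node 18+ ships a global fetch):
// const response = await fetch(request.url, request.options);
// console.log(await response.json());
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and headers before anything is sent.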
Running the trigger request activates the code in the Result API section of the New collector dashboard page. Copy that code and paste it into the CLI tool as well.
PS: Remember to put your API token key in place of the value API_TOKEN.
curl "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod" -H "Authorization: Bearer API_TOKEN"
Run the script in the CLI. While the dataset is still being collected, the response is an object with a status that reads building, along with a message.
If this response keeps showing, retry the request. When it succeeds, you should see this result object.
As with the object displayed above, let's use Postman to get the response from the Result API.
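The retry step can also be automated. The sketch below polls the Result API until the status is no longer building. It rests on a few assumptions: pollDataset and makeFakeFetch are hypothetical helpers, the in-progress response is taken to be an object with status set to building as described above, and the shape of the successful response (an array of rows here) is a placeholder. The fetch function is passed in so the loop can be tried without network access.

```javascript
// Sketch: poll the Result API until the dataset is no longer building.
// pollDataset is a hypothetical helper. Pass the global fetch for real use.
async function pollDataset(fetchFn, datasetId, apiToken, { retries = 5, delayMs = 2000 } = {}) {
  const url = `https://api.brightdata.com/dca/dataset?id=${encodeURIComponent(datasetId)}`;
  for (let attempt = 1; attempt <= retries; attempt++) {
    const response = await fetchFn(url, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const body = await response.json();
    // Assumption: an in-progress job is reported as { status: "building", ... }.
    if (!(body && body.status === "building")) return body;
    // Wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Dataset ${datasetId} still building after ${retries} attempts`);
}

// Stand-in for fetch that reports "building" twice, then returns placeholder rows.
function makeFakeFetch() {
  let calls = 0;
  return async () => ({
    json: async () =>
      ++calls < 3 ? { status: "building", message: "placeholder" } : [{ title: "sample row" }],
  });
}

const pollResult = pollDataset(makeFakeFetch(), "j_liosdy1cdutdi7sod", "API_TOKEN", { delayMs: 10 });
pollResult.then((rows) => console.log(rows));
```

For a real run, replace makeFakeFetch() with the global fetch available in Node 18 or later and supply your own token.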
If you do not have Postman, download it here. Postman is an API platform for building, publishing, monitoring, testing, and documenting APIs. Check this resource article to learn more about Postman and its use.
Open the Postman app and input these values:
Node is a JavaScript runtime environment that allows the execution of JavaScript code outside the web browser, enabling developers to build server-side applications and command-line tools.
Let’s create a web server. Initializing the project in the terminal requires the package manager npm, which is installed automatically with Node.js on your local machine. Check that Node is installed using this command:
node --version
It displays the current version of Node.
cd datasets
npm init -y
The -y flag accepts the defaults, which look like this:
package.json
{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  ...
}
npm install -D nodemon
Nodemon monitors your source code for changes and automatically restarts the server whenever it detects one.
npm install csv-parse
The csv-parse package is a parser for converting CSV text input into arrays or objects.
Now, update the scripts section in the package.json file to this:
{
  "name": "datasets",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "start": "node index.mjs",
    "start:dev": "nodemon index.mjs"
  },
  ...
}
touch index.mjs
To test this file, write a basic JavaScript script and run the server with the following command:
npm run start:dev
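Any small script will do for this check. As one possible example (the message text is arbitrary), index.mjs could contain:

```javascript
// index.mjs: a throwaway script to confirm that nodemon picks up the file.
const greeting = `index.mjs is running on Node ${process.version}`;
console.log(greeting);
```

Saving the file again with a changed message should make nodemon restart and print the new text.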
Scraping large datasets yourself with technologies like Node or Python takes a lot of effort. One way around this is to use a platform like Bright Data to obtain the information you need as quickly as possible.
Let’s get a dataset for the social media platform Instagram from Bright Data, with these steps:
Open the Dataset Marketplace, and under Categories, select Instagram.com from the Social media dropdown.
Make sure to save the dataset in the root directory of the Node web server.
Your folder structure should look something like this:
.
└── datasets
    ├── node_modules
    ├── instagram.csv
    ├── package-lock.json
    ├── package.json
    └── index.mjs
In this section, Node will read the comma-separated values (CSV) dataset downloaded from Bright Data.
Update the index.mjs file with the code:
import { parse } from "csv-parse";
import { createReadStream } from "node:fs";

// Rows that pass the filter below
const instagramAccount = [];

// csv-parse returns every field as a string by default, so convert the
// numeric columns explicitly before comparing them.
const isInstagramAccount = (info) => {
  return (
    Number(info["posts_count"]) > 300 &&
    Number(info["followers"]) > 6000 &&
    info["biography"] !== "" &&
    info["posts"] !== ""
  );
};

createReadStream("instagram.csv")
  .pipe(
    parse({
      columns: true, // use the header row as object keys
    })
  )
  .on("data", (data) => {
    // Keep only the rows that satisfy every condition
    if (isInstagramAccount(data)) {
      instagramAccount.push(data);
    }
  })
  .on("error", (err) => {
    console.error("error", err);
  })
  .on("end", () => {
    console.log(`${instagramAccount.length} accounts are live`);
    console.log("done");
  });
The code above does the following:
createReadStream(): Opens the file as a readable stream and reads the data in it
isInstagramAccount: Callback function used to filter the rows from the CSV file
pipe: Connects two streams, here the readable stream source to the writable destination, parse()
columns: true: Returns each row in the CSV file as a JavaScript object with key-value pairs rather than just an array of values
on: Chained event handlers that push each matching row into the empty array, instagramAccount, display an error if one occurs, show the number of Instagram accounts that passed the filter, and finally print done when the script finishes
Running the script with the command npm run start:dev should display a result like this in the terminal:
643 accounts are live
done
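To see how the filter treats individual rows, here is a small sketch that applies the same conditions to a few made-up rows. The sample values and the account column are hypothetical and exist only for illustration; the values are strings because csv-parse returns every field as a string unless casting is enabled, so the numeric columns are converted with Number before comparing.

```javascript
// Sketch: apply the filter from index.mjs to hand-written rows.
const isInstagramAccount = (info) =>
  Number(info["posts_count"]) > 300 &&
  Number(info["followers"]) > 6000 &&
  info["biography"] !== "" &&
  info["posts"] !== "";

// Made-up rows; "account" is a hypothetical column used as a label.
const sampleRows = [
  { account: "a", posts_count: "512", followers: "12000", biography: "Travel", posts: "[...]" },
  { account: "b", posts_count: "512", followers: "900", biography: "Food", posts: "[...]" },
  { account: "c", posts_count: "1200", followers: "50000", biography: "", posts: "[...]" },
];

// Row "b" has too few followers and row "c" has an empty biography.
const kept = sampleRows.filter(isInstagramAccount).map((row) => row.account);
console.log(kept);
```

Only the first row satisfies all four conditions, so it is the only one kept.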
Web scraping is an integral part of data extraction used in data science. The Web Scraper IDE by Bright Data does all the heavy lifting in the background, presenting only the relevant data for your use.
This article walked you through using the Web Scraper IDE and building a custom script to query large datasets, without fear of being blocked by the bot-protection measures that companies use to guard their data.