Web scraping collects unstructured data from a website and extracts it into a more readable, structured format such as JSON or CSV. Organizations set guidelines on which endpoints may be scraped.
When scraping a website for personal use, it can be stressful to change the code manually every time something breaks, as most big-brand websites want people to refrain from scraping their public data. Restrictions and obstacles that might arise include CAPTCHAs, user agent blocking (allowed and disallowed endpoints), IP blocking, and the need to set up a proxy network.
A practical use case of web scraping is notifying users of price changes for an item on sites like Amazon, eBay, etc.
In this article, you will learn how to use Bright Data’s Scraping Browser, whose built-in unlocking capabilities let you access websites at scale without being blocked.
Test and run the complete code in this Codesandbox
You will need the following to complete this tutorial:
Bright Data is a data collection and aggregation service with a massive network of IP addresses and proxies for scraping information off a website, which gives it the resources to avoid detection by the bots that companies use to prevent data scraping.
In essence, Bright Data does the heavy lifting in the background, backed by the large datasets available on the platform, so you need not worry about being blocked or about gaining access to website data.
A headless browser is a browser that operates without a graphical user interface (GUI). Modern web browsers such as Google Chrome, Safari, Brave, and Mozilla Firefox all have a graphical interface for interactivity and displaying visual content. A headless browser instead runs in the background, driven by scripts that developers write or by commands issued from the command line interface (CLI).
Using a headless browser is valuable for web scraping because it lets you extract data from public websites by simulating user behavior.
Headless browsers are suitable for the following:
Puppeteer is a Node.js library that controls a headless browser such as Chrome. The following are some of the benefits of using Puppeteer in web scraping:
Create a new folder for this app, and run the command below to initialize a Node.js project.
npm init -y
The command initializes the project and creates a package.json file that holds the project information and, later, its dependencies. The -y flag accepts all the defaults upon initialization of the app.
With the initialization complete, let’s install the nodemon dependency with this command:
npm install -D nodemon
Nodemon is a tool that automatically restarts the Node application when a file changes.
In the package.json, update the scripts object with this code:
package.json
{
  ...
  "scripts": {
    "start": "node index.js",
    "start:dev": "nodemon index.js"
  },
  ...
}
Next, create a file, index.js, in the directory's root, which will be the entry point for writing the script.
The other package to install is puppeteer-core, the automation library without the bundled browser, which is used when connecting to a remote browser.
npm install puppeteer-core
Create an account on Bright Data to access all its services. For this project, the focus is on the Scraping Browser functionality.
On your admin dashboard, click Proxies and Scraping Infra.
Scroll to the bottom of the page and select the Scraping Browser from the proxy products listed. After that, click the Get started button.
On opening the tool, give the proxy a name and click the Add Proxy button. When prompted about creating a new zone, select Yes.
The next screen should look something like this, with the host, username, and password displayed.
Now, click on the button </> Check out code and integration examples and on the next screen, select Node.js as the language of choice for this app.
Environment variables keep secret keys and credentials out of the source code. The file that holds them should not be shared, hosted, or pushed to GitHub, to prevent unauthorized access.
Before creating the .env file in the root of the directory, let’s install the dotenv package with this command:
npm install dotenv
Copy-paste this code into the .env file, and replace the entire value in the quotation marks with the values from your Access parameters tab:
.env
UNAME="<user-name>"
HOST="<host>"
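If either value is missing, or a placeholder such as <host> is left in place, the script would only fail later with a less obvious connection error. As an optional safeguard, a check along the following lines could run before connecting. This is a minimal sketch, not part of dotenv or of Bright Data's examples: the requireEnv and loadCredentials helpers are hypothetical names, and the only assumption is the two variable names used in this tutorial.

```javascript
// Hypothetical helper: returns the value of one environment variable,
// or throws a descriptive error when it is missing, empty, or still a placeholder.
function requireEnv(name, env = process.env) {
  const value = typeof env[name] === "string" ? env[name].trim() : "";
  if (value === "") {
    throw new Error(`Missing environment variable ${name}. Check your .env file.`);
  }
  // The .env template in this tutorial uses values like "<host>" as placeholders.
  if (value.startsWith("<") && value.endsWith(">")) {
    throw new Error(`Environment variable ${name} still holds a placeholder.`);
  }
  return value;
}

// Hypothetical helper: reads both values the script needs in one call.
function loadCredentials(env = process.env) {
  return {
    auth: requireEnv("UNAME", env),
    host: requireEnv("HOST", env),
  };
}
```

Passing the environment as a parameter keeps the check easy to try with a plain object, without touching the real process.env.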
Back to the entry point file, index.js, copy-paste this code:
index.js
const puppeteer = require("puppeteer-core");
require("dotenv").config();

// Credentials and host come from the .env file
const auth = process.env.UNAME;
const host = process.env.HOST;

async function run() {
  let browser;
  try {
    // Connect to the remote Scraping Browser over WebSocket
    browser = await puppeteer.connect({
      browserWSEndpoint: `wss://${auth}@${host}`,
    });
    const page = await browser.newPage();
    // Allow up to 2 minutes for navigation
    page.setDefaultNavigationTimeout(2 * 60 * 1000);
    await page.goto("http://lumtest.com/myip.json");
    const html = await page.content();
    console.log(html);
  } catch (e) {
    console.error("run failed", e);
  } finally {
    // Close the browser whether or not the run succeeded
    await browser?.close();
  }
}

// Run only when this file is executed directly
if (require.main === module) run();
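The code above does the following:
- Imports the modules puppeteer-core and dotenv
- Reads the secret values into the host and auth variables
- Defines the asynchronous run function
- Connects puppeteer to the remote browser, passing the WebSocket URL in the object using the key browserWSEndpoint
- setDefaultNavigationTimeout sets a navigation timeout of 2 minutes
- Navigates to the URL with the goto function, and afterward, gets the page's content with the page.content() method
- Closes the browser in the finally block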
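Because page.content() returns the page as serialized HTML, the JSON served by the test URL arrives wrapped in markup. The sketch below shows one way the data could be recovered as an object. It assumes the browser places the raw JSON text inside a pre element, which is how Chrome typically displays a JSON response. The extractJson helper is a hypothetical name, not a Puppeteer API.

```javascript
// Hypothetical helper: pulls a JSON document out of the HTML string that
// page.content() returns when the browser displays a raw JSON response.
function extractJson(html) {
  // Assumption: the JSON text sits inside a <pre> element.
  // If no <pre> is found, treat the whole input as the JSON text.
  const match = html.match(/<pre[^>]*>([\s\S]*?)<\/pre>/i);
  const raw = match ? match[1] : html;
  // Undo HTML escaping; "&amp;" is handled last to avoid double-unescaping.
  const text = raw
    .replace(/&quot;/g, '"')
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
  return JSON.parse(text.trim());
}
```

Inside run, console.log(html) could then become console.log(extractJson(html)). JSON.parse throws if the page did not return valid JSON, and the existing catch block would report that error.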
If you want to expand this project, you can take screenshots of the web pages in PNG format or save the pages as PDF files.
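As a starting point for that extension, here is a minimal sketch built on Puppeteer's page.screenshot() and page.pdf() methods. The capturePage function and its baseName parameter are hypothetical names chosen for illustration. Whether PDF generation works over a remote Scraping Browser connection is an assumption worth verifying against the documentation.

```javascript
// Sketch: save the current page as a PNG screenshot and as a PDF file.
// `page` is a Puppeteer Page object; `baseName` is a hypothetical file prefix.
async function capturePage(page, baseName = "page") {
  const pngPath = `${baseName}.png`;
  const pdfPath = `${baseName}.pdf`;
  // fullPage captures the whole scrollable page, not just the viewport
  await page.screenshot({ path: pngPath, fullPage: true });
  // PDF output relies on the browser's print support (assumed to be available)
  await page.pdf({ path: pdfPath, format: "A4" });
  return { pngPath, pdfPath };
}
```

In the run function, it could be called after the goto call, for example await capturePage(page, "myip"), which would write myip.png and myip.pdf to the current working directory.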
Check out the documentation to learn more.
Scraping the web with Bright Data’s infrastructure speeds up the process for your use case, as you do not need to write your scripts from scratch. That work is already taken care of for you.
Try it today to explore the benefits of Bright Data over traditional web scraping tools, which are restricted by proxy networks and make it challenging to work with large datasets.
Scraping Browser documentation
Scrape at scale with Bright Data Scraping Browser