LinkedIn is full of useful data, from high-profile leads and skilled job candidates to huge numbers of job listings and business opportunities. All this information can be accessed by hand, as it's publicly available to users and non-users alike. But what if we want to access this data on a larger scale?

Today, we want to show you how you can harness the power of web scraping to pull data from LinkedIn job listings.

Is It Legal to Scrape LinkedIn Data?

Yes, scraping publicly available LinkedIn pages is legal, as shown in the 2019 LinkedIn vs. HiQ case. However, LinkedIn has appealed the decision to the US Supreme Court (SCOTUS) without getting any response back, as far as we know. Until we hear back from SCOTUS, "the decision by the 9th Circuit remains good law."

We'll keep an eye on this case and update this article as soon as anything changes, and we recommend you do the same.

Scraping LinkedIn Job Postings with JavaScript

Although scraping LinkedIn is legal, we clearly understand LinkedIn itself doesn't want to be scraped, so we want to be respectful when building our bot.

One thing we'll avoid on this project is using a headless browser to log in to an account and access what would be considered private data. Instead, we'll focus on scraping public LinkedIn data that doesn't require us to get past any login screen.

We'll go to LinkedIn's public job listing page and use Axios and Cheerio to download and parse the HTML, extracting the job title, company, location, and URL of each listing.

1. Install Node.js, Axios, and Cheerio

If you haven't already, you'll need to download and install Node.js and NPM. The latter will help us install the rest of our dependencies. Because we're building our project on an M1 Mac, we picked the ARM64 version.

After installing those, let's create a folder for our project called "Linkedin-scraper-project" and open it in VS Code (or your editor of preference). Pull up the terminal and create a new project using:

```
npm init -y
```

Note: If you want to verify that the installation went well, you can use `node -v` and `npm -v`.

We're now ready to install our dependencies using the following commands:

- Axios: `npm install axios`
- Cheerio: `npm install cheerio`

Or we could install both with one command: `npm install axios cheerio`.

To get the ball rolling, let's create an index.js file inside our folder and import our dependencies at the top:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');
```

2. Use Chrome DevTools to Understand LinkedIn's Site Structure

Before writing anything else, we need a plan for how we'll access the data. To build one, we'll open https://www.linkedin.com/ in our browser and see what we get.

Note: If you're logged in automatically, first sign out of your account and then go to LinkedIn's homepage to follow along.

LinkedIn immediately provides us with a search form we can use to find the exact job we're looking for and narrow our search to a location. Let's look for email developer jobs as an example.

At first glance, every job seems to be inside a self-contained card we should be able to target to extract all the information we need.

After inspecting the page, we can confirm that every job is inside a `<li>` element.

If we take a closer look at the `<li>` element, we can find the job title, company name, location, and the URL of the listing. Everything is well organized, with a clear structure we can follow.

However, there's a catch. After going down the page, it turns out LinkedIn uses infinite scrolling to populate the page with more jobs instead of traditional numbered pagination.

To deal with infinite pagination, we could use a headless browser like Puppeteer to scrape the first batch of jobs, scroll down the page, wait for more jobs to load, and scrape the new listings. But if you paid attention to this tutorial's heading, we're not going to use a headless browser.

Instead, let's try to be smarter than the page.

3. Use DevTools' Network Tab

One thing to keep in mind when dealing with infinite scrolling is that the new data needs to come from somewhere. So if we can access the source, we can access the data.

With DevTools open, go to the Network tab and reload the page. We'll now see all the different requests the website is making. We'll focus on the Fetch/XHR requests, because this is where we'll see the new data being pulled.

After we scroll down to the last job, the page sends a new request to a `seeMoreJobPostings` URL to get more data. If we mimic this request, we'll get access to the same data. Let's test it in our browser by copying and pasting the URL.

Awesome, this page uses the same structure, so there shouldn't be any issues scraping the data. But let's take our experimentation a little further.

This URL has the data for the second page, but we want to access the first too. Let's compare URL 1 against URL 2 to see how they change:

URL 1: https://it.linkedin.com/jobs/search?keywords=email%20developer&location=United%20States&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0

URL 2: https://it.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%20developer&location=United%20States&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=25

There's a lot going on in those URLs. However, what we care about the most is the last bit. Let's try URL 2 with the parameter `start=0` in our browser to see if we can access the data.

That did the trick! The first job on both pages is the same.

Experimenting is crucial for web scraping, so here are a few more things we tried before settling on this solution:

- Changing the `pageNum` parameter doesn't change anything on the page.
- The `start` parameter increases by 25 for every new URL. We found this out by scrolling down the page and comparing the fetch requests sent by the site itself.
- Changing the `start` parameter by 1 (so `start=2`, `start=3`, and so on) will change the resulting page by pushing the previous job listings off the page, which is not what we want.
- The current last page is `start=975`. It goes to a 404 page when hitting 1000.

Having our initial URL, we can move to the next step.

4. Parse LinkedIn Using Axios and Cheerio

We'll tweak our initial URL slightly to access the US version of the site (www.linkedin.com instead of it.linkedin.com).

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

let url = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=0'

axios(url)
  .then(response => {
    const html = response.data;
    // Log the raw HTML to confirm the request worked
    console.log(html)
  })
```

To check that our Axios request is working, we `console.log()` our html variable.

Awesome, we were able to download the raw HTML, and we can now pass it to Cheerio for parsing.

Replace the `console.log()` call with the next snippet to create a Cheerio object we can then query to extract our target data.

```javascript
const $ = cheerio.load(html);
```

5. Pick Your Selectors

Cheerio implements a subset of jQuery to select elements, so if you're already familiar with jQuery's syntax, you'll find yourself right at home.

To pick the right selectors, let's go back to the opened URL and take note of the attributes we can use. We already know that all the data we need is inside individual `<li>` elements, so let's target those first and store them in a constant.

```javascript
const jobs = $('li');
```

With all the listings stored inside jobs, we can now go one by one and extract the specific bits of data we're looking for.

We can test our selectors right inside DevTools to avoid sending unnecessary requests to the server. If we inspect the job title, we can see that it is inside an `<h3>` element with the class base-search-card__title.

Let's go to the Console tab inside DevTools, call the `document.querySelectorAll()` method, and pass it `'.base-search-card__title'` (where the dot means class) as the argument to select all the elements with that class.

It returns a NodeList of 25, which matches the number of jobs on the page. We can do the same thing for the rest of our targets. Here they are:

- Job title: `'h3.base-search-card__title'`
- Company: `'h4.base-search-card__subtitle'`
- Location: `'span.job-search-card__location'`
- URL: `'a.base-card__full-link'`

6. Iterate Through the Node List Using .each()

With our selectors defined, we can now iterate through the list stored in the jobs variable to extract all job details using the `.each()` method.

```javascript
jobs.each((index, element) => {
  const jobTitle = $(element).find('h3.base-search-card__title').text()
  console.log(jobTitle)
})
```

We bet you've realized how important it is for us to test everything, so before trying to grab every element, we'll begin by extracting just the job title and logging it to our console. Type `node index.js` in the terminal to run it.

Huh… it seems like there's a lot of white space around the title, and it's also getting extracted by our `.text()` method. Not to worry, we can use the `.trim()` method to clean it up.

Much better! Now we know it's safe to add the rest of the selectors to our script.

```javascript
const company = $(element).find('h4.base-search-card__subtitle').text().trim()
const location = $(element).find('span.job-search-card__location').text().trim()
const link = $(element).find('a.base-card__full-link').attr('href')
```

For the job's link, we're not actually interested in grabbing the text inside the element but the value of the `href` attribute; in other words, the URL itself. To do so, we can just call the `.attr()` method and pass it the name of the attribute we want the value from as the argument.

7. Push Your Data Into an Array for Formatting

Right now, our script logs a lot of unorganized strings to our console, which makes the data really hard to use. The good news is that we can use a simple method to push our data into an empty array and have everything neatly formatted automatically.

First, we'll create an empty array called linkedinJobs before the Axios call:

```javascript
const linkedinJobs = [];
```

Then, we'll add the following code snippet right after const link:

```javascript
linkedinJobs.push({
  'Title': jobTitle,
  'Company': company,
  'Location': location,
  'Link': link,
})
```

And log linkedinJobs to the console. How does it look now?

All our data is clearly labeled and formatted, ready for us to export it into a JSON or CSV file. Still, this is not all the data we want. Our next step is to navigate to the rest of the pages and repeat the same process.

8. Scraping All Pages with a for loop

There are many ways to navigate to the next page, but with our current knowledge of the website, the easiest way is to increase the start parameter in the URL by 25 to display the next 25 jobs until there are no more results, and a for loop meets this need perfectly.

To refresh your memory, here's the for loop's syntax:

```javascript
for (statement 1; statement 2; statement 3) {
  //code block to be executed
}
```

Let's build this part first by adding our custom statements:

```javascript
for (let pageNumber = 0; pageNumber < 1000; pageNumber += 25) {
}
```

- Our starting point will be 0, as that's the first value we want to pass to the `start` parameter.
- Because we know that when hitting 1000 there won't be any more results, we want the code to run as long as `pageNumber` is less than 1000.
- Finally, after every iteration of the code, we want to increase `pageNumber` by 25, effectively moving to the next page.

Before moving the rest of the code inside the loop, we'll need to add our pageNumber variable to the URL:

```javascript
let url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=${pageNumber}`
```

And now we add the rest of our code to the loop, using the pageNumber variable as the value for the `start` parameter. Move the linkedinJobs array inside the loop too, so that each page collects only its own 25 rows; otherwise every page would write out all the rows gathered so far and the output would fill up with duplicates.

9. Write Your Data to a CSV File

It has been a long process, but we're almost done with our scraper. Yet having all that data logged to our console isn't exactly the best way to store it.

To cut corners, we'll be using the Objects-to-CSV package. It has really detailed documentation if you'd like to go deeper, but in simple terms, it will convert our array of JavaScript objects (linkedinJobs) into a CSV format we can save to our machine.

First, we'll install the package using `npm install objects-to-csv` and add it to the top of our project.

```javascript
const ObjectsToCsv = require('objects-to-csv');
```

We can now use the package right after closing our `jobs.each()` method:

```javascript
const csv = new ObjectsToCsv(linkedinJobs)
csv.toDisk('./linkedInJobs.csv', { append: true })
```

"The keys in the first object of the array will be used as column names," so it's important that we make them descriptive when using the `.push()` method.

Also, because we want to loop through several pages, we don't want our CSV to be overwritten every time but to have the new data added below. To do so, all we need to do is set `append` to `true`. It will only add the headers once and keep updating the file with the new data.

10. Run Your Code [Full LinkedIn Scraper Code]

If you've followed along, here's what your finished code should look like:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');
const ObjectsToCsv = require('objects-to-csv');

for (let pageNumber = 0; pageNumber < 1000; pageNumber += 25) {
  // One array per page, so each append writes only that page's rows
  const linkedinJobs = [];
  let url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=${pageNumber}`;

  axios(url)
    .then(response => {
      const html = response.data;
      const $ = cheerio.load(html);
      const jobs = $('li')

      jobs.each((index, element) => {
        const jobTitle = $(element).find('h3.base-search-card__title').text().trim()
        const company = $(element).find('h4.base-search-card__subtitle').text().trim()
        const location = $(element).find('span.job-search-card__location').text().trim()
        const link = $(element).find('a.base-card__full-link').attr('href')

        linkedinJobs.push({
          'Title': jobTitle,
          'Company': company,
          'Location': location,
          'Link': link,
        })
      });

      const csv = new ObjectsToCsv(linkedinJobs)
      // Return the promise so write errors reach .catch()
      return csv.toDisk('./linkedInJobs.csv', { append: true })
    })
    .catch(console.error);
}
```

To run it, go to your terminal and type `node index.js` (or the name of your file).

Congratulations, you've scraped up to 1,000 job listings (40 pages of 25 results each) in a few seconds. For next steps, export this data to Excel and filter the Title column for only those rows containing the keyword "email developer", or filter by location.

Wrapping Up: Here's a Challenge for You

We've covered a lot during this tutorial, but there are a few more things we challenge you to try:

- Make the script filter jobs by title, so you only extract those cards containing the keyword "email developer" or "html email".
- Right now the script is going through the pages too fast. Add a buffer between requests so it only sends a new request after 5 seconds. It will help you protect your IP.
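If you'd like a head start on the two challenges above, here is one possible sketch, not the only way to solve them. The helper names `sleep`, `filterJobsByTitle`, and `scrapeWithDelay` are hypothetical, and `fetchPage` stands in for whatever function you write to download and parse a single page (for example, the Axios and Cheerio code from step 10 wrapped so that it returns the page's array of job objects). Stopping early on an empty page is our own assumption, not something the site documents.

```javascript
// Hypothetical helpers for the two challenges; none of these names come from a library.

// Resolve after `ms` milliseconds so we can pause between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Keep only jobs whose Title contains at least one keyword (case-insensitive).
function filterJobsByTitle(jobs, keywords) {
  const needles = keywords.map((keyword) => keyword.toLowerCase());
  return jobs.filter((job) => {
    const title = (job.Title || '').toLowerCase();
    return needles.some((needle) => title.includes(needle));
  });
}

// Request pages one at a time, waiting `delayMs` before every request after the first.
// `fetchPage(start)` is assumed to return a promise of an array of job objects
// shaped like { Title, Company, Location, Link } for that `start` offset.
async function scrapeWithDelay(fetchPage, { maxStart = 1000, step = 25, delayMs = 5000 } = {}) {
  const results = [];
  for (let start = 0; start < maxStart; start += step) {
    if (start > 0) await sleep(delayMs);
    const jobs = await fetchPage(start);
    if (jobs.length === 0) break; // assumption: an empty page means no more results
    results.push(...jobs);
  }
  return results;
}
```

With these in place, you could call `scrapeWithDelay(fetchPage)` once, pass the result through `filterJobsByTitle(allJobs, ['email developer', 'html email'])`, and write the filtered array to CSV in a single `toDisk()` call. Because the requests run one after another instead of all at once, the 5-second pause actually spaces them out.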
*Previously published here.*
