
Running a Python Script to Scrape LinkedIn Profiles From Google

by Tuan, January 6th, 2022

Too Long; Didn't Read

LinkedIn is a great place to find leads and engage with prospects. In order to engage with potential leads, you’ll need a list of users to contact. I made a script to search Google for potential LinkedIn users and company profiles.


LinkedIn is a great place to find leads and engage with prospects. In order to engage with potential leads, you’ll need a list of users to contact, but building that list can be hard because LinkedIn actively blocks web scraping tools. That is why I made a script that searches Google for potential LinkedIn user and company profiles instead.

Tools Required

You’ll need Python 2.7+ and a few packages to get started. The only third-party dependency is requests (random, argparse, and re are part of the standard library). Once you have Python installed, run the following command to install it.

pip install requests

LinkedIn Scraper Script

First, we import the packages we need: random to pick a user agent for each request, argparse to read command-line arguments, requests to make the HTTP requests, and re to parse the LinkedIn profiles and links out of the returned HTML.

import random
import argparse
import requests
import re

We create a LinkedinScraper class that tracks and holds the data for each request. The class takes two parameters, keyword and limit. The keyword parameter specifies the search term, and the limit parameter sets the maximum number of links to search for.

class LinkedinScraper(object):
    def __init__(self, keyword, limit):
        """
        :param keyword: a str of keyword(s) to search for
        :param limit: number of profiles to scrape
        """
        self.keyword = keyword.replace(' ', '%20')  # percent-encode spaces for the query string
        self.all_htmls = ""                         # accumulated HTML from every results page
        self.server = 'www.google.com'
        self.quantity = '100'                       # results requested per Google page
        self.limit = int(limit)
        self.counter = 0                            # offset of the current results page
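
For example, given the __init__ above, spaces in the keyword are percent-encoded so the term can be dropped straight into the query string:

scraper = LinkedinScraper('Tesla engineer', 500)
print(scraper.keyword)  # Tesla%20engineer
print(scraper.limit)    # 500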


The LinkedinScraper class has three main functions: search, parse_links, and parse_people.

The search function performs the requests based on the keyword. It first builds a Google-specific query URL from the keyword and the current offset, then makes the request and appends the returned HTML to self.all_htmls.

    def search(self):
        """
        perform the search
        :return: a list of htmls from Google Searches
        """
        # choose a random user agent so the requests look less uniform
        user_agents = [
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1464.0 Safari/537.36',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0) chromeframe/10.0.648.205',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/11.10 Chromium/18.0.1025.142 Chrome/18.0.1025.142 Safari/535.19',
            'Mozilla/5.0 (Windows NT 5.1; U; de; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 Opera 11.00'
        ]
        while self.counter < self.limit:
            headers = {'User-Agent': random.choice(user_agents)}
            # site:linkedin.com/in restricts results to LinkedIn profiles;
            # num=100 asks for 100 results per page, start= pages through them
            url = 'http://google.com/search?num=100&start=' + str(self.counter) + '&hl=en&meta=&q=site%3Alinkedin.com/in%20' + self.keyword
            resp = requests.get(url, headers=headers)
            if "Our systems have detected unusual traffic from your computer network." in resp.text:
                print("Running into captchas")
                return

            self.all_htmls += resp.text
            self.counter += 100  # advance to the next page of 100 results
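
To make the paging concrete, this is the URL the loop requests on its first iteration for the keyword Tesla (the same string construction as in search above):

counter, keyword = 0, 'Tesla'
url = ('http://google.com/search?num=100&start=' + str(counter) +
       '&hl=en&meta=&q=site%3Alinkedin.com/in%20' + keyword)
print(url)
# http://google.com/search?num=100&start=0&hl=en&meta=&q=site%3Alinkedin.com/in%20Tesla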

The parse_links function runs a regex over the accumulated HTML to extract all the LinkedIn links.

    def parse_links(self):
        # Google wraps each result in a redirect URL; capture the LinkedIn path from it
        reg_links = re.compile(r"url=https:\/\/www\.linkedin.com(.*?)&")
        self.temp = reg_links.findall(self.all_htmls)
        results = []
        for regex in self.temp:
            final_url = regex.replace("url=", "")
            results.append("https://www.linkedin.com" + final_url)
        return results
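
As a quick sanity check, here is parse_links run against a hypothetical fragment of a results page (the real Google markup varies and changes over time, so treat this snippet as illustrative only):

scraper = LinkedinScraper('Tesla', 100)
# hypothetical redirect link as it might appear in the results HTML
scraper.all_htmls = '<a href="/url?url=https://www.linkedin.com/in/jane-doe&amp;sa=U">'
print(scraper.parse_links())
# ['https://www.linkedin.com/in/jane-doe']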

Similarly, the parse_people function searches the HTML for each result’s name and title.

    def parse_people(self):
        """
        parse the html for LinkedIn profiles using regex
        :return: a list of names/titles parsed from the search results
        """
        reg_people = re.compile(r'">[a-zA-Z0-9._ -]* -|\| LinkedIn')
        self.temp = reg_people.findall(self.all_htmls)
        print(self.temp)  # debug: show the raw regex matches
        results = []
        for iteration in self.temp:
            # strip the LinkedIn boilerplate and leftover markup around each name
            delete = iteration.replace(' | LinkedIn', '')
            delete = delete.replace(' - LinkedIn', '')
            delete = delete.replace(' profiles ', '')
            delete = delete.replace('LinkedIn', '')
            delete = delete.replace('"', '')
            delete = delete.replace('>', '')
            delete = delete.strip("-")
            if delete != " ":
                results.append(delete)
        return results
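
Run against a hypothetical result title, the cleanup is rough but workable; the regex also captures a stray '| LinkedIn' token that survives as '| ', and names keep some trailing whitespace:

scraper = LinkedinScraper('Tesla', 100)
# hypothetical result title as it might appear in the results HTML
scraper.all_htmls = '<h3 class="r">John Smith - Software Engineer | LinkedIn</h3>'
print(scraper.parse_people())
# ['John Smith ', '| ']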

This is an example of using the class to search for 500 profiles for the Tesla company. It is a minimal sketch of how the pieces fit together; the argparse flag names below are my own and not necessarily those used in the full repo.
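
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Scrape LinkedIn profiles from Google')
    parser.add_argument('-k', '--keyword', default='Tesla', help='keyword(s) to search for')
    parser.add_argument('-l', '--limit', default=500, help='max number of links to search for')
    args = parser.parse_args()

    # e.g. python linkedin_scraper.py -k "Tesla" -l 500
    scraper = LinkedinScraper(keyword=args.keyword, limit=args.limit)
    scraper.search()                    # fetch up to `limit` results, 100 per page
    for link in scraper.parse_links():
        print(link)                     # LinkedIn profile URLs
    for person in scraper.parse_people():
        print(person)                   # names/titles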

This is quite a simple script, but it should be a good starting point. It is missing error and captcha handling for when too many requests are made to Google. I recommend using a Google Search API such as https://goog.io to perform unlimited searches, or the RapidAPI Google Search API to perform the search from any language.

You can find the full code at https://github.com/googio/linkedin_scraper.git

This script is fast, and making too many requests to Google will get your IP blocked, so please use proxies when running it. Alternatively, check out the goog.io API docs at https://goog.io/docs on performing searches without worrying about getting blocked.

