If you’re a webmaster or SEO specialist, you most probably need to do backlink audits regularly. There are situations when you are forced to find toxic backlinks and disavow them. However, it’s very hard to manually export and correlate all the backlink data from Google Search Console.

If the websites you’re working with are substantially large, there would be a lot of clicking and exporting involved to get this data out of GSC. It’s simply not doable.

(Image: Google Search Console - Links Section)

Here is where Python, Beautiful Soup, and Pandas come in: they will allow you to scrape GSC and pull the data you need automatically.

First things first: install the following packages using pip: bs4, requests, pandas. The re and csv modules used below ship with Python’s standard library, so they do not need to be installed.

1. Emulate a user session

In order to scrape for backlink information, we need to emulate a normal user. We do this by simply going into your browser of choice, opening the Links section in GSC and selecting the Top Linking Sites section. Once here, we need to inspect the source code of the page by right-clicking and hitting Inspect.

(Image: Google Search Console)

In the developer tools, we go to the Network tab and select the first URL that appears and is a document type. It should be a request for a URL of the following type:

https://search.google.com/search-console/links?resource_id=sc-domain%3A{YourDomainName}

Click on the URL and, in the Headers section, look for the Request Headers section.

In order to emulate a normal session, we will need to add to our Python requests the request information that we see in the request header.

A few notes on this process:

You will notice that your request header will also contain cookie information. Python-wise, for the requests library, this information will be stored in a dictionary called cookies. The rest of the information will be stored in a dictionary named headers.
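Typing every cookie by hand is error-prone, so here is a minimal sketch of a helper that splits a copied cookie header into a dictionary. The function name parse_cookie_header and the sample string are hypothetical and are not part of the original script; the sketch assumes your browser shows the cookie header as a single "name=value; name=value" string.

```python
def parse_cookie_header(raw_cookie: str) -> dict:
    """Turn 'name=value; name2=value2' into {'name': 'value', 'name2': 'value2'}."""
    parsed = {}
    for pair in raw_cookie.split(";"):
        # partition splits on the first "=" only, so values containing "=" stay intact
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            parsed[name] = value
    return parsed


# Placeholder string (assumption): paste the value of your own cookie header here.
example_cookie_header = "SID=[your-info]; HSID=[your-info]; NID=[your-info]"
parsed_cookies = parse_cookie_header(example_cookie_header)
print(sorted(parsed_cookies))
```

The resulting dictionary can be passed to requests in the same way as a hand-written cookies dictionary.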
In effect, what we are doing is taking the information from the header and transforming it into two dictionaries as per the code below.

* replace [your-info] with your actual data

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import csv

headers = {
    "authority": "search.google.com",
    "method":"GET",
    "path":"[your-info]",
    "scheme":"https",
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"en-GB,en-US;q=0.9,en;q=0.8,ro;q=0.7",
    "cache-control":"no-cache",
    "pragma":"no-cache",
    "sec-fetch-site":"same-origin",
    "sec-fetch-user":"?1",
    "upgrade-insecure-requests":"1",
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "x-client-data":"[your-info]",
    "sec-ch-ua":'" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    "sec-ch-ua-mobile": "?0",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate"
}
cookies = {
    "__Secure-1PAPISID":"[your-info]",
    "__Secure-1PSID":"[your-info]",
    "__Secure-3PAPISID":"[your-info]",
    "__Secure-3PSID":"[your-info]",
    "__Secure-3PSIDCC":"[your-info]",
    "1P_JAR":"[your-info]",
    "NID":"[your-info]",
    "APISID":"[your-info]",
    "CONSENT":"[your-info]",
    "HSID":"[your-info]",
    "SAPISID":"[your-info]",
    "SID":"[your-info]",
    "SIDCC":"[your-info]",
    "SSID":"[your-info]",
    "_ga":"[your-info]",
    "OTZ":"[your-info]",
    "OGPC":"[your-info]"
}

The information displayed in your request header might be different in your case; don’t worry about the differences as long as you can create the two dictionaries.

Once this is done, execute the cell with the header and cookies information, as it’s time to start working on the first part of the actual script: collecting a list of referring domains that link back to your website.

* replace [your-domain] with your actual domain

genericURL = "https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target=&domain="
req = requests.get(genericURL, headers=headers, cookies=cookies)
soup = BeautifulSoup(req.content, 'html.parser')

The above URL is in effect the URL of the Top linking sites section, so please ensure that you update it accordingly.

You can test that you are bypassing the login by running the following code:

g_data = soup.find_all("div", {"class": "OOHai"})
for example in g_data:
    print(example)
    break

The output of the above code should be a div with a class called “OOHai”. If you see anything of the sort, you can continue.

2. Create a List of Referring Domains

The next step in this process will be to leverage Python and Pandas to return a list with all of the referring domains that point at your domain.

g_data = soup.find_all("div", {"class": "OOHai"})

dfdomains = pd.DataFrame()
finalList = []
for externalDomain in g_data:
    myList = []
    out = re.search(r'<div class="OOHai">(.*?(?=<))', str(externalDomain))
    if out:
        myList.append(out.group(1))
    finalList.append(myList)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
dfdomains = pd.concat([dfdomains, pd.DataFrame(finalList, columns=["External Domains"])], ignore_index=True)

domainsList = dfdomains["External Domains"].to_list()

The above code initialises an empty Pandas DataFrame, which is then populated with the external domains. The domains are identified by running through the entire HTML and identifying all of the divs that are in the “OOHai” class. If any such information is present, the dfdomains DataFrame will be populated with the names of the external domains.

3. Extract Backlink information for each Domain

Next we will extract the backlink information for all domains: Top sites linking to this page and also Top linking pages (practically the 3rd level from GSC, only the first value).

def extractBacklinks():
    for domain in domainsList[:]:
        url = f"https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target={domain}&domain="

        request = requests.get(url, headers=headers, cookies=cookies)
        soup = BeautifulSoup(request.content, 'html.parser')
        
        for row in soup.find_all("div", {"class": "OOHai"}):          
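            output = row.text
            # The characters to strip did not survive in this copy of the code:
            # put the stray characters from your own output in the first argument
            stripped_output = output.replace("", "")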
        
            domain_stripped = str(domain.split('https://')[1].split('/')[0])
        
            print ("---------")
            print ("Domain: " + domain)
            print ("---------")
            
            url_secondary = f"https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target={domain}&domain={stripped_output}"
            request_secondary = requests.get(url_secondary, headers=headers, cookies=cookies)
            soup_secondary = BeautifulSoup(request_secondary.content, 'html.parser')
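            # Default to an empty value so the CSV row is still valid when no page is found
            stripped_output_last = ""
            for row_secondary in soup_secondary.find_all("div", {"class": "OOHai"}):
                # Keep only the first top linking page
                output_last = row_secondary.text
                stripped_output_last = output_last.replace("", "")
                break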

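            # newline='' stops the csv module from writing blank rows on Windows;
            # the with block closes the file, so no explicit close() is needed
            with open(f"{domain_stripped}.csv", 'a', newline='') as file:
                writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                writer.writerow([domain, stripped_output, stripped_output_last])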

extractBacklinks()

Because Beautiful Soup is returning some strange characters, we are stripping them using Python’s replace method.

All the URLs are added into a .csv file (located in the same directory where the script is present).

Have fun!
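If you would rather not list each strange character by hand for the replace step, one possible alternative is to drop every character that Unicode classifies as control, format, private-use or unassigned. This is only a sketch: the helper name strip_invisible and the sample string are hypothetical, and it is an assumption that the characters Beautiful Soup returns for GSC fall into those categories.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Remove characters whose Unicode category starts with 'C' (control, format, private use, unassigned)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))


# Hypothetical sample: a URL followed by a left-to-right mark and a private-use glyph
sample = "https://example.com/page\u200e\ue89e"
print(strip_invisible(sample))
```

Ordinary letters, digits, punctuation and spaces are kept, so URLs pass through unchanged. Line breaks and tabs count as control characters and are removed too.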
