
Using Pandas For Dynamic Sitemap Generation

by Manfye Goh, May 5th, 2022

Too Long; Didn't Read

An XML sitemap is a file that acts as the roadmap of your website and leads Google to all your important pages. It is an important search engine optimization tool that lets you list your website pages in major search engines such as Google and Bing. When new pages are added continuously, a static sitemap falls behind; the solution is a dynamic sitemap. In this article, I’ll demonstrate how to use pandas to generate a sitemap from dummy data that resembles the queried item data. With the latest version (Pandas 1.4 and above), a sitemap can easily be generated from the data frame.


Imagine you are in a building where each room holds a different kind of item to show your visitors. How do you let a visitor know which room to visit? Yes, you need a map. The same applies to your website: to let search engines such as Google and Bing know what content your website holds, you need a good XML sitemap.


An XML sitemap is a file that acts as the roadmap of your website and leads Google to all your important pages. [1] For a data-driven marketer, it is therefore an important search engine optimization (SEO) tool that lets you list your website pages in the major search engines.

The basic structure of the sitemap

According to Google [2], a very basic XML sitemap that includes the location of a single URL is as below:


<?xml version="1.0" encoding="UTF-8"?> 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/foo.html</loc>
<lastmod>2018-06-04</lastmod>
</url>
</urlset>

The Sitemap protocol format consists of XML tags. All data values in a Sitemap must be entity-escaped. The file itself must be UTF-8 encoded.
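
To make the entity-escaping requirement concrete, here is a minimal sketch using Python's standard library. The helper name escape_loc and the sample URL are my own illustrations, not part of the protocol text. Note that this only matters when you write the XML by hand: XML serializers, including the pandas export used later in this article, escape text values themselves, so escaping beforehand would escape twice.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the sitemap protocol also
# asks for quotes to be escaped, so they are passed as extra entities.
EXTRA_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_loc(url):
    """Entity-escape a URL before placing it in a hand-written <loc> tag."""
    return escape(url, EXTRA_ENTITIES)


print(escape_loc("http://www.example.com/view?item=12&desc=vacation"))
# http://www.example.com/view?item=12&amp;desc=vacation
```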

The Sitemap must:

  • Begin with an opening <urlset> tag and end with a closing </urlset> tag.
  • Specify the namespace (protocol standard) within the <urlset> tag.
  • Include a <url> entry for each URL, as a parent XML tag.
  • Include a <loc> child entry for each <url> parent tag.


All other tags are optional. Support for these optional tags may vary among search engines. Also, all URLs in a Sitemap must be from a single host, such as www.example.com or store.example.com. [3]
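
The rules above can be turned into a rough self-check. The sketch below is only an illustration under my own assumptions: the function name check_sitemap and the wording of the messages are hypothetical, and it covers just the four required rules plus the single-host rule, not the full protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/foo.html</loc>
<lastmod>2018-06-04</lastmod>
</url>
</urlset>
"""


def check_sitemap(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text.encode("utf-8"))

    # The root must be <urlset> inside the sitemap namespace
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append("root element is not <urlset> in the sitemap namespace")

    hosts = set()
    for child in root:
        # Every entry must be a <url> parent tag
        if child.tag != f"{{{SITEMAP_NS}}}url":
            problems.append(f"unexpected element: {child.tag}")
            continue
        # Every <url> must have a non-empty <loc> child
        loc = child.find(f"{{{SITEMAP_NS}}}loc")
        if loc is None or not (loc.text or "").strip():
            problems.append("a <url> entry has no <loc> child")
        else:
            hosts.add(urlparse(loc.text.strip()).netloc)

    # All URLs must come from a single host
    if len(hosts) > 1:
        problems.append(f"URLs come from several hosts: {sorted(hosts)}")
    return problems


print(check_sitemap(SAMPLE))
```

An online validator is still the better final check; this kind of function is mainly useful as a quick guard inside a generation script.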


The Problem of Static Sitemap

Recently I have been working on one of my projects, AIPharm, which displays pharmaceutical data that is updated daily. The data is continuously updated and new pages are added every day. Without an updated sitemap, it can take months for Google to discover the new pages. The solution to the problem is a dynamic sitemap.


Why Pandas?

The problem of the static sitemap led me to look for a way to generate the sitemap dynamically, and I found the easiest solution available: Pandas!


With the latest version (Pandas 1.4 and above), a sitemap can easily be generated from the data frame. The other advantages of using pandas include:

  • Easy manipulation of pages
  • Python friendly
  • Few lines of code required


Let's start our coding now 💪


In the section below I’ll demonstrate how to generate a sitemap from dummy data that resembles the queried item data. You can obtain the JSON here and the final code of this article here.


Package required:

Most Python users already have pandas installed in their environment, but it is important to check the pandas version before starting to code:


import pandas as pd
print(pd.__version__)


Make sure your pandas version is 1.4.0 or above.


Upgrade your pandas if it is below 1.4.0:

pip install --upgrade pandas==1.4.1 
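
If you prefer the script itself to refuse to run on an older pandas, a small guard like the one below is one option. This is a sketch, not something the original pipeline contains: the helper name version_tuple and the MINIMUM constant are my own, and it only compares the major and minor numbers.

```python
import pandas as pd

MINIMUM = (1, 4)  # the version this article assumes


def version_tuple(version):
    """Return (major, minor) as integers from a string such as '1.4.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# Stop early with a readable message instead of failing later
if version_tuple(pd.__version__) < MINIMUM:
    raise RuntimeError(
        f"pandas {pd.__version__} found, but 1.4 or above is expected"
    )
```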


Import the necessary packages and start coding:

import pandas as pd
import numpy as np
import requests
from datetime import datetime
import urllib.parse
import re

# Timestamp reused for every lastmod value
now = datetime.now()


Next, we need to get the full list of dynamic items and load it into the data frame. Note that this code may vary depending on your backend data:


r = requests.get("https://reqres.in/api/products")
j = r.json()["data"]
df = pd.DataFrame.from_dict(j)
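
To experiment without calling the API, the same step can be tried on a hand-written payload. Everything in this sketch is an assumption made for illustration: the sample records are made up, and the only thing taken from the code above is that the items sit in a list under the "data" key. Your own backend may return a different shape.

```python
import pandas as pd

# Hypothetical payload: a list of item records under the "data" key
sample_payload = {
    "page": 1,
    "data": [
        {"id": 1, "name": "cerulean", "year": 2000},
        {"id": 2, "name": "fuchsia rose", "year": 2001},
    ],
}


def items_to_frame(payload):
    """Turn the list of item records into a data frame, one row per item."""
    records = payload.get("data", [])
    if not records:
        # An empty list would silently produce an empty sitemap later on
        raise ValueError("payload has no items under 'data'")
    return pd.DataFrame.from_records(records)


df_sample = items_to_frame(sample_payload)
print(df_sample[["id", "name"]])
```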


Generate Dynamic Item List

Here I use https://example.xyz/ as my URL prefix and build URLs with the pattern https://example.xyz/products/<id>/<product-name>. Since the products are updated monthly, I set a priority of 0.6 and a changefreq of monthly.

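def build_url(name, item_id, item_type):
    # Strip punctuation, keeping only word characters and whitespace
    url_name = re.sub(r"[^\w\s]", "", name)
    # Lowercase the name and turn spaces into hyphens
    url_name = url_name.lower().replace(" ", "-")
    # Percent-encode any remaining non-ASCII characters
    return "https://example.xyz/" + item_type + "/" + str(item_id) + "/" + urllib.parse.quote(url_name)

# Build the four sitemap columns for every product row
df["loc"] = df.apply(lambda x: build_url(x["name"], x["id"], "products"), axis=1)
df["lastmod"] = now.strftime("%Y-%m-%d")
df["changefreq"] = "monthly"
df["priority"] = 0.6
# Keep only the columns the sitemap needs, in the right order
df = df.reindex(columns=["loc", "lastmod", "changefreq", "priority"])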


After you have created the table, remember to reindex it so that it keeps only the 4 columns you need for the sitemap.
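
To see what the URL helper above does to a product name, here is a standalone sketch with two made-up names. It is a variation rather than a copy: the names slugify and product_url are my own, and unlike the helper above it collapses runs of whitespace, so a name like "Salt & Pepper" does not end up with a double hyphen once the "&" is removed.

```python
import re
import urllib.parse


def slugify(name):
    """Turn a product name into a URL-friendly slug."""
    # Drop punctuation, keeping word characters and whitespace
    cleaned = re.sub(r"[^\w\s]", "", name)
    # Collapse each run of whitespace into a single hyphen
    slug = re.sub(r"\s+", "-", cleaned.strip().lower())
    # Percent-encode any non-ASCII characters that remain
    return urllib.parse.quote(slug)


def product_url(item_id, name, base="https://example.xyz/products"):
    """Build a URL following the <base>/<id>/<product-name> pattern."""
    return f"{base}/{item_id}/{slugify(name)}"


print(product_url(7, "Vitamin C 500mg (Tablet)"))
# https://example.xyz/products/7/vitamin-c-500mg-tablet
print(product_url(8, "Salt & Pepper"))
# https://example.xyz/products/8/salt-pepper
```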


Generating Static Pages

Create a data frame with the 4 columns “loc”, “lastmod”, “changefreq”, and “priority”, holding one row for the home page and one row for each static page:

# DataFrame.append was removed in pandas 2.0, so collect the rows first
# and build the data frame in one step
today = now.strftime("%Y-%m-%d")
rows = [["https://example.xyz", today, "daily", 1.0]]  # home page

array_list = ["page1","page2","page3"]
for i in array_list:
    rows.append(["https://example.xyz/"+i, today, "daily", 1.0])

df_main = pd.DataFrame(columns=["loc","lastmod","changefreq","priority"], data=rows)


Combine both lists

Once you have both tables, combine them into the final table df_final. Remember to drop the old index to fit the desired format.

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_final = pd.concat([df_main, df])
df_final = df_final.reset_index(drop=True)


Publishing 🎉

The essence of the whole article is here! Simply use the DataFrame.to_xml method to export the data frame as a sitemap. Use the settings below to comply with the sitemap protocol:

df_final.to_xml("sitemap.xml" ,
                index=False,
                root_name='urlset',
                row_name='url',
                namespaces= {"": "http://www.sitemaps.org/schemas/sitemap/0.9"})
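
To inspect the result before writing a file, the same settings can be tried on a tiny frame. This is a sketch under a few assumptions of mine: the two rows are made up, the output is returned as a string by leaving out the file name, and parser="etree" is chosen so the example does not depend on lxml, which the default parser needs.

```python
import xml.etree.ElementTree as ET

import pandas as pd

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Two made-up rows standing in for df_final
df_demo = pd.DataFrame(
    [
        ["https://example.xyz", "2022-05-05", "daily", 1.0],
        ["https://example.xyz/products/1/cerulean", "2022-05-05", "monthly", 0.6],
    ],
    columns=["loc", "lastmod", "changefreq", "priority"],
)

# With no path given, to_xml returns the XML as a string
xml_text = df_demo.to_xml(
    index=False,
    root_name="urlset",
    row_name="url",
    namespaces={"": SITEMAP_NS},
    parser="etree",
)
print(xml_text)

# Read the XML back to confirm the structure the protocol asks for
root = ET.fromstring(xml_text.encode("utf-8"))
locs = [el.text for el in root.iter(f"{{{SITEMAP_NS}}}loc")]
print(locs)
```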


Note: Make sure you validate your sitemap with an online sitemap validator the first time you generate it.


Bonus: Github Upload

To upload the sitemap to GitHub, you can use the PyGithub package with the code below:

from github import Github
# using an access token
g = Github("XXXXXXXX")
repo = g.get_repo("xxxx/medium_article")
with open('sitemap.xml', 'r') as file:
    content = file.read()
    
contents = repo.get_contents("public/sitemap.xml")
repo.update_file("public/sitemap.xml", "update sitemap", content, contents.sha, branch="main")


This code uploads the sitemap to your chosen GitHub repo, and the new sitemap becomes available to Google once the website is rebuilt.


The code of this article is available here



Words from Author

As a React front-end developer and data lover, I find using pandas a simpler way to generate the sitemap. First, it gives me a separate pipeline for generating the sitemap, so I can control when the sitemap is generated. I don't need to push an unnecessary change just to trigger a rebuild, and I can set up cron tasks for sitemap generation.


Secondly, this method is easier to set up than other methods available online, such as JavaScript methods that generate the sitemap at build time or other Python methods that require multiple steps. It gives me total control over what is generated and included in the sitemap, and leaves room to generate a more complicated sitemap in the future.


Lastly, I would like to thank you for reading my article.



This article was first published here.

References:

[1] Yoast SEO

[2] https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap

[3] https://www.sitemaps.org/protocol.html