Imagine a building in which each room displays a different kind of item. How do visitors know which room to go to? They need a map. The same applies to your website: to let search engines such as Google and Bing know what content it holds, you need a good XML sitemap.
An XML sitemap is a file that acts as a roadmap of your website and leads Google to all your important pages. [1] For a data-driven marketer, it is an important search engine optimization (SEO) tool for getting your website's pages listed in the major search engines.
According to Google [2], a very basic XML sitemap that includes the location of a single URL looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/foo.html</loc>
<lastmod>2018-06-04</lastmod>
</url>
</urlset>
The Sitemap protocol format consists of XML tags. All data values in a Sitemap must be entity-escaped. The file itself must be UTF-8 encoded.
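Entity escaping matters most for URLs that carry query strings, since a raw ampersand is not valid inside XML text. The snippet below is a minimal sketch using only the standard library; the helper name `escape_loc` and the example URL are illustrative assumptions. XML serializers generally do this for you when they write text nodes, so a helper like this is mainly useful when a sitemap is assembled by hand.

```python
from xml.sax.saxutils import escape

# escape() already handles &, < and >; the protocol also asks for quotes
# to be escaped, so they are passed as extra entities.
EXTRA_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_loc(url: str) -> str:
    """Return url with XML special characters replaced by entities."""
    return escape(url, EXTRA_ENTITIES)


print(escape_loc("https://example.xyz/search?q=tea&page=2"))
```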
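The Sitemap must begin with an opening <urlset> tag and end with a closing </urlset> tag, specify the namespace (protocol standard) within the <urlset> tag, include a <url> entry for each URL as a parent XML tag, and include a <loc> child entry for each <url> parent tag.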
All other tags are optional. Support for these optional tags may vary among search engines. Also, all URLs in a Sitemap must be from a single host, such as www.example.com or store.example.com. [3]
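A rough way to check the basics the protocol asks for (a `<urlset>` root in the sitemap namespace, a `<loc>` inside every `<url>`, and a single host throughout) is sketched below with the standard library. The function name `check_sitemap` and its messages are assumptions made for illustration; this is not a full schema validation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_sitemap(xml_text: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != f"{{{NS}}}urlset":
        problems.append("root element is not <urlset> in the sitemap namespace")
    hosts = set()
    for i, url in enumerate(root.findall(f"{{{NS}}}url"), start=1):
        loc = url.find(f"{{{NS}}}loc")
        if loc is None or not (loc.text or "").strip():
            problems.append(f"<url> entry {i} has no <loc>")
            continue
        hosts.add(urlsplit(loc.text.strip()).netloc)
    if len(hosts) > 1:
        problems.append(f"URLs span several hosts: {sorted(hosts)}")
    return problems


SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2018-06-04</lastmod>
  </url>
</urlset>"""

print(check_sitemap(SAMPLE))
```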
I have recently been working on one of my projects, AIPharm, which displays pharmaceutical data that is updated daily. The data changes continuously and new pages are added every day. Without a fresh sitemap, it can take months for Google to discover the new pages. The solution is a dynamic sitemap.
The limits of a static sitemap sent me looking for a way to generate one dynamically, and the easiest solution I found was pandas!
With a recent version (pandas 1.4 and above), a sitemap can be generated directly from a data frame. Other advantages of using pandas include:
Let's start our coding now 💪
In the section below I’ll demonstrate how to generate a sitemap from dummy data that resembles the queried item data. You can obtain the JSON here and the final code of this article here.
Package required:
Most Python users already have pandas installed in their environment. It is important to check the pandas version before you start coding.
import pandas as pd
print(pd.__version__)
Make sure your pandas version is 1.4.0 or above.
Upgrade your pandas if it is below 1.4.0:
pip install --upgrade pandas==1.4.1
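If the generator runs unattended, for example from a scheduled job, the version check can also be done in code. The helper below is a small sketch; `meets_minimum` is a hypothetical name, and it only compares the leading numeric parts of the version string.

```python
import re


def meets_minimum(version: str, minimum=(1, 4)) -> bool:
    """True if a version string such as '1.4.1' is at least `minimum`."""
    parts = []
    for piece in version.split(".")[: len(minimum)]:
        match = re.match(r"\d+", piece)  # leading digits only, so "0rc1" -> 0
        parts.append(int(match.group()) if match else 0)
    return tuple(parts) >= tuple(minimum)


try:
    import pandas as pd
except ImportError:
    print("pandas is not installed")
else:
    print(pd.__version__, meets_minimum(pd.__version__))
```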
Import the necessary packages and start coding:
import pandas as pd
import numpy as np
import requests
from datetime import datetime
import urllib.parse
import re
now = datetime.now()
Next, get the full list of dynamic items and load it into a data frame. Note that this code may vary depending on your backend data:
r = requests.get("https://reqres.in/api/products")
j = r.json()["data"]
df = pd.DataFrame.from_dict(j)
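The later steps rely on every record having an `id` and a `name`, so it can help to filter the payload before building the data frame. The sketch below works on an in-memory sample so it runs offline; `extract_items` and the sample fields are assumptions for illustration, and a real backend may return a different shape.

```python
def extract_items(payload: dict) -> list:
    """Return the records under "data" that have both an id and a name."""
    records = payload.get("data", [])
    return [r for r in records if "id" in r and r.get("name")]


# Illustrative payload; real responses may carry different fields
sample_payload = {
    "page": 1,
    "data": [
        {"id": 1, "name": "cerulean", "year": 2000},
        {"id": 2, "name": "fuchsia rose", "year": 2001},
        {"id": 3, "year": 2002},  # no name, so it is skipped
    ],
}

items = extract_items(sample_payload)
print([item["name"] for item in items])
```

In a live script, the payload would come from `r.json()`, ideally after calling `r.raise_for_status()` and passing a `timeout` to `requests.get`, and the filtered list can be handed to `pd.DataFrame`.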
Generate Dynamic Item List
Here I use https://example.xyz/ as my URL prefix and build URLs with the pattern https://example.xyz/products/<id>/<product-name>. Since the products are updated monthly, I set a priority of 0.6 and a changefreq of monthly.
def returnURL(name,id,type):
    # Strip punctuation, keeping only word characters and whitespace
    pattern = re.compile(r"[^\w\s]")
    url_name = pattern.sub("", name)
    # Lowercase the name and replace spaces with hyphens to form the slug
    url_name = url_name.lower().replace(" ","-")
    url = "https://example.xyz/"+type+"/"+str(id)+"/"+urllib.parse.quote(url_name)
    return url
df["loc"] = df.apply(lambda x: returnURL(x["name"],x["id"],"products"),axis=1)
df["lastmod"] = now.strftime("%Y-%m-%d")
df["changefreq"] = "monthly"
df["priority"] = 0.6
df = df.reindex(columns=["loc","lastmod","changefreq","priority"])
After creating the table, remember to reindex it so that it keeps only the 4 columns the sitemap needs.
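To see what kind of URLs this slug logic produces, here is a standalone sketch. It is a variant rather than the function above: `slugify` and `product_url` are hypothetical names, and the variant collapses runs of whitespace, so a name like "Salt & Pepper!" yields `salt-pepper`, where a plain space-to-hyphen replace would leave a double hyphen once the ampersand is stripped.

```python
import re
import urllib.parse


def slugify(name: str) -> str:
    """Lowercase, drop punctuation, and join words with single hyphens."""
    cleaned = re.sub(r"[^\w\s]", "", name).lower()
    # split() with no argument collapses runs of whitespace
    return urllib.parse.quote("-".join(cleaned.split()))


def product_url(name: str, item_id: int, section: str = "products") -> str:
    """Build a URL of the form https://example.xyz/<section>/<id>/<slug>."""
    return f"https://example.xyz/{section}/{item_id}/{slugify(name)}"


print(product_url("Fuchsia Rose", 2))
print(product_url("Salt & Pepper!", 7))
```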
Generating Static Pages
Create a data frame with the 4 columns "loc", "lastmod", "changefreq", and "priority", with one row for the home page and one for each static page:
cols = ["loc","lastmod","changefreq","priority"]
today = now.strftime("%Y-%m-%d")
# Collect the rows first; DataFrame.append is deprecated since pandas 1.4 and removed in 2.0
rows = [["https://example.xyz",today,"daily",1.0]]
array_list = ["page1","page2","page3"]
for i in array_list:
    rows.append(["https://example.xyz/"+i,today,"daily",1.0])
df_main = pd.DataFrame(columns=cols, data=rows)
Combine both lists
Once you have both lists, combine them into the final table df_final. Remember to drop the index to fit the desired format.
df_final = pd.concat([df_main, df])
df_final = df_final.reset_index(drop=True)
Publishing 🎉
This is the essence of the whole article! Simply use the DataFrame.to_xml method to export the data frame as a sitemap. Use the settings below to comply with the sitemap protocol:
df_final.to_xml("sitemap.xml",
                index=False,
                root_name='urlset',
                row_name='url',
                namespaces={"": "http://www.sitemaps.org/schemas/sitemap/0.9"})
Note: Make sure you validate your sitemap with an online sitemap validator the first time you generate it.
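For a quick local spot check before reaching for an online validator, the same settings can be used to render the XML as a string and parse it back. This is a sketch under a few assumptions: the small `demo` frame stands in for the real table, and `parser="etree"` is chosen so that the example needs only the standard library rather than lxml.

```python
import xml.etree.ElementTree as ET

import pandas as pd

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Stand-in data with the same 4 columns as the final table
demo = pd.DataFrame(
    {
        "loc": ["https://example.xyz", "https://example.xyz/page1"],
        "lastmod": ["2022-03-01", "2022-03-01"],
        "changefreq": ["daily", "daily"],
        "priority": [1.0, 0.6],
    }
)

# With no path given, to_xml returns the document as a string
xml_text = demo.to_xml(
    index=False,
    root_name="urlset",
    row_name="url",
    namespaces={"": NS},
    parser="etree",
)

# Parse the result back and pull out every <loc> value
root = ET.fromstring(xml_text.encode("utf-8"))
locs = [el.text.strip() for el in root.iter(f"{{{NS}}}loc")]
print(root.tag)
print(locs)
```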
Bonus: GitHub Upload
To upload to GitHub, you can use the PyGithub package with the code below:
from github import Github
# using an access token
g = Github("XXXXXXXX")
repo = g.get_repo("xxxx/medium_article")
# read the generated sitemap
with open('sitemap.xml', 'r') as file:
    content = file.read()
# fetch the existing file to get its SHA, which update_file requires
contents = repo.get_contents("public/sitemap.xml")
repo.update_file("public/sitemap.xml", "update sitemap", content, contents.sha, branch="main")
This code uploads the sitemap to your chosen GitHub repo, and the new sitemap becomes available to Google once the website is rebuilt.
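When the upload runs on a schedule, it may be worth skipping the commit if nothing has changed, so the site is not rebuilt needlessly. The function below is one possible sketch, not part of the original script: `upload_if_changed` is a hypothetical helper that expects a PyGithub repository object and assumes the file already exists in the repo.

```python
def upload_if_changed(repo, local_path, remote_path, branch="main"):
    """Commit local_path to remote_path only when its text differs.

    Returns True if a commit was made, False if the file was unchanged.
    """
    with open(local_path, "r", encoding="utf-8") as fh:
        new_text = fh.read()
    # get_contents returns the current file, including its SHA and bytes
    current = repo.get_contents(remote_path, ref=branch)
    if current.decoded_content.decode("utf-8") == new_text:
        return False
    repo.update_file(remote_path, "update sitemap", new_text, current.sha, branch=branch)
    return True
```

With the `repo` object from the upload code, a call would look like `upload_if_changed(repo, "sitemap.xml", "public/sitemap.xml")`.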
The code of this article is available here
Words from Author
As a React front-end developer and data lover, I find pandas a simpler way to generate the sitemap. First, it gives me a separate pipeline for generating the sitemap, so I control when it is generated. I don't need to push an unnecessary change on purpose, and I can set up cron tasks for sitemap generation.
Second, this method is easier to set up than other methods available online, such as the JavaScript approach that generates the sitemap at build time, or other Python methods that require multiple steps. It gives me total control over what is generated and included in the sitemap, and the ability to generate a more complicated sitemap in the future.
Lastly, thank you for reading my article.
This article was first published here.
[1] Yoast SEO
[2]
[3]