Optimizing for Crawl Budget is an often-overlooked technical SEO hack that can have a big impact on larger websites.
Generally speaking, Google is really good at indexing pages. But if your site has thousands of pages (an e-commerce site, perhaps), has recently added a whole new section with hundreds of new pieces of content, or is even a mid-sized site with lots of redirects, you're going to want to think about crawl budget.
What exactly is a crawl budget?
Google divides its attention among millions of websites every day. Since it doesn't have unlimited resources, it assigns a "budget" to your site that's generally determined by its size, "health", and backlink portfolio.
It also takes into account the following two factors:
- Crawl limit (or host load): how much crawling can a website handle, and what are its owner's preferences?
- Crawl demand (or crawl scheduling): which URLs are worth (re)crawling the most, based on their popularity and how often they're updated.
How and when Google decides to crawl (or re-crawl) a page is not always predictable. It could be because new links have appeared pointing at that content; perhaps someone shared it on social media, or maybe the URL was updated in the XML sitemap.
It's a common misconception that Google crawls pages as soon as they're published. In fact, it can sometimes take weeks, which might get in the way of your SEO efforts. This is especially true for larger sites with 10,000 or more pages. These sites are at particular risk of having some of their most valuable content fall into the big, black Google vortex, and who wants that!
The truth is: if Google doesn't index a page, it's not going to rank for anything. And if it doesn't rank, it's not going to be found.
So, if your number of pages exceeds your site's crawl budget, there are going to be pages on your site that aren't indexed and aren't seen. Similarly, if you're under your crawl budget but have a medium-to-large site without a clear structure, you can be sure it won't be crawled as efficiently as you might like.
When crawl budget becomes a problem
Let's say your site has 50,000 pages and Google only crawls 2,000 of them each day. Some pages (like the homepage) will get crawled more often than others, which means it could take weeks before other, less well-optimized pages get noticed.
To determine whether your site has a crawl budget issue, you can follow the steps below:
1. Determine how many pages you have on your site. One way to do this is to check the number of URLs in your XML sitemap.
2. Go into Google Search Console.
3. Go to "Settings" -> "Crawl stats" and take note of the average pages crawled per day.
4. Divide the number of pages by the "Average crawled per day" number.
5. If you end up with a number lower than 3, that's OK. If you end up with a number higher than ~10 (i.e. you have 10x more pages than Google crawls each day), you may have an issue.
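The steps above boil down to one division plus a rule of thumb. As a quick sanity check, here's a minimal Python sketch (the thresholds of 3 and 10 come from the rule of thumb above; the page and crawl counts are made-up examples):

```python
def crawl_budget_ratio(total_pages, avg_crawled_per_day):
    """Pages on the site divided by pages Google crawls per day."""
    return total_pages / avg_crawled_per_day

# Hypothetical numbers: 50,000 pages, 2,000 crawled per day.
ratio = crawl_budget_ratio(50_000, 2_000)
print(f"Ratio: {ratio:.1f}")  # 25.0

if ratio < 3:
    print("Likely fine: your whole site can be re-crawled within a few days.")
elif ratio > 10:
    print("Possible crawl budget issue: many pages wait a long time for a crawl.")
else:
    print("Grey area: worth monitoring.")
```

A ratio of 25, as in the 50,000-page example above, means it would take Google almost a month to get through every URL once.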
The next step will be to find out which URLs Google is currently crawling on your site by looking at your siteâs server logs.
Depending on your website host, you might not always be able to access your log files super easily. However, if your site is big, then I'd strongly recommend changing hosts to one that DOES let you see them. It matters.
You can't fix a site by looking at it from the outside. You need to get in there and find the problem; otherwise it's all guesswork.
When checking how many times Google is crawling your website, it's always safest to check multiple sources. Your site's server logs and Google Search Console are the best places to start.
In Google Search Console, follow these steps:
- Log in to Google Search Console and choose your website.
- Go to "Settings" -> "Crawl stats". There you can see the number of pages that Google crawls per day.
While looking at your site's server logs, you'll also see which pages return a 404, meaning "Page Not Found".
Don't ignore these. As a rule, every page should return either a 200 (an "OK" page) or a 301 (a redirect). Anything else is known as a "non-indexable" page and should be fixed as soon as possible. Otherwise, you're keeping search engines busy sifting through irrelevant pages.
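One way to spot those non-indexable responses is to tally the status codes of Googlebot requests in your access log. This is a minimal sketch assuming the common/combined log format; the sample lines and the plain "Googlebot" user-agent match are simplified placeholders you'd adapt to your own logs:

```python
import re
from collections import Counter

# Matches the request path and status code in a common/combined log line.
LINE_RE = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (?P<status>\d{3})')

def googlebot_status_counts(lines):
    """Count HTTP status codes for log lines that mention Googlebot."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LINE_RE.search(line)
        if m:
            counts[m.group("status")] += 1
    return counts

sample = [
    '66.249.66.1 - - [01/Jan/2024] "GET /product/42 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [01/Jan/2024] "GET /old-page HTTP/1.1" 404 0 "-" "Googlebot/2.1"',
    '203.0.113.5 - - [01/Jan/2024] "GET /product/42 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(googlebot_status_counts(sample))  # Counter({'200': 1, '404': 1})
```

A large share of anything other than 200s and 301s in that tally is exactly the wasted-budget signal described above.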
This is a waste of crawl budget, and the number one rule of optimizing your crawl budget is making sure it isn't wasted! As obvious as it sounds, it's much quicker to fix existing crawl budget issues than to start working on organically increasing the budget itself.
With that in mind, here are some other common reasons for wasted crawl budget:
1. Duplicate content
Pages that have exactly the same (or very similar) content are known as duplicates. These might be copied pages, internal search results pages, or something else.
Limiting duplicate content is a good idea for lots of reasons. One is the crawl budget.
You don't want search engines to waste their time looking through multiple copies of the same page. So it's important to prevent, or at the very least minimize, duplicate content on your site. You can do this by:
- Setting up website redirects for all domain variants (HTTP, HTTPS, non-WWW, and WWW).
- Making internal search result pages inaccessible to search engines using your robots.txt.
- Disabling dedicated pages for images.
- Being careful with your use of taxonomies such as categories and tags.
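For example, internal search result pages can be kept away from crawlers with a couple of lines in robots.txt (the `/search` path and `?s=` query parameter here are placeholders; use whatever your site's search URLs actually look like):

```txt
User-agent: *
Disallow: /search
Disallow: /*?s=
```

Note that this stops compliant crawlers from fetching those URLs, which is exactly what you want for saving crawl budget on low-value duplicate pages.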
Duplicate content will sometimes be necessary (for example, if you have print-friendly versions of certain web pages), but there are things you can do to signal to Google that you don't need these pages indexed.
2. Pages with high load times
Site pages that take a long time to load have a negative impact on crawl budget. Slow responses signal to search engines that your site can't handle the requests, so Google will de-prioritize those pages and may even lower your crawl limit.
Another reason is that a faster-loading website means Google can crawl more URLs, more quickly. Although it's a small fix, some sites have seen their crawl budget more than double after speeding up their pages.
But putting crawl budget aside for a moment: slow load times (2 seconds or more) can significantly hurt your visitors' user experience, resulting in a lower conversion rate. All in all, slow loads are a bad move for SEO.
3. Bad internal linking structure
Backlinks aside, how the pages within your website link to one another plays an important role in crawl budget optimization. If your internal link structure isn't set up correctly, search engines may not pay enough attention to some of your pages.
So make sure that your most important pages have plenty of internal links. Also avoid unbalanced linking: pages with just one or two links while others have tens.
And don't forget about external links, either. The dream would be to have a high-authority backlink pointing to every page on your site, but that's not realistic. Build up your backlinks, but focus more on internal linking for immediate structural benefits.
4. Orphan pages
Orphan pages are pages that have no internal or external links pointing to them.
Google has a really hard time finding orphan pages. So, if you want to get the most out of your crawl budget, make sure that there are at least two internal or external links pointing to every page on your site.
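If you have a crawl export (from a tool like Screaming Frog) or your own link graph, orphan pages are straightforward to detect: any known URL that no other page links to. A toy sketch, with made-up URLs:

```python
def find_orphans(all_pages, internal_links):
    """Return pages that no other page links to.

    internal_links: iterable of (source_url, target_url) pairs.
    """
    linked_to = {target for _, target in internal_links}
    return sorted(set(all_pages) - linked_to)

pages = ["/", "/about", "/blog/post-1", "/old-landing-page"]
links = [("/", "/about"), ("/", "/blog/post-1"), ("/about", "/blog/post-1")]
print(find_orphans(pages, links))  # ['/', '/old-landing-page']
```

In this toy data the homepage shows up only because nothing links *to* it; in practice you'd seed the check with pages you know crawlers can reach, such as the homepage and everything in the sitemap, and treat the rest of the output (here, the forgotten landing page) as true orphans.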
Once you've identified an orphan page, there are several ways to fix it. One option is to archive the page. That way the page and its information are still viewable, but it's no longer part of your live site.
Another method is to redirect the URL to a new location, ideally a relevant equivalent page. Crawlers and site visitors will then be sent to a page that you want them to see, and crawlers will index it accordingly.
5. Broken and redirecting links
Broken links and long chains of redirects are dead ends for search engines.
Google follows a maximum of five chained redirects in one crawl before giving up, either entirely or until a later crawl. This is why it's strongly advised that you avoid chained redirects and keep the use of redirects to a minimum.
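If you keep a map of your site's redirects (source to target), you can flag chains that approach that five-hop limit before a crawler ever hits them. A sketch with hypothetical URLs:

```python
def chain_length(url, redirects, limit=5):
    """Follow url through a redirect map and return the number of hops.

    redirects: dict mapping a URL to the URL it 301s to.
    Stops at `limit` hops, or on a redirect loop, and returns `limit`.
    """
    hops = 0
    seen = {url}
    while url in redirects and hops < limit:
        url = redirects[url]
        hops += 1
        if url in seen:  # redirect loop: as bad as a too-long chain
            return limit
        seen.add(url)
    return hops

redirects = {
    "/old": "/older",
    "/older": "/oldest",
    "/oldest": "/final",
}
print(chain_length("/old", redirects))    # 3 hops: point /old straight at /final
print(chain_length("/final", redirects))  # 0 hops: fine
```

The fix for any long chain is the same: repoint every hop directly at the final destination so each URL resolves in a single 301.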
By fixing broken links and redirecting links, you can quickly recover wasted crawl budget. You'll also be significantly improving your site visitors' user experience. Redirects (and chains of redirects, in particular) cause longer page load times and consequently hurt the overall user experience.
6. Low-quality content
Pages with little (or thin) content aren't interesting to search engines and do not tend to rank well.
Avoid them completely, if possible. Or, at the very least, keep crawlers away from them using your robots.txt file. (Strictly speaking, robots.txt blocks crawling rather than indexing; to keep a page out of the index entirely, a noindex meta tag is the more direct tool.)
However, you should ONLY do this if you're sure of what you're doing. Otherwise, you could be blocking some of your most valuable content from being indexed.
In conclusion, when it comes to large sites, crawl budget is one of the most important factors from a technical SEO perspective. That's because, if you do it right, you'll automatically be optimizing for internal linking, page speed, URL errors, low-quality content, and more: all very important SEO success factors.
And, while smaller sites needn't worry so much about crawl budget, the same principles of optimization will certainly help them to rank.