Optimizing for Crawl Budget is an often-overlooked technical SEO hack that can have a big impact on larger websites.
Generally speaking, Google is really good at indexing pages. But if your site has thousands of pages (an e-commerce site, perhaps), has recently added a whole new section with hundreds of new pieces of content, or is even a mid-sized site with lots of redirects, you're going to want to think about crawl budget.
What exactly is a crawl budget?
Google divides its attention among millions of websites every day. Since it doesn't have unlimited resources, it assigns a "budget" to your site that's generally determined by its size, "health", and backlink portfolio.
It also takes into account the following two factors:
- Crawl limit (or host load): how much crawling can a website handle, and what are its owner's preferences?
- Crawl demand (or crawl scheduling): which URLs are worth (re)crawling the most, based on their popularity and how often they're updated.
How and when Google decides to crawl (or re-crawl) a page is not always predictable. It could be because new links have appeared pointing at that content; perhaps someone shared it on social media, or maybe the URL was updated in the XML sitemap.
It's a common misconception that Google crawls pages as soon as they're published. In fact, it can sometimes take weeks, which might get in the way of your SEO efforts. This is especially true for larger sites with 10,000 or more pages. These sites are at particular risk of having some of their most valuable content fall into the big, black Google vortex, and who wants that!
The truth is: if Google doesn't index a page, it's not going to rank for anything. And if it doesn't rank, it's not going to be found.
So, if your number of pages exceeds your site's crawl budget, there are going to be pages on your site that aren't indexed and aren't seen. Similarly, if you're under your crawl budget but have a medium-to-large site without a clear structure, you can be sure it won't be crawled as efficiently as you might like.
When crawl budget becomes a problem
Let's say your site has 50,000 pages and Google only crawls 2,000 of them each day. Some pages (like the homepage) will get crawled more often than others, which means it could take weeks before other, less well-optimized pages get noticed.
To determine whether your site has a crawl budget issue, you can follow the steps below:
1. Determine how many pages you have on your site. One way to do this is to check the number of URLs in your XML sitemap.
2. Go into Google Search Console.
3. Go to "Settings" -> "Crawl stats" and take note of the average pages crawled per day.
4. Divide the number of pages by the "Average crawled per day" number.
5. If you end up with a number lower than 3, that's OK. If you end up with a number higher than ~10 (i.e. you have 10x more pages than Google crawls each day), you may have an issue.
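The steps above boil down to one division plus a rule of thumb. As a quick sanity check, here's a minimal Python sketch (the thresholds of 3 and 10 come from the rule of thumb above; the page and crawl counts are made-up examples):

```python
def crawl_budget_ratio(total_pages, avg_crawled_per_day):
    """Pages on the site divided by pages Google crawls per day."""
    return total_pages / avg_crawled_per_day

# Hypothetical numbers: 50,000 pages, 2,000 crawled per day.
ratio = crawl_budget_ratio(50_000, 2_000)
print(f"Ratio: {ratio:.1f}")  # 25.0

if ratio < 3:
    print("Likely fine: your whole site can be re-crawled within a few days.")
elif ratio > 10:
    print("Possible crawl budget issue: many pages wait a long time for a crawl.")
else:
    print("Grey area: worth monitoring.")
```

A ratio of 25, as in the 50,000-page example above, means it would take Google almost a month to get through every URL once.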
The next step will be to find out which URLs Google is currently crawling on your site by looking at your siteâs server logs.
Depending on your website host, you might not always be able to access your log files super easily. However, if your site is big, then I'd strongly recommend changing hosts to one that DOES let you see them. It matters.
You can't fix a site by looking at it from the outside. You need to get in there and find the problem; otherwise it's all guesswork.
When checking how many times Google is crawling your website, it's always safest to check multiple sources. Your site's server logs and Google Search Console are the best places to start.
In Google Search Console, follow these steps:
- Log in to Google Search Console and choose your website.
- Go to "Settings" -> "Crawl stats". There you can see the number of pages that Google crawls per day.
While looking at your site's server logs, you'll also see which pages return a 404, meaning "Page Not Found".
Don't ignore these. As a rule, every page should return either a 200 (an "OK" page) or a 301 (a redirect). Anything else is known as a "non-indexable" page and should be fixed as soon as possible. Otherwise, you're keeping search engines busy sifting through irrelevant pages.
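One way to spot those non-indexable responses is to tally the status codes of Googlebot requests in your access log. This is a minimal sketch assuming the common/combined log format; the sample lines and the plain "Googlebot" user-agent match are simplified placeholders you'd adapt to your own logs:

```python
import re
from collections import Counter

# Matches the request path and status code in a common/combined log line.
LINE_RE = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (?P<status>\d{3})')

def googlebot_status_counts(lines):
    """Count HTTP status codes for log lines that mention Googlebot."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LINE_RE.search(line)
        if m:
            counts[m.group("status")] += 1
    return counts

sample = [
    '66.249.66.1 - - [01/Jan/2024] "GET /product/42 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [01/Jan/2024] "GET /old-page HTTP/1.1" 404 0 "-" "Googlebot/2.1"',
    '203.0.113.5 - - [01/Jan/2024] "GET /product/42 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(googlebot_status_counts(sample))  # Counter({'200': 1, '404': 1})
```

A large share of anything other than 200s and 301s in that tally is exactly the wasted-budget signal described above.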
This is a waste of crawl budget, and the number one rule of optimizing your crawl budget is making sure it isn't wasted! As obvious as it sounds, it's much quicker to fix existing crawl budget issues than to start working on organically increasing the budget itself.
With that in mind, here are some other common reasons for wasted crawl budget:
1. Duplicate content
Pages that have exactly the same (or very similar) content are known as duplicates. These might be copied pages, internal search results pages, or something else.
Limiting duplicate content is a good idea for lots of reasons. One is the crawl budget.
You don't want search engines to waste their time looking through multiple copies of the same page. So it's important to prevent, or at the very least minimize, duplicate content on your site. You can do this by:
- Setting up website redirects for all domain variants (HTTP, HTTPS, non-WWW, and WWW).
- Making internal search result pages inaccessible to search engines using your robots.txt.
- Disabling dedicated pages for images.
- Being careful with your use of taxonomies such as categories and tags.
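For example, internal search result pages can be kept away from crawlers with a couple of lines in robots.txt (the `/search` path and `?s=` query parameter here are placeholders; use whatever your site's search URLs actually look like):

```txt
User-agent: *
Disallow: /search
Disallow: /*?s=
```

Note that this stops compliant crawlers from fetching those URLs, which is exactly what you want for saving crawl budget on low-value duplicate pages.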
Duplicate content will sometimes be necessary (for example, if you have print-friendly versions of certain web pages), but there are things you can do to signal to Google that you don't need these pages indexed.
2. Pages with high load times
Site pages that take a long time to load have a negative impact on crawl budget. Slow responses signal to search engines that your site can't handle the requests, so Google will de-prioritize those pages and may even lower your crawl limit.
Another reason is that a faster-loading website means Google can crawl more URLs, more quickly. Although it's a small fix, some sites have seen their crawl budget more than double after speeding up their pages.
But putting crawl budget aside for a moment: slow load times (2 seconds or more) can significantly hurt your visitors' user experience, resulting in a lower conversion rate. All in all, slow loads are a bad move for SEO.
3. Bad internal linking structure
Backlinks aside, how the pages within your website link to one another plays an important role in crawl budget optimization. If your internal link structure isn't set up correctly, search engines may not pay enough attention to some of your pages.
So make sure that your most important pages have plenty of internal links. Also avoid unbalanced linking: pages with just one or two links while others have tens.
And don't forget about external links, either. The dream would be to have a high-authority backlink pointing to every page on your site, but that's not realistic. Build up your backlinks, but focus more on internal linking for immediate structural benefits.
4. Orphan pages
Orphan pages are pages that have no internal or external links pointing to them.
Google has a really hard time finding orphan pages. So, if you want to get the most out of your crawl budget, make sure that there are at least two internal or external links pointing to every page on your site.
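If you have a crawl export (from a tool like Screaming Frog) or your own link graph, orphan pages are straightforward to detect: any known URL that no other page links to. A toy sketch, with made-up URLs:

```python
def find_orphans(all_pages, internal_links):
    """Return pages that no other page links to.

    internal_links: iterable of (source_url, target_url) pairs.
    """
    linked_to = {target for _, target in internal_links}
    return sorted(set(all_pages) - linked_to)

pages = ["/", "/about", "/blog/post-1", "/old-landing-page"]
links = [("/", "/about"), ("/", "/blog/post-1"), ("/about", "/blog/post-1")]
print(find_orphans(pages, links))  # ['/', '/old-landing-page']
```

In this toy data the homepage shows up only because nothing links *to* it; in practice you'd seed the check with pages you know crawlers can reach, such as the homepage and everything in the sitemap, and treat the rest of the output (here, the forgotten landing page) as true orphans.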
Once you've identified an orphan page, there are several ways to fix it. One option is to archive the page. That way the page and its information are still viewable, but it's no longer part of your live site.
Another method is to redirect the URL to a new location, ideally a relevant equivalent page. Crawlers and site visitors will then be sent to a page that you want them to see, and crawlers will index it accordingly.
5. Broken and redirecting links
Broken links and long chains of redirects are dead ends for search engines.
Google follows a maximum of five chained redirects in one crawl before giving up, either entirely or until a later crawl. This is why it's strongly advised that you avoid chained redirects and keep the use of redirects to a minimum.
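If you keep a map of your site's redirects (source to target), you can flag chains that approach that five-hop limit before a crawler ever hits them. A sketch with hypothetical URLs:

```python
def chain_length(url, redirects, limit=5):
    """Follow url through a redirect map and return the number of hops.

    redirects: dict mapping a URL to the URL it 301s to.
    Stops at `limit` hops, or on a redirect loop, and returns `limit`.
    """
    hops = 0
    seen = {url}
    while url in redirects and hops < limit:
        url = redirects[url]
        hops += 1
        if url in seen:  # redirect loop: as bad as a too-long chain
            return limit
        seen.add(url)
    return hops

redirects = {
    "/old": "/older",
    "/older": "/oldest",
    "/oldest": "/final",
}
print(chain_length("/old", redirects))    # 3 hops: point /old straight at /final
print(chain_length("/final", redirects))  # 0 hops: fine
```

The fix for any long chain is the same: repoint every hop directly at the final destination so each URL resolves in a single 301.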
By fixing broken links and redirecting links, you can quickly recover wasted crawl budget. You'll also be significantly improving your site visitors' user experience. Redirects (and chains of redirects, in particular) cause longer page load times and consequently hurt the overall user experience.
6. Low-quality content
Pages with little (or thin) content aren't interesting to search engines and do not tend to rank well.
Avoid them completely, if possible. Or, at the very least, keep crawlers away from them using your robots.txt file. (Strictly speaking, robots.txt blocks crawling rather than indexing; to keep a page out of the index entirely, a noindex meta tag is the more direct tool.)
However, you should ONLY do this if you're sure of what you're doing. Otherwise, you could be blocking some of your most valuable content from being indexed.
In conclusion, when it comes to large sites, crawl budget is one of the most important factors from a technical SEO perspective. That's because, if you do it right, you'll automatically be optimizing for internal linking, page speed, URL errors, low-quality content, and more: all very important SEO success factors.
And, while smaller sites needn't worry so much about crawl budget, the same principles of optimization will certainly help them to rank.