The Log File Analysis Shortcut

Log file analysis pipeline: Server access logs filtered for Googlebot, analyzed for crawl patterns, revealing crawl frequency and waste

Search Console tells you what Google thinks about your pages: what it believes it has indexed and how it interprets your content. Server logs tell you what Google actually does: where the crawler really goes and how often it returns. The map is not the territory.

The difference matters enormously because logs show you crawl behavior that no paid tool can replicate: which pages get crawled daily and which get ignored for months, where the crawler wastes budget on pages you don't care about, and which important pages it somehow never visits.

Getting the Logs

Ask your hosting provider or dev team for access logs. You want at least 30 days of data so you can see patterns rather than anomalies. The format varies by server, but at minimum you need the timestamp, requested URL, user agent, and status code for each request.
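
If you want to sanity-check what you receive, here is a minimal Python sketch for pulling those four fields out of each line. It assumes the common Apache/Nginx "combined" log format, which is an assumption on my part, so adjust the pattern to whatever your server actually writes.

```python
import re

# Pattern for the Apache/Nginx "combined" log format (an assumption;
# check your server config and adjust to the format it actually writes).
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Extract the four fields that matter: timestamp, URL, status code, user agent."""
    match = LINE_RE.match(line)
    if not match:
        return None
    return {
        "timestamp": match.group("timestamp"),
        "url": match.group("url"),
        "status": match.group("status"),
        "user_agent": match.group("user_agent"),
    }
```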

Filter to Googlebot requests only; you can identify them because the user agent contains "Googlebot" for web crawling. Ignore Googlebot-Image and Googlebot-Video unless image or video SEO specifically matters for your site.
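
Building on the parsing sketch above, the filter step might look like the following. The substring checks are deliberately simple and won't catch bots that spoof the Googlebot user agent; verifying that requests really come from Google is a separate exercise.

```python
def is_googlebot_web(record):
    """Keep web-crawling Googlebot; drop the image and video crawlers."""
    ua = record["user_agent"]
    return ("Googlebot" in ua
            and "Googlebot-Image" not in ua
            and "Googlebot-Video" not in ua)

def googlebot_requests(log_path):
    """Yield parsed Googlebot requests from an access log (uses parse_line from above)."""
    with open(log_path, errors="replace") as f:
        for line in f:
            record = parse_line(line)
            if record and is_googlebot_web(record):
                yield record
```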

The 4 Questions That Matter

1. What's getting crawled that shouldn't be?

Find URLs that Googlebot hits frequently but that have no business being in the index. The common culprits: parameter URLs from sort orders, filters, and tracking; internal search results that create infinite URL combinations; paginated archives going on forever to page 47, page 48, and beyond; and old URLs that should have been 301 redirected years ago but are still being crawled because nobody cleaned them up.

Every request to a junk URL is crawl budget not spent on the important pages you actually want Google to crawl and re-crawl frequently.
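
One way to surface those culprits is sketched below, reusing googlebot_requests from the filtering sketch. The junk patterns are illustrative placeholders, not a definitive list; swap in the parameters, internal search path, and pagination scheme your own site actually uses.

```python
from collections import Counter

# Placeholder patterns only; substitute the parameters, search path,
# and pagination scheme that exist on your own site.
JUNK_PATTERNS = ("?sort=", "?filter=", "?utm_", "/search?", "/page/")

def junk_crawl_counts(records):
    """Count Googlebot hits to URLs that arguably shouldn't be crawled at all."""
    counts = Counter()
    for record in records:
        if any(pattern in record["url"] for pattern in JUNK_PATTERNS):
            counts[record["url"]] += 1
    return counts

# Usage: the top 20 junk URLs eating crawl budget.
# for url, hits in junk_crawl_counts(googlebot_requests("access.log")).most_common(20):
#     print(hits, url)
```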

2. What's NOT getting crawled that should?

Compare your list of important URLs, the pages that actually matter to your business, against what Googlebot actually requests in the logs. If any critical pages show zero crawls in 30 days, that's a problem that needs addressing. The usual causes are poor internal linking, orphan pages with no links pointing to them, or pages buried too deep in your site architecture for the crawler to find efficiently.
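
A minimal sketch of that comparison follows. It assumes you can export the pages that matter as one path per line (important_urls.txt is a made-up filename) and that stripping query strings is an acceptable way to normalize the crawled URLs.

```python
def never_crawled(important_urls_path, records):
    """Return the important URLs that show zero Googlebot requests in the log window."""
    with open(important_urls_path) as f:
        important = {line.strip() for line in f if line.strip()}
    # Strip query strings so /product/foo?ref=x counts as a crawl of /product/foo.
    crawled = {record["url"].split("?")[0] for record in records}
    return sorted(important - crawled)

# for url in never_crawled("important_urls.txt", googlebot_requests("access.log")):
#     print("zero crawls in this window:", url)
```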

3. How often do important pages get crawled?

Your homepage might get crawled 100 times per day; your most important product pages should get crawled at least weekly. If key pages only get crawled monthly or less often, Google isn't treating them as important, and you need to fix the internal linking, add them to your sitemap, and build more authority to those pages through internal and external links.
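
To put a number on that, here is a rough sketch that turns raw hit counts into crawls per day. The 30-day default mirrors the log window suggested earlier; pass the actual span of your data.

```python
from collections import Counter

def crawls_per_day(records, days=30):
    """Average daily Googlebot crawls per URL over the log window."""
    counts = Counter(record["url"].split("?")[0] for record in records)
    return {url: hits / days for url, hits in counts.items()}

# rates = crawls_per_day(googlebot_requests("access.log"), days=30)
# print(rates.get("/your-key-product-page", 0.0))  # hypothetical URL
```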

4. What status codes is Googlebot seeing?

Filter by response code and look for problems that waste crawl budget and confuse the crawler: 5xx errors that indicate server problems Google is hitting; 404s that indicate broken pages Googlebot keeps trying to find; 301 and 302 redirect chains where Google is crawling the redirect instead of the final destination; and soft 404s where pages return a 200 status code but have no actual content.
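
A quick tally by response code is enough to spot most of these, as sketched below. Note that soft 404s won't show up in this breakdown because they return 200, so they still need a manual check.

```python
from collections import Counter

def status_breakdown(records):
    """Count Googlebot requests per status code, most frequent first."""
    return Counter(record["status"] for record in records).most_common()

# for status, hits in status_breakdown(googlebot_requests("access.log")):
#     print(status, hits)
```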

The Quick Analysis Method

You don't need fancy tools for this; the command line works fine. Filter to Googlebot requests, group by URL, count requests per URL, and sort by frequency to see what's getting the most attention.
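
If you'd rather script it than chain shell commands, the same steps fit in a short standalone Python script. The access.log filename and the combined log format are assumptions, and the substring filter is deliberately crude.

```python
from collections import Counter

# Count Googlebot hits per URL straight from the raw log, the script
# equivalent of filtering, grouping, counting, and sorting by frequency.
counts = Counter()
with open("access.log", errors="replace") as f:  # filename is an assumption
    for line in f:
        if ("Googlebot" not in line
                or "Googlebot-Image" in line
                or "Googlebot-Video" in line):
            continue
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split()  # e.g. ['GET', '/some/page', 'HTTP/1.1']
        if len(request) < 2:
            continue
        counts[request[1].split("?")[0]] += 1

# Top of the list: what Google currently treats as important.
for url, hits in counts.most_common(100):
    print(hits, url)

# Bottom of the list: what Google is barely touching. URLs with zero
# crawls never appear in the logs at all; that's what question 2 catches.
for url, hits in counts.most_common()[-100:]:
    print(hits, url)
```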

The top 100 most-crawled URLs tell you exactly what Google thinks is important on your site. If that list doesn't match what you think is important, if junk URLs are getting more attention than your money pages, you have work to do.

The bottom of the list, the URLs crawled only once or twice, shows what Google is barely touching, and pages with zero crawls won't appear in the logs at all. If important pages are languishing there or missing entirely, you need to fix your architecture, because Google isn't finding them or isn't considering them worth revisiting.

The Crawl Budget Reality

Small sites under 10,000 pages rarely have crawl budget problems; Google will get to everything eventually even if your internal linking is a mess. Large sites absolutely do. When you have 500,000 URLs and Googlebot only crawls 50,000 per month, covering everything once would take ten months even with zero waste, so you need to prioritize ruthlessly to ensure the right pages get crawled frequently.

Log analysis tells you exactly where to focus your efforts: block the junk that's wasting crawl budget, promote the good stuff that deserves more frequent crawling, and make every crawl count because you only have so many to work with.

The truth about crawl data
Log files show you what Google actually does, not what Google reports. Search Console data is sampled and delayed. Logs are complete and real-time. Trust the logs.

Most SEOs never look at server logs because it feels technical and intimidating, and they'd rather stick with the comfortable dashboards of their favorite tools. The ones who do look at logs find problems that nobody else can see, problems that no tool will surface because the tools don't have access to this data. Be the one who looks, the one who knows what's actually happening rather than what the reports claim is happening. And once you identify the crawl waste, clean up your redirects to maximize crawl efficiency.
