Server Log Analysis for SEO: How to Find Googlebot Problems and Improve Crawl Budget

Q: Where are server log files located?

On Apache, log files are typically at /var/log/apache2/access.log and /var/log/apache2/error.log. On Nginx — at /var/log/nginx/access.log and /var/log/nginx/error.log. In cPanel, access them via Logs → Raw Access or File Manager in the /logs/ directory.

Server log files record every Googlebot request to your site — URL, status code, timestamp, frequency. Analyzing 30 days of logs reveals where crawl budget leaks, which pages the bot ignores, and where technical errors hide before they damage rankings.

Contents

What are access.log and error.log
How to read a log line
What logs reveal about Googlebot
Log analysis tools
Case study: 40% budget on technical URLs
What to look for in logs
Crawl budget through logs
Python script for log analysis
FAQ

What are access.log and error.log

Every web server maintains two core journals. access.log records every HTTP request the server receives — from real visitors, bots, monitoring systems. error.log captures failures: PHP errors, missing files, configuration issues.

For SEO, access.log is the primary source. It shows exactly when and how often Googlebot visited, which URLs it requested, and what status codes it received. Unlike Google Search Console — which is sampled, delayed, and filtered — logs are the raw, unedited record of every exchange between your server and the search robot.

Where to find log files

Server	access.log location	error.log location
Apache (Linux)	/var/log/apache2/access.log	/var/log/apache2/error.log
Nginx (Linux)	/var/log/nginx/access.log	/var/log/nginx/error.log
cPanel	cPanel → Logs → Raw Access	cPanel → Logs → Error Log
Windows IIS	C:\inetpub\logs\LogFiles\	Same folder as access logs
Plesk	Logs → GUI access	Logs → Error Log

Practical tip: In cPanel, download the compressed monthly log archive via Raw Access — it contains separate files per domain. For an average site, expect 200–500 MB per month. Always work with a local copy rather than analyzing directly on the server.

Diagram: from access.log you extract three types of SEO insight — Googlebot crawl activity, status code distribution, and crawl budget allocation

How to read a log line

A typical Combined Log Format entry (the default for both Apache and Nginx) looks like this:

66.249.66.1 - - [21/May/2026:14:23:11 +0200] "GET /blog/seo-audit/ HTTP/1.1" 200 48234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Combined Log Format anatomy: each field carries distinct SEO information — from the crawling bot's identity to the exact response your server served

From this single line we learn: IP 66.249.66.1 (confirmed Googlebot via reverse DNS lookup), on May 21, 2026 at 14:23, made a GET request to /blog/seo-audit/, the server returned status 200 (success), response size 48 KB, and the User-Agent confirms Googlebot 2.1.
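The same breakdown can be done programmatically. The snippet below is a minimal sketch, assuming the log uses the standard Combined Log Format shown above; the names `COMBINED_RE` and `parse_line` are illustrative, not part of any library.

```python
import re
from datetime import datetime

# Combined Log Format fields, in order:
# ip, identity, user, [time], "request", status, size, "referer", "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    row = m.groupdict()
    row["status"] = int(row["status"])
    # Apache writes "-" instead of a number when no response body was sent
    row["size"] = int(row["size"]) if row["size"].isdigit() else 0
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    return row

sample = ('66.249.66.1 - - [21/May/2026:14:23:11 +0200] '
          '"GET /blog/seo-audit/ HTTP/1.1" 200 48234 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
row = parse_line(sample)
print(row["ip"], row["time"].isoformat(), row["url"], row["status"], row["size"])
```

Lines that do not fit the pattern return None, so a custom log format will show up as a high share of unparsed lines rather than as silently wrong data.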

What logs reveal about Googlebot

Filtering access.log by Googlebot's User-Agent string surfaces four key SEO data groups:

Crawl frequency: how many times Googlebot visited per day, week, or month. Based on our log analysis across 50+ client sites, high-quality new articles on established domains get crawled within 2–6 hours of publication.
URL crawl distribution: which pages the bot visits most frequently and which it barely touches. This is a direct proxy for Google's own assessment of your site's priority pages.
Status codes for the bot: how many requests returned 200, 301, 404, or 500. GSC shows only a subset of this data, with delays. Logs show every request, in real time.
Crawl timing: which hours Googlebot is most active on your site. Useful for scheduling deployments, database maintenance, and server-heavy operations.
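Crawl frequency and crawl timing from the list above can be counted in a few lines. This is a sketch under the assumption that the log is in Combined Log Format with the User-Agent as the last quoted field; `crawl_counts` is a hypothetical helper name.

```python
import re
from collections import Counter
from datetime import datetime

TIME_RE = re.compile(r'\[([^\]]+)\]')
UA_RE = re.compile(r'"([^"]*)"\s*$')  # last quoted field on the line

def crawl_counts(lines, ua_token="Googlebot"):
    """Count matching requests per calendar day and per hour of day."""
    per_day, per_hour = Counter(), Counter()
    for line in lines:
        ua = UA_RE.search(line)
        if not ua or ua_token.lower() not in ua.group(1).lower():
            continue
        stamp = TIME_RE.search(line)
        if not stamp:
            continue
        try:
            ts = datetime.strptime(stamp.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip lines with an unexpected timestamp format
        per_day[ts.date().isoformat()] += 1
        per_hour[ts.hour] += 1
    return per_day, per_hour
```

Passing an open file object as `lines` works, since the function only iterates once. Hours are reported in the time zone recorded in the log, not converted to UTC.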

The most common surprise we find in client logs: the site owner assumes Googlebot is crawling product pages and articles, but 60–70% of bot requests are actually hitting static assets (CSS, JS, fonts) and URL parameter variations. The bot sees far fewer content pages than anyone realised.

Googlebot User-Agent strings to filter by

Bot	User-Agent string	What it crawls
Googlebot	Mozilla/5.0 (compatible; Googlebot/2.1; ...)	HTML pages — the main crawler
Googlebot-Image	Googlebot-Image/1.0	Images for Google Images
Googlebot Smartphone	...Mobile Safari/537.36 (compatible; Googlebot/2.1; ...)	Mobile version of your site
AdsBot-Google	AdsBot-Google (+http://...)	Landing pages for ad quality checks
Google-InspectionTool	Google-InspectionTool/1.0	URL Inspection in GSC
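A User-Agent string can be set by any client, so a filter on it alone may include impostors. The log-line walkthrough mentions confirming Googlebot via reverse DNS; the sketch below shows one way to do that check: reverse lookup, hostname suffix test, then a forward lookup that must return the original IP. The suffix list and the function name `is_verified_googlebot` are assumptions for illustration, and should be checked against Google's current verification guidance. The default forward lookup here handles IPv4 only.

```python
import socket

# Assumed hostname suffixes for Google crawlers; verify against Google's documentation
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the hostname, then confirm with a forward lookup.

    The two lookup callables can be replaced (for example in tests or with a cache).
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, otherwise the PTR record is spoofed
        return ip in forward_lookup(host)
    except OSError:
        return False  # lookup failed: treat as unverified
```

DNS lookups are slow, so in practice it makes sense to verify each distinct IP once and cache the result rather than checking every log line.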

Log analysis tools

Tool choice depends on log file size and the depth of analysis you need:

Tool	Type	Log size	Price	Best for
GoAccess	CLI / web UI	up to 1 GB	Free	Fast overview, real-time mode
Screaming Frog Log File Analyser	Desktop app	up to 1 GB	Free / $259/yr	SEO-focused reports, bot filters
Python + pandas	Code	Unlimited	Free	Large logs, automation, custom metrics
ELK Stack	Server-side	Terabytes	Free / Enterprise	Enterprise sites, dashboards
Splunk	Enterprise	Unlimited	$150+/mo	ML anomaly detection, alerting

GoAccess: quick start

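GoAccess is the fastest way to see a general crawl picture without writing code. Install it on the server or, in line with the tip above about working on a local copy, on the machine where you keep the downloaded log: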

# Install on Ubuntu/Debian
sudo apt-get install goaccess

# Analyze and generate an HTML report
zcat /var/log/nginx/access.log.gz | \
  grep -i "googlebot" | \
  goaccess - -o /tmp/googlebot-report.html \
  --log-format=COMBINED --no-global-config

Screaming Frog Log File Analyser

For teams without Python experience, this is the most accessible option. Load the log, set a User-Agent filter for "Googlebot," and get prebuilt reports: top URLs by crawl count, status code breakdown, activity charts by date. The free version handles up to 1,000 log lines — enough to verify the workflow before committing to a full analysis.

Tip from our workflow: Before loading a large log into Screaming Frog, filter out static asset requests. Exclude requests whose URL ends in .css, .js, .png, .jpg, or .woff2; otherwise static files dominate the URL list and the real crawl distribution across content pages becomes unreadable.
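The pre-filtering step could look like the following sketch. It assumes Combined Log Format and uses the extension list from the tip; `is_static_request` and `filter_log` are hypothetical names.

```python
import re

STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".woff2")
# First quoted field of a log line: "METHOD url protocol"
REQUEST_RE = re.compile(r'"(?:[A-Z]+) (\S+) [^"]*"')

def is_static_request(line):
    """True if the requested path (ignoring any query string) ends in a static extension."""
    m = REQUEST_RE.search(line)
    if not m:
        return False
    path = m.group(1).split("?", 1)[0].lower()
    return path.endswith(STATIC_EXTENSIONS)

def filter_log(src_path, dst_path):
    """Copy every non-static line from src_path to dst_path; return the number kept."""
    kept = 0
    with open(src_path, encoding="utf-8", errors="ignore") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not is_static_request(line):
                dst.write(line)
                kept += 1
    return kept
```

Matching on the path rather than the whole line keeps versioned assets such as site.css?v=3 in the static group and avoids dropping a content URL just because its referer happens to be a stylesheet.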

Case study: 40% of crawl budget on technical URLs

One of our clients ran an e-commerce store with approximately 8,000 SKUs in the home goods category. Organic traffic had dropped 18% over a quarter despite consistent content updates and weekly product page publications. GSC showed no critical indexation errors.

We pulled access.log for 30 days (compressed archive: ~380 MB), isolated Googlebot requests, and classified URLs by type. The breakdown:

40% of Googlebot requests — filter parameter URLs: ?color=red&size=L&sort=price
15% of requests — session parameter URLs: ?PHPSESSID=abc123...
12% of requests — pagination duplicates: /catalog/page-1/ (even though canonical already pointed to /catalog/)
33% of requests — actual product and category pages that should have been indexed

Only a third of the crawl budget reached valuable URLs. New product pages published weekly were queuing up 2–3 weeks before Googlebot saw them, instead of the 2–3 days expected.

What we did

Added robots.txt Disallow rules for all filter and session parameters: Disallow: /*?*color=, Disallow: /*?*PHPSESSID=.
Set a proper canonical on all pagination pages pointing to the first page of each category (Google dropped support for rel=next/prev years ago).
Audited sitemap.xml — removed technical URLs that had snuck in via a bug in the sitemap generator.
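Before deploying wildcard rules like the ones above, it helps to test them against real URLs from the log. Python's built-in urllib.robotparser does not interpret `*` wildcards, so the sketch below converts a Disallow pattern to a regular expression by hand. It is a simplified illustration: it ignores Allow rules and rule precedence, and the function names are hypothetical.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(url_path, disallow_patterns):
    """True if any Disallow pattern matches the URL path from its beginning."""
    return any(robots_pattern_to_regex(p).match(url_path) for p in disallow_patterns)

RULES = ["/*?*color=", "/*?*PHPSESSID="]

for path in ["/catalog/?color=red&size=L", "/catalog/", "/cart?PHPSESSID=abc123"]:
    print(path, is_disallowed(path, RULES))
```

Note that a pattern such as /*?*color= also matches any parameter whose name merely ends in "color", so running the rules over a sample of logged URLs is a cheap way to catch over-blocking.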

Six weeks later, the useful crawl share had risen from 33% to 74%. New product pages started appearing in the index within 2–5 days. Organic traffic recovered and then climbed 11% above the previous baseline.

E-commerce case study: useful crawl share rose from 33% to 74% after blocking technical URL parameters

What to look for in logs: a checklist

1. Googlebot 404 errors

Every 404 response to Googlebot is a crawl budget waste on a page that no longer exists. Common causes: deleted pages without redirects, stale URLs in sitemap.xml, internal links pointing to removed content.

# Find all Googlebot 404s, sorted by frequency
grep -i "googlebot" access.log | grep '" 404 ' | \
  awk '{print $7}' | sort | uniq -c | sort -rn | head -50

2. 5xx server errors

A 500, 502, or 503 response to Googlebot signals the server failed to process the request. When the bot systematically receives 5xx responses, it marks those pages as unreliable and reduces crawl frequency over time. We've seen even 2–3 days of server instability during peak load result in weeks of suppressed crawl activity.
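To spot such instability windows, one option is to track the daily share of 5xx responses served to Googlebot. The sketch below assumes the records have already been extracted from the log as (date, status) pairs; the 5% alert threshold is an arbitrary placeholder, not a value from Google.

```python
from collections import defaultdict

def daily_5xx_share(records):
    """records: iterable of (date_string, status_code) for Googlebot requests.

    Returns {date_string: percent of requests answered with a 5xx status}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for day, status in records:
        totals[day] += 1
        if 500 <= status <= 599:
            errors[day] += 1
    return {day: round(100 * errors[day] / totals[day], 1) for day in sorted(totals)}

def flag_unstable_days(shares, threshold_pct=5.0):
    """Days whose 5xx share meets or exceeds the (assumed) threshold."""
    return [day for day, pct in shares.items() if pct >= threshold_pct]
```

Comparing the flagged days against the crawl counts for the following weeks shows whether crawl activity actually dipped after an incident.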

3. Redirect chains

Each redirect is an extra request and a crawl time cost. If Googlebot routinely follows chains (A→B→C), update the source links to point directly to the final destination URL.
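Combined Log Format records the 301 or 302 status but not the redirect target, so chains have to be reconstructed from a separate source. The sketch below assumes a redirect map (source URL to target URL) exported from a crawler or from the server configuration; the map and function names are hypothetical.

```python
def resolve_chain(url, redirects, max_hops=10):
    """Follow url through the redirect map and return every URL visited, in order."""
    chain = [url]
    while chain[-1] in redirects and len(chain) <= max_hops:
        nxt = redirects[chain[-1]]
        chain.append(nxt)
        if nxt in chain[:-1]:
            break  # redirect loop: stop after recording the repeated URL
    return chain

def find_chains(redirects):
    """Return only the sources that need more than one hop to reach their destination."""
    chains = {}
    for src in redirects:
        chain = resolve_chain(src, redirects)
        if len(chain) > 2:
            chains[src] = chain
    return chains

redirect_map = {"/a": "/b", "/b": "/c", "/x": "/y"}
print(find_chains(redirect_map))
```

The last element of each chain is the URL that internal links and sitemap entries should point to directly.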

4. Priority pages with rare crawling

Cross-reference your sitemap.xml priority URLs against their log frequency over 30 days. A strategic page — category page, landing page, cornerstone article — that hasn't appeared in Googlebot logs for 14+ days needs a health check: confirm it's accessible via internal links, has no noindex tag, and is included in your sitemap.
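This cross-reference is straightforward to script once you have a list of priority URLs and the last Googlebot hit per URL from the log. The sketch below assumes both inputs are already prepared; `stale_priority_urls` is a hypothetical name and the 14-day default mirrors the threshold mentioned above.

```python
from datetime import date

def stale_priority_urls(priority_urls, last_crawled, today, max_age_days=14):
    """Return (url, days_since_last_crawl) for priority URLs not crawled recently.

    last_crawled maps url -> date of the most recent Googlebot request.
    URLs never seen in the log are reported with None as their age.
    """
    stale = []
    for url in priority_urls:
        seen = last_crawled.get(url)
        age = None if seen is None else (today - seen).days
        if age is None or age >= max_age_days:
            stale.append((url, age))
    return stale

priority = ["/pricing/", "/features/", "/docs/"]
last_seen = {"/pricing/": date(2026, 5, 20), "/features/": date(2026, 5, 1)}
print(stale_priority_urls(priority, last_seen, today=date(2026, 5, 21)))
```

URLs reported with None never appeared in the analyzed period at all, which usually makes them the first candidates for the health check.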

Crawl budget through logs: measure and optimize

Crawl budget is the number of pages Googlebot will crawl on your site in a given session. It's not static — it adjusts based on domain authority, server response time, and how "rewarding" the bot finds your content.

From logs you can measure both components of crawl budget described in Google's official crawl budget documentation:

Crawl rate limit: the maximum speed at which the bot can crawl without overloading your server. Logs reveal how many simultaneous requests Googlebot makes and how quickly your server responds.
Crawl demand: how "interesting" Google considers your site. If crawl frequency increases after a content quality improvement, that's a measurable positive signal visible in the logs.

For a detailed overview of all GSC tools and reports for SEO analysis, see our complete Google Search Console guide.

Crawl efficiency formula

Crawl efficiency = (Googlebot requests to content URLs) / (Total Googlebot requests) × 100%

Target: above 60%. Below this threshold, budget is leaking to junk URLs.
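As a worked example of the formula, the small helper below computes the percentage and compares it with the 60% target. The function names are illustrative.

```python
def crawl_efficiency(content_requests, total_requests):
    """Share of Googlebot requests that hit content URLs, as a percentage."""
    if total_requests == 0:
        return 0.0
    return round(content_requests / total_requests * 100, 1)

def meets_target(efficiency_pct, target_pct=60.0):
    """True when efficiency is above the target from the text."""
    return efficiency_pct > target_pct

# Using the shares from the e-commerce case study: 33 of every 100 requests
# reached content before the fix, 74 of every 100 after it.
before = crawl_efficiency(33, 100)
after = crawl_efficiency(74, 100)
print(before, meets_target(before))
print(after, meets_target(after))
```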

What to block from crawling

Filter and sort parameters: ?sort=price&order=asc
Session identifiers: ?PHPSESSID=, ?sessionid=
UTM parameters: ?utm_source=, ?utm_medium=
www / non-www duplicates: if no hard 301 redirect is in place
Internal site search pages: /search?q=
Login and account pages: /login, /account/
Staging and test URLs: if accidentally publicly accessible

Googlebot activity timeline: Wednesday crawl peak following mass article publication — the bot reacts to new sitemap entries within hours

Python script for automated log analysis

For regular monitoring — weekly or monthly — automating the analysis with a script pays off quickly. Here's a baseline Python script using pandas that we run as the first pass before any deeper investigation:

import pandas as pd
import re

LOG_FILE = '/var/log/nginx/access.log'
# The size field is matched with \S+ because Apache logs "-" when no response body is sent
LOG_PATTERN = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+) \S+ "([^"]*)" "([^"]*)"'

rows = []
with open(LOG_FILE, 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = re.match(LOG_PATTERN, line)
        if m:
            ip, dt, method, url, status, referer, ua = m.groups()
            rows.append({
                'ip': ip, 'datetime': dt, 'method': method,
                'url': url, 'status': int(status), 'ua': ua
            })

df = pd.DataFrame(rows)

# Filter for Googlebot only
bot_df = df[df['ua'].str.contains('Googlebot', case=False, na=False)].copy()

print(f"Total Googlebot requests: {len(bot_df)}")
print(f"\n=== Status code distribution ===")
print(bot_df['status'].value_counts())

print(f"\n=== Top 20 URLs by crawl frequency ===")
print(bot_df['url'].value_counts().head(20))

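# Classify URLs: check the file extension on the path first, then query parameters.
# Substring checks like '.js' in url would also match '.json' or '/node.js-guide/'.
from urllib.parse import urlsplit

STATIC_EXTENSIONS = ('.css', '.js', '.png', '.jpg', '.svg', '.woff', '.woff2')

def classify_url(url):
    parts = urlsplit(url)
    if parts.path.lower().endswith(STATIC_EXTENSIONS):
        return 'static'
    elif parts.query or 'PHPSESSID' in url or 'filter' in parts.path:
        return 'technical'
    else:
        return 'content'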

bot_df['url_type'] = bot_df['url'].apply(classify_url)
print(f"\n=== URL type distribution (%) ===")
print(bot_df['url_type'].value_counts(normalize=True).mul(100).round(1))

# 404s for Googlebot
errors_404 = bot_df[bot_df['status'] == 404]['url'].value_counts().head(30)
print(f"\n=== Top 30 URLs returning 404 to Googlebot ===")
print(errors_404)

The script outputs: total Googlebot request count, status code breakdown, top 20 URLs by crawl frequency, URL type distribution (technical / static / content), and the top 404 URLs. This is the starting point for every log audit we run for clients.

Automate it: Add the script to cron for weekly execution and save output to timestamped CSV files. Comparing week-over-week CSVs shows whether your robots.txt changes are improving crawl efficiency or introducing new problems.
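The snapshot-and-compare step might be sketched as follows. It assumes the weekly result is a simple mapping of URL type to request count; the file naming scheme and the helper names are assumptions for illustration.

```python
import csv
from datetime import datetime
from pathlib import Path

def save_snapshot(counts, out_dir, now=None):
    """Write {url_type: requests} to a dated CSV file and return its path."""
    now = now or datetime.now()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"crawl-{now:%Y-%m-%d}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url_type", "requests"])
        for key, value in sorted(counts.items()):
            writer.writerow([key, value])
    return path

def load_snapshot(path):
    """Read a snapshot written by save_snapshot back into a dict."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["url_type"]: int(row["requests"]) for row in csv.DictReader(f)}

def share_change(previous, current, key="content"):
    """Change in the percentage share of `key` between two snapshots, in points."""
    def share(counts):
        total = sum(counts.values())
        return 100 * counts.get(key, 0) / total if total else 0.0
    return round(share(current) - share(previous), 1)
```

A positive result from `share_change` means a larger share of Googlebot requests reached content URLs than in the previous snapshot.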

Log file analysis is a core part of a thorough technical SEO audit. Logs give you data no other tool can: an unfiltered record of exactly what Googlebot sees on your server, when, and how often.

In Practice

The client was an analytics SaaS platform with public client dashboards: approximately 4,200 product pages plus an open-ended volume of user-generated report URLs accessible without authentication. Organic traffic to priority pages (pricing, features, documentation) was growing far slower than expected despite consistent content output.

GSC showed almost no indexation activity for core product pages. The client assumed a content quality issue. We pulled 45 days of access.log, a 420 MB archive, and loaded it into Screaming Frog Log File Analyser alongside a Semrush crawl of the indexed pages.

The data told a different story: 71% of all Googlebot requests were hitting URLs in the pattern /dashboard/client-id/report/ — public client dashboards, thousands of unique addresses carrying zero SEO value. Googlebot was methodically cycling through them while barely touching feature pages and landing pages.
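A pattern like this tends to surface quickly when Googlebot requests are grouped by the first path segment. The sketch below assumes the URLs have already been extracted from Googlebot log lines; `section_share` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlsplit

def section_share(urls, depth=1):
    """Group URLs by their first `depth` path segments and return percentage shares,
    largest first, as a list of (section, percent) tuples."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        key = "/" + "/".join(segments[:depth]) + "/" if segments else "/"
        counts[key] += 1
    total = sum(counts.values())
    return [(section, round(100 * n / total, 1)) for section, n in counts.most_common()]

sample_urls = [
    "/dashboard/a1/report/",
    "/dashboard/b2/report/",
    "/dashboard/c3/report/",
    "/pricing/",
]
print(section_share(sample_urls))
```

Raising `depth` to 2 or 3 narrows the view once a dominant section has been identified.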

After adding Disallow: /dashboard/ to robots.txt and verifying the shift via GSC Crawl Stats, indexation of core product pages grew 4x over 8 weeks. Semrush tracked a 38% visibility increase across target keywords.

Public user-generated URLs are a crawl budget trap typical of SaaS: they look like content to Googlebot but deliver no ranking value. If your platform generates publicly accessible per-user pages, checking the logs for this pattern is the first thing to do, before touching a single piece of content.

Frequently asked questions

Where are server log files located?

On Apache: /var/log/apache2/access.log and error.log. On Nginx: /var/log/nginx/access.log. In cPanel: Logs → Raw Access or File Manager → /logs/ directory. Windows IIS stores logs under C:\inetpub\logs\LogFiles\.

How often does Googlebot crawl a site?

Crawl frequency depends on domain authority and content update rate. High-authority sites are crawled multiple times daily; smaller sites may see Googlebot every few days. Exact numbers come only from log files: filter for Googlebot and count entries by date.

What is crawl budget and how do logs reveal it?

Crawl budget is the number of pages Googlebot will crawl in one session. Logs show exactly how much of that budget goes to technical URLs (filter params, session IDs, pagination) versus content pages. If over 30% of crawls hit junk URLs, block them via robots.txt. A noindex tag alone does not save crawl budget, because Googlebot still has to request the page to see the tag.

What tools are best for analyzing server logs for SEO?

For no-code analysis: GoAccess (free) or Screaming Frog Log File Analyser (free for up to 1,000 log lines, paid beyond that). For large logs and automation: Python with pandas, where you can filter by User-Agent, pivot by URL and date, and detect anomalies across time periods.

If you want to understand the full technical state of your site, read our step-by-step technical SEO audit guide.

