Server log files record every Googlebot request to your site — URL, status code, timestamp, frequency. Analyzing 30 days of logs reveals where crawl budget leaks, which pages the bot ignores, and where technical errors hide before they damage rankings.
What are access.log and error.log
Every web server maintains two core journals. access.log records every HTTP request the server receives — from real visitors, bots, monitoring systems. error.log captures failures: PHP errors, missing files, configuration issues.
For SEO, access.log is the primary source. It shows exactly when and how often Googlebot visited, which URLs it requested, and what status codes it received. Unlike Google Search Console — which is sampled, delayed, and filtered — logs are the raw, unedited record of every exchange between your server and the search robot.
Where to find log files
| Server | access.log location | error.log location |
|---|---|---|
| Apache (Linux) | /var/log/apache2/access.log | /var/log/apache2/error.log |
| Nginx (Linux) | /var/log/nginx/access.log | /var/log/nginx/error.log |
| cPanel | cPanel → Logs → Raw Access | cPanel → Logs → Error Log |
| Windows IIS | C:\inetpub\logs\LogFiles\ | Same folder as access logs |
| Plesk | Logs → GUI access | Logs → Error Log |
How to read a log line
A typical Combined Log Format entry (the default for both Apache and Nginx) looks like this:
66.249.66.1 - - [21/May/2026:14:23:11 +0200] "GET /blog/seo-audit/ HTTP/1.1" 200 48234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
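From this single line we learn: the IP 66.249.66.1 (which should be confirmed as Googlebot via a reverse DNS lookup, since the log line alone cannot prove it) made a GET request to /blog/seo-audit/ on May 21, 2026 at 14:23:11 (UTC+2); the server returned status 200 (success) with a response of 48,234 bytes (about 48 KB); the referrer is empty; and the User-Agent identifies itself as Googlebot 2.1.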
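The field-by-field reading above can be sketched in Python. This is a minimal example, assuming the standard Combined Log Format; the names `COMBINED_RE` and `parse_line` are illustrative and not part of any library.

```python
import re
from datetime import datetime

# Combined Log Format: ip, identity, user, [time], "request", status, size, "referer", "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    # The size field is "-" when the server sent no response body
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # Example time field: 21/May/2026:14:23:11 +0200
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

sample = ('66.249.66.1 - - [21/May/2026:14:23:11 +0200] '
          '"GET /blog/seo-audit/ HTTP/1.1" 200 48234 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_line(sample)
print(entry["ip"], entry["url"], entry["status"], entry["size"])
```

Lines that do not follow the format (for example, malformed requests) return `None` and can simply be skipped.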
What logs reveal about Googlebot
Filtering access.log by Googlebot's User-Agent string surfaces four key SEO data groups:
- Crawl frequency: how many times Googlebot visited per day, week, or month. Based on our log analysis across 50+ client sites, high-quality new articles on established domains get crawled within 2–6 hours of publication.
- URL crawl distribution: which pages the bot visits most frequently and which it barely touches. This is a direct proxy for Google's own assessment of your site's priority pages.
- Status codes for the bot: how many requests returned 200, 301, 404, or 500. GSC shows only a subset of this data, with delays. Logs show everything, in real time.
- Crawl timing: which hours Googlebot is most active on your site. Useful for scheduling deployments, database maintenance, and server-heavy operations.
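Crawl frequency and crawl timing from the list above reduce to counting timestamps. Here is a rough sketch, assuming you have already extracted the time field of each Googlebot line; `crawl_counts` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

def crawl_counts(timestamps):
    """Count requests per calendar day and per hour of day.

    `timestamps` are raw log time fields such as '21/May/2026:14:23:11 +0200'.
    Hours are reported in the timezone offset written in the log.
    """
    per_day, per_hour = Counter(), Counter()
    for raw in timestamps:
        ts = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
        per_day[ts.date().isoformat()] += 1
        per_hour[ts.hour] += 1
    return per_day, per_hour

# Hypothetical time fields taken from Googlebot lines
stamps = [
    "21/May/2026:14:23:11 +0200",
    "21/May/2026:14:40:02 +0200",
    "21/May/2026:03:05:47 +0200",
    "22/May/2026:14:01:30 +0200",
]
per_day, per_hour = crawl_counts(stamps)
print(per_day.most_common())   # busiest days first
print(per_hour.most_common())  # busiest hours first
```

The hours with the lowest counts are the natural candidates for deployments and maintenance windows.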
The most common surprise we find in client logs: the site owner assumes Googlebot is crawling product pages and articles, but 60–70% of bot requests are actually hitting static assets (CSS, JS, fonts) and URL parameter variations. The bot sees far fewer content pages than anyone realised.
Googlebot User-Agent strings to filter by
| Bot | User-Agent string | What it crawls |
|---|---|---|
| Googlebot | Mozilla/5.0 (compatible; Googlebot/2.1; ...) | HTML pages — the main crawler |
| Googlebot-Image | Googlebot-Image/1.0 | Images for Google Images |
| Googlebot Smartphone | ...Mobile Safari/537.36 (compatible; Googlebot/2.1; ...) | Mobile version of your site |
| AdsBot-Google | AdsBot-Google (+http://...) | Landing pages for ad quality checks |
| Google-InspectionTool | Google-InspectionTool/1.0 | URL Inspection in GSC |
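Any client can send these User-Agent strings, so a match only shows that a request claims to be Googlebot. A common check is a reverse DNS lookup followed by a forward lookup. The sketch below assumes the hostname suffixes `googlebot.com` and `google.com` and handles IPv4 only; the function name and the injectable `reverse` and `forward` parameters are illustrative choices that make the logic testable without network access.

```python
import socket

# Assumed hostname suffixes for Google's crawlers
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    try:
        host = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:                    # no PTR record or lookup failure
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(host)[2]   # IPv4 addresses for that hostname
    except OSError:
        return False
    # The hostname must resolve back to the original IP
    return ip in addresses
```

Calling `is_verified_googlebot("66.249.66.1")` with the defaults performs live DNS lookups, so for large logs it is worth running the check once per unique IP and caching the result.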
Log analysis tools
Tool choice depends on log file size and the depth of analysis you need:
| Tool | Type | Log size | Price | Best for |
|---|---|---|---|---|
| GoAccess | CLI / web UI | up to 1 GB | Free | Fast overview, real-time mode |
| Screaming Frog Log File Analyser | Desktop app | up to 1 GB | Free / $259/yr | SEO-focused reports, bot filters |
| Python + pandas | Code | Unlimited | Free | Large logs, automation, custom metrics |
| ELK Stack | Server-side | Terabytes | Free / Enterprise | Enterprise sites, dashboards |
| Splunk | Enterprise | Unlimited | $150+/mo | ML anomaly detection, alerting |
GoAccess: quick start
GoAccess is the fastest way to see a general crawl picture without writing code. Run it directly on your server:
# Install on Ubuntu/Debian
sudo apt-get install goaccess
# Analyze and generate an HTML report
zcat /var/log/nginx/access.log.gz | \
grep -i "googlebot" | \
goaccess - -o /tmp/googlebot-report.html \
--log-format=COMBINED --no-global-config
Screaming Frog Log File Analyser
For teams without Python experience, this is the most accessible option. Load the log, set a User-Agent filter for "Googlebot," and get prebuilt reports: top URLs by crawl count, status code breakdown, activity charts by date. The free version handles up to 1,000 log lines — enough to verify the workflow before committing to a full analysis.
Case study: 40% of crawl budget on technical URLs
One of our clients ran an e-commerce store with approximately 8,000 SKUs in the home goods category. Organic traffic had dropped 18% over a quarter despite consistent content updates and weekly product page publications. GSC showed no critical indexation errors.
We pulled access.log for 30 days (compressed archive: ~380 MB), isolated Googlebot requests, and classified URLs by type. The breakdown:
- 40% of Googlebot requests were filter parameter URLs: ?color=red&size=L&sort=price
- 15% of requests were session parameter URLs: ?PHPSESSID=abc123...
- 12% of requests were pagination duplicates: /catalog/page-1/ (even though canonical already pointed to /catalog/)
- 33% of requests were actual product and category pages that should have been indexed
Only a third of the crawl budget reached valuable URLs. New product pages published weekly were queuing up 2–3 weeks before Googlebot saw them, instead of the 2–3 days expected.
What we did
- Added robots.txt Disallow rules for all filter and session parameters: Disallow: /*?*color=, Disallow: /*?*PHPSESSID=.
- Set a proper canonical on all pagination pages pointing to the first page of each category (Google dropped support for rel=next/prev years ago).
- Audited sitemap.xml — removed technical URLs that had snuck in via a bug in the sitemap generator.
Six weeks later, the useful crawl share rose from 33% to 74%. New product pages started appearing in the index within 2–5 days. Organic traffic recovered and then rose 11% above the previous baseline.
What to look for in logs: a checklist
1. Googlebot 404 errors
Every 404 response to Googlebot is a crawl budget waste on a page that no longer exists. Common causes: deleted pages without redirects, stale URLs in sitemap.xml, internal links pointing to removed content.
# Find all Googlebot 404s, sorted by frequency
grep -i "googlebot" access.log | grep '" 404 ' | \
awk '{print $7}' | sort | uniq -c | sort -rn | head -50
2. 5xx server errors
A 500, 502, or 503 response to Googlebot signals the server failed to process the request. When the bot systematically receives 5xx responses, it marks those pages as unreliable and reduces crawl frequency over time. We've seen even 2–3 days of server instability during peak load result in weeks of suppressed crawl activity.
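To spot such instability early, one option is to track the daily share of 5xx responses served to Googlebot. This is an illustrative sketch; `daily_5xx_rate` and the sample records are hypothetical, and the input is assumed to be (day, status) pairs already filtered to Googlebot.

```python
from collections import defaultdict

def daily_5xx_rate(records):
    """Return {day: percentage of requests answered with a 5xx status}.

    `records` is an iterable of (day, status) pairs, e.g. ("2026-05-20", 503).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for day, status in records:
        totals[day] += 1
        if 500 <= status <= 599:
            errors[day] += 1
    return {day: round(100 * errors[day] / totals[day], 1)
            for day in sorted(totals)}

# Hypothetical Googlebot records
records = [
    ("2026-05-20", 200), ("2026-05-20", 503),
    ("2026-05-21", 200), ("2026-05-21", 200),
]
print(daily_5xx_rate(records))
```

A sudden jump in the daily percentage is a signal to check error.log for the same time window.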
3. Redirect chains
Each redirect is an extra request and a crawl time cost. If Googlebot routinely follows chains (A→B→C), update the source links to point directly to the final destination URL.
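Combined Log Format does not record redirect targets, so finding chains requires a source-to-target map from elsewhere, for example a crawler export or your server's redirect rules. Assuming such a map is available as a dictionary, a sketch like the following lists every source that needs more than one hop; the function names and sample URLs are hypothetical.

```python
def resolve_chain(start, redirects, max_hops=10):
    """Follow redirects from `start` and return the full list of URLs visited."""
    chain = [start]
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in chain[:-1]:   # redirect loop: stop after recording it
            break
    return chain

def find_chains(redirects):
    """Return {source: chain} for sources that take more than one hop."""
    chains = {}
    for source in redirects:
        chain = resolve_chain(source, redirects)
        if len(chain) > 2:         # source, intermediate(s), final destination
            chains[source] = chain
    return chains

# Hypothetical redirect map: /a -> /b -> /c is a chain, /x -> /y is a single hop
redirects = {"/a": "/b", "/b": "/c", "/x": "/y"}
print(find_chains(redirects))
```

For each chain found, the fix is to point the source URL, and any internal links to it, at the last element of the chain.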
4. Priority pages with rare crawling
Cross-reference your sitemap.xml priority URLs against their log frequency over 30 days. A strategic page — category page, landing page, cornerstone article — that hasn't appeared in Googlebot logs for 14+ days needs a health check: confirm it's accessible via internal links, has no noindex tag, and is included in your sitemap.
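This cross-reference can be automated once you have the last Googlebot hit per URL. A minimal sketch, assuming the priority URL list and the last-crawled dates have already been extracted; the function name and the sample data are hypothetical.

```python
from datetime import date, timedelta

def stale_priority_urls(priority_urls, last_crawled, today, max_age_days=14):
    """Return (url, last_seen) for priority URLs not crawled in `max_age_days` or more.

    `last_crawled` maps URL -> date of the most recent Googlebot request.
    URLs missing from the logs entirely are reported with last_seen=None.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for url in priority_urls:
        seen = last_crawled.get(url)
        if seen is None or seen <= cutoff:
            stale.append((url, seen))
    return stale

# Hypothetical data
priority = ["/pricing/", "/features/", "/docs/"]
last_crawled = {"/pricing/": date(2026, 5, 20), "/features/": date(2026, 5, 1)}
for url, seen in stale_priority_urls(priority, last_crawled, today=date(2026, 5, 21)):
    print(url, "last crawled:", seen)
```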
Crawl budget through logs: measure and optimize
Crawl budget is the number of pages Googlebot can and wants to crawl on your site in a given period. It's not static: it adjusts based on domain authority, server response time, and how "rewarding" the bot finds your content.
From logs you can measure both components of crawl budget described in Google's official crawl budget documentation:
- Crawl rate limit: the maximum speed at which the bot can crawl without overloading your server. Logs reveal how many simultaneous requests Googlebot makes and how quickly your server responds.
- Crawl demand: how "interesting" Google considers your site. If crawl frequency increases after a content quality improvement, that's a measurable positive signal visible in the logs.
A detailed overview of all GSC tools and reports for SEO analysis is available in our complete Google Search Console guide.
Crawl efficiency formula
Crawl efficiency = (Googlebot requests to content URLs) / (Total Googlebot requests) × 100%
Target: above 60%. Below this threshold, budget is leaking to junk URLs.
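As a worked example of the formula, the sketch below applies it to the case-study shares from earlier in this article (33% before the fix, 74% after). The function name is illustrative.

```python
def crawl_efficiency(content_requests, total_requests):
    """Crawl efficiency in percent: content requests / total requests * 100."""
    if total_requests == 0:
        return 0.0
    return round(100 * content_requests / total_requests, 1)

TARGET = 60.0  # threshold stated in this article

before = crawl_efficiency(33, 100)   # case study, before the fix
after = crawl_efficiency(74, 100)    # case study, six weeks later
print(before, before > TARGET)
print(after, after > TARGET)
```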
What to block from crawling
- Filter and sort parameters: ?sort=price&order=asc
- Session identifiers: ?PHPSESSID=, ?sessionid=
- UTM parameters: ?utm_source=, ?utm_medium=
- www / non-www duplicates: if no hard 301 redirect is in place
- Internal site search pages: /search?q=
- Login and account pages: /login, /account/
- Staging and test URLs: if accidentally publicly accessible
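Before deploying Disallow rules for patterns like these, it can help to test them against URLs taken from your logs. Python's built-in `urllib.robotparser` does plain prefix matching, so the sketch below uses a small hand-written matcher for the `*` and `$` wildcards instead. It is a simplification: it ignores Allow rules and rule precedence, and the function names and sample rules are illustrative.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the end.
    Without '$', the pattern behaves as a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(url_path, disallow_patterns):
    """True if any Disallow pattern matches the URL path (including query string)."""
    return any(robots_pattern_to_regex(p).match(url_path)
               for p in disallow_patterns)

# Rules in the style used in the case study, plus an internal search prefix
rules = ["/*?*color=", "/*?*PHPSESSID=", "/search"]
for path in ["/catalog/?color=red&size=L", "/search?q=lamp", "/catalog/lamps/"]:
    print(path, "blocked" if is_blocked(path, rules) else "allowed")
```

Running every distinct URL from the Googlebot log through `is_blocked` shows what share of current crawl activity a rule set would remove, and whether any content URL is caught by accident.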
Python script for automated log analysis
For regular monitoring — weekly or monthly — automating the analysis with a script pays off quickly. Here's a baseline Python script using pandas that we run as the first pass before any deeper investigation:
import re
from urllib.parse import urlsplit

import pandas as pd

LOG_FILE = '/var/log/nginx/access.log'
# Combined Log Format; the size field may be "-" when no body is sent
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+) \S+ "([^"]*)" "([^"]*)"'
)
COLUMNS = ['ip', 'datetime', 'method', 'url', 'status', 'referer', 'ua']

rows = []
with open(LOG_FILE, 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = LOG_PATTERN.match(line)
        if m:
            ip, dt, method, url, status, referer, ua = m.groups()
            rows.append((ip, dt, method, url, int(status), referer, ua))

# Passing columns explicitly keeps the frame usable even if no line matched
df = pd.DataFrame(rows, columns=COLUMNS)

# Filter for Googlebot only (User-Agent match; this does not verify the IP)
bot_df = df[df['ua'].str.contains('Googlebot', case=False, na=False)].copy()
print(f"Total Googlebot requests: {len(bot_df)}")

print("\n=== Status code distribution ===")
print(bot_df['status'].value_counts())

print("\n=== Top 20 URLs by crawl frequency ===")
print(bot_df['url'].value_counts().head(20))

# Classify URLs
STATIC_EXTENSIONS = ('.css', '.js', '.png', '.jpg', '.svg', '.woff', '.woff2')
TECHNICAL_MARKERS = ('PHPSESSID', 'utm_', 'sort=', 'filter')

def classify_url(url):
    parts = urlsplit(url)
    # Check the path only, so "/app.js?v=3" counts as static, not technical
    if parts.path.lower().endswith(STATIC_EXTENSIONS):
        return 'static'
    if parts.query or any(marker in url for marker in TECHNICAL_MARKERS):
        return 'technical'
    return 'content'

bot_df['url_type'] = bot_df['url'].apply(classify_url)
print("\n=== URL type distribution (%) ===")
print(bot_df['url_type'].value_counts(normalize=True).mul(100).round(1))

# 404s for Googlebot
errors_404 = bot_df[bot_df['status'] == 404]['url'].value_counts().head(30)
print("\n=== Top 30 URLs returning 404 to Googlebot ===")
print(errors_404)
The script outputs: total Googlebot request count, status code breakdown, top 20 URLs by crawl frequency, URL type distribution (technical / static / content), and the top 404 URLs. This is the starting point for every log audit we run for clients.
Log file analysis is a core part of a thorough technical SEO audit. When working on website promotion, logs give you data no other tool can: an unfiltered record of exactly what Googlebot sees on your server, when, and how often.
In Practice
An analytics SaaS platform with public client dashboards — approximately 4,200 product pages plus an open-ended volume of user-generated report URLs accessible without authentication. Organic traffic to priority pages (pricing, features, documentation) was growing far slower than expected despite consistent content output.
GSC showed almost no indexation activity for core product pages. The client assumed a content quality issue. We pulled 45 days of access.log, a 420 MB archive, and loaded it into Screaming Frog Log File Analyser alongside a Semrush crawl of the indexed pages.
The data told a different story: 71% of all Googlebot requests were hitting URLs in the pattern /dashboard/client-id/report/ — public client dashboards, thousands of unique addresses carrying zero SEO value. Googlebot was methodically cycling through them while barely touching feature pages and landing pages.
After adding Disallow: /dashboard/ to robots.txt and verifying the shift via GSC Crawl Stats, indexation of core product pages grew 4x over 8 weeks. Semrush tracked a 38% visibility increase across target keywords.
Public user-generated URLs are a crawl budget trap unique to SaaS: they look like content to Googlebot but deliver no ranking value. If your platform generates publicly accessible per-user pages, checking the logs for this pattern is the first thing to do — before touching a single piece of content.
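A quick way to check for the pattern described above is to group Googlebot requests by their first path segment and see which section dominates. The sketch below is illustrative; `section_share` and the sample URLs are hypothetical.

```python
from collections import Counter

def section_share(urls):
    """Return [(section, percent)] by first path segment, largest share first."""
    counts = Counter()
    for url in urls:
        path = url.split("?", 1)[0]                 # drop the query string
        segments = [s for s in path.split("/") if s]
        section = "/" + segments[0] + "/" if segments else "/"
        counts[section] += 1
    total = sum(counts.values())
    return [(section, round(100 * n / total, 1))
            for section, n in counts.most_common()]

# Hypothetical Googlebot URLs
urls = [
    "/dashboard/1/report/",
    "/dashboard/2/report/",
    "/dashboard/3/report/?range=30d",
    "/pricing/",
]
for section, pct in section_share(urls):
    print(f"{section}: {pct}%")
```

If a single section with no search value takes most of the share, that section is the first candidate for a Disallow rule.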
Frequently asked questions
Where are server log files located?
On Apache: /var/log/apache2/access.log and error.log. On Nginx: /var/log/nginx/access.log. In cPanel: Logs → Raw Access or File Manager → /logs/ directory. Windows IIS stores logs under C:\inetpub\logs\LogFiles\.
How often does Googlebot crawl a site?
Crawl frequency depends on domain authority and content update rate. High-authority sites are crawled multiple times daily; smaller sites may see Googlebot every few days. Exact numbers come only from log files: filter for Googlebot and count entries by date.
What is crawl budget and how do logs reveal it?
Crawl budget is the number of pages Googlebot can and wants to crawl in a given period. Logs show exactly how much of that budget goes to technical URLs (filter params, session IDs, pagination) versus content pages. If over 30% of crawls hit junk URLs, block them via robots.txt. A noindex tag alone does not save crawl budget, because Googlebot still has to request the page to see the tag.
What tools are best for analyzing server logs for SEO?
For no-code analysis: GoAccess (free) or Screaming Frog Log File Analyser (free up to 1,000 log lines, paid beyond that). For large logs and automation: Python with pandas, which lets you filter by User-Agent, pivot by URL and date, and detect anomalies across time periods.
If you want to understand the full technical state of your site, read our step-by-step technical SEO audit guide.