Robots.txt and Sitemap.xml: Setup and Common Mistakes

Publication date: 05.06.2026 15:35

Robots.txt tells Google which pages to crawl; sitemap.xml tells it which pages matter. Getting both files right means stable indexing with no blind spots and no wasted crawl budget on admin or checkout URLs.


What is robots.txt and how it works

Robots.txt is a plain text file placed in the root directory of your website that instructs search engine bots which URLs they are allowed to crawl and which they should skip. Before crawling any page, Googlebot fetches yourdomain.com/robots.txt and reads the rules. This convention is called the Robots Exclusion Protocol and is respected by all major search engines.

Two concepts that are frequently confused:

  • Crawling — whether a bot can visit a URL. Robots.txt controls this.
  • Indexing — whether a page appears in search results. Controlled by the noindex meta tag or X-Robots-Tag HTTP header.
Robots.txt blocks crawling, but it does not guarantee removal from the index. If external links point to a blocked page, Google may still index it — just without seeing the content.

The file lives at: https://yourdomain.com/robots.txt. It must be in the root of the domain, not a subdirectory. Subdomains have their own separate robots.txt files.
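
To see these rules from a crawler's point of view, Python's standard urllib.robotparser module can evaluate a rule set against a URL. The sketch below parses an inline rule set instead of fetching a live file; the domain, paths, and the helper name crawl_allowed are placeholders. The module applies the first matching rule in file order, which is simpler than Googlebot's matching, so treat the result as a rough check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; for a live site, use set_url(...) and read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""

def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the user agent may crawl the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

blocked = crawl_allowed(ROBOTS_TXT, "Googlebot", "https://yourdomain.com/admin/login")
open_page = crawl_allowed(ROBOTS_TXT, "Googlebot", "https://yourdomain.com/blog/post/")
print(blocked, open_page)  # False True
```

A False result only means the URL will not be crawled. As noted above, it says nothing about whether the URL can still appear in the index.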

[Diagram: how Googlebot reads robots.txt and decides whether to crawl a URL. Allow: the URL is crawled and can be added to the index. Disallow: the URL is skipped and not crawled.]

Robots.txt directives: full reference

The file is made up of blocks — each block starts with a User-agent directive and contains one or more rules. Here is the complete list of directives:

Directive | What it does | Example
User-agent | Specifies which bot the rules apply to. * means all bots. | User-agent: Googlebot
Disallow | Blocks crawling of a URL or directory | Disallow: /admin/
Allow | Permits crawling of a specific URL inside a blocked directory | Allow: /admin/public/
Sitemap | Specifies the absolute URL of the sitemap.xml file | Sitemap: https://site.com/sitemap.xml
Crawl-delay | Delay between bot requests (ignored by Googlebot) | Crawl-delay: 10

Example of a typical robots.txt for an e-commerce store:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /thank-you/
Disallow: /account/
Disallow: /search/
Allow: /

User-agent: Googlebot-Image
Disallow: /private-images/

Sitemap: https://yourdomain.com/sitemap.xml
Rule priority: when Disallow and Allow conflict, the more specific rule wins. If both rules have equal length, Allow takes priority. This is Googlebot's behavior — other bots may interpret rules differently.
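
The priority rule can be sketched in a few lines of Python. This is an illustration of longest-match precedence as described above, not Googlebot's actual code: it handles plain path prefixes only (no wildcards), and the function name is_allowed is made up for the example.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest-match precedence: the matching rule with the longest path wins;
    on a tie, Allow beats Disallow. No matching rule means the path is allowed."""
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

RULES = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]

print(is_allowed(RULES, "/admin/settings"))     # False
print(is_allowed(RULES, "/admin/public/help"))  # True
print(is_allowed(RULES, "/products/"))          # True
```

Because the comparison uses rule length rather than file order, the result is the same whichever line comes first in the file.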

Syntax details that frequently cause confusion:

  • Disallow: /page — blocks /page, /page/, /page-about (anything starting with /page)
  • Disallow: /page/ — blocks only the /page/ directory and everything inside it
  • Disallow: (empty) — allows everything
  • Disallow: / — blocks the entire site
  • The * wildcard in a path matches any sequence of characters: Disallow: /page/*?sort=
  • The $ character marks the end of a URL: Disallow: /*.pdf$ — only .pdf files
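
The matching behaviour in the list above can be reproduced by translating a pattern into a regular expression. This is a minimal sketch under the syntax described here (prefix match, * wildcard, trailing $ anchor); the helper names pattern_to_regex and matches are illustrative.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.
    '*' matches any run of characters; a trailing '$' anchors the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape everything literally, then turn the escaped '*' back into '.*'.
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    if not pattern:
        return False  # an empty Disallow: value blocks nothing
    # re.match anchors at the start of the path, which mirrors prefix matching.
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/page", "/page-about"))                    # True
print(matches("/page/", "/page-about"))                   # False
print(matches("/page/*?sort=", "/page/shoes?sort=price")) # True
print(matches("/*.pdf$", "/docs/file.pdf"))               # True
print(matches("/*.pdf$", "/docs/file.pdf?download=1"))    # False
```

The last case shows why the $ anchor matters: without it, /*.pdf would also match PDF URLs that carry query parameters.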

What is sitemap.xml and its types

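Sitemap.xml is an XML file that lists the URLs on your site along with optional metadata for search engines: the date of last modification (lastmod), update frequency (changefreq), and priority. Google states that it ignores changefreq and priority, and uses lastmod only when it is consistently accurate. The main purpose of the file is to help bots discover all important pages, especially those with no internal links pointing to them.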

Structure of a basic sitemap.xml:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/page/</loc>
    <lastmod>2026-05-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
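
A file like this can be generated with Python's standard xml.etree.ElementTree module, so that escaping and encoding are handled by the serializer. The sketch below assumes the page list arrives as (URL, last-modified date) pairs; the PAGES rows are hypothetical and stand in for a database query.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages: list[tuple[str, date]]) -> bytes:
    """Serialize (absolute URL, last-modified date) pairs into a urlset document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # '&' and '<' are escaped on output
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    # xml_declaration=True adds the <?xml ...?> header; output is UTF-8 without a BOM.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

# Hypothetical rows standing in for a database query.
PAGES = [
    ("https://yourdomain.com/page/", date(2026, 5, 1)),
    ("https://yourdomain.com/catalog/?a=1&b=2", date(2026, 4, 20)),
]
xml_bytes = build_sitemap(PAGES)
print(xml_bytes.decode("utf-8"))
```

Taking lastmod from stored data rather than the generation time keeps the dates meaningful, and building the tree through the library avoids hand-written escaping mistakes.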

Sitemap types and when to use each:

Type | Purpose | When to use
Standard (urlset) | List of website page URLs | Always
Sitemap Index | Index file pointing to multiple sitemaps | Site with >50,000 URLs or multiple content types
Image Sitemap | Image URLs attached to each page | E-commerce stores, photography portfolios
Video Sitemap | Video metadata (duration, thumbnail) | Sites hosting their own video content
News Sitemap | Articles published within the last 2 days | News publishers (Google News)
Hreflang Sitemap | Language versions of pages | Multilingual websites

Example Sitemap Index for a large site:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-products.xml</loc>
    <lastmod>2026-05-01</lastmod>
  </sitemap>
</sitemapindex>
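
For a large catalog, the split into child files and the index itself can be produced together. The sketch below assumes the 50,000-URL per-file limit cited in this article; the product URLs and child file names are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit cited in this article

def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_FILE) -> list[list[str]]:
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(sitemap_urls: list[str], lastmod: str) -> bytes:
    """Build a sitemapindex document that points at each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = loc
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)

# 120,000 hypothetical product URLs need three child files.
urls = [f"https://yourdomain.com/product/{n}/" for n in range(120_000)]
chunks = chunk_urls(urls)
children = [f"https://yourdomain.com/sitemap-products-{i + 1}.xml" for i in range(len(chunks))]
index_xml = build_index(children, "2026-05-01")
print(len(chunks), [len(c) for c in chunks])  # 3 [50000, 50000, 20000]
```

Each chunk would then be written out as its own urlset file under the matching child URL.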
[Diagram: sitemap.xml types hierarchy. A Sitemap Index (sitemap.xml) points to child sitemaps: Pages (sitemap-pages.xml), Products (sitemap-products.xml), Images (sitemap-images.xml), Video (sitemap-video.xml), News (sitemap-news.xml).]
Key Google limits: 50,000 URLs per sitemap file; 50 MB per file uncompressed; up to 50,000 sitemaps per index file; News Sitemap covers articles from the last 2 days only; Image Sitemap allows up to 1,000 images per <url>.

Common robots.txt mistakes

Based on our technical SEO audit practice, robots.txt errors appear in roughly one out of every three client projects. Most are critical: the site is either invisible to Google or wasting crawl budget on admin pages and filtered URLs.

All critical checks are listed in our step-by-step technical SEO audit guide.

Real case: a furniture e-commerce client came to us after a platform migration — organic traffic had dropped 80% in one month. We found that the developer had left Disallow: / under User-agent: * in the live site's robots.txt — a leftover from staging. The result was three weeks of complete crawl blockage by Google.

Mistake | Example | Consequence | Fix
Entire site blocked | Disallow: / | Googlebot does not crawl any page | Change to Allow: / or remove the line
CSS and JS blocked | Disallow: /assets/ | Google cannot render pages, sees raw HTML | Remove the restriction; CSS/JS must be open
Key sections blocked | Disallow: /blog/ | Entire blog drops from the index | Open the section or specify exact subdirectories
Allow/Disallow conflict | Disallow: /cat/ + Allow: /cat/ | Unpredictable bot behavior | Remove the duplicate, keep the more specific rule
No Sitemap directive | (no Sitemap: line) | Bots do not locate sitemap automatically | Add Sitemap: https://site.com/sitemap.xml
Duplicate User-agent blocks | Two User-agent: * blocks | Unpredictable processing, second block may be ignored | Merge into a single block
Parameter URLs not blocked | Not blocking /?sort=, /?page= | Bots crawl thousands of duplicate pages | Disallow: /*?sort=, Disallow: /*?page=

One specific mistake common on WordPress sites: blocking /wp-admin/ without preserving access to admin-ajax.php, which can stop Googlebot from loading front-end content that is fetched through AJAX during rendering:

# WRONG:
Disallow: /wp-admin/

# CORRECT:
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

How to set up sitemap.xml

There are two approaches: automatic generation via a CMS plugin and manual creation. The plugin approach works for most sites; manual setup is for custom structures or large sites with complex logic.

Automatic generation (CMS plugins):

  1. WordPress: Yoast SEO or Rank Math → Sitemap section → enable and select content types
  2. OpenCart: Google Sitemap extension (free in the marketplace) or an OCmod SEO module
  3. PrestaShop: Google Sitemap XML module → configure automatic updates via cron
  4. Bitrix: built-in Site Map module → SEO → Sitemap → Generate
  5. Custom site: a PHP or Python script that queries the database and builds the XML file

When manual setup is necessary:

  • You need to include only specific categories or pages, not the entire site
  • You need image or video sitemaps — most plugins generate these incompletely
  • You need accurate lastmod dates from the database, not the current date

What to include and exclude:

  • Include: all indexed commercial pages, category pages, blog articles, landing pages
  • Exclude: pages with noindex, duplicate pages, pagination pages (in most cases), service URLs (admin, cart, thank-you), URLs blocked in robots.txt
Quick validity test: open the sitemap URL in a browser. Chrome should display formatted XML. If you see a parsing error or a blank screen, there is a syntax error in the file. Check encoding (UTF-8 without BOM) and special character escaping (& → &amp;, < → &lt;).
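
The same validity test can be scripted. The sketch below is a minimal check on raw file bytes, assuming the two failure modes named above (a UTF-8 BOM and unescaped special characters); the function name check_sitemap_bytes and the sample documents are illustrative.

```python
import codecs
import xml.etree.ElementTree as ET

def check_sitemap_bytes(data: bytes) -> list[str]:
    """Return a list of problems found in raw sitemap bytes (empty list = none found)."""
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        data = data[len(codecs.BOM_UTF8):]  # keep checking the rest of the file
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        problems.append(f"XML parse error: {exc}")
    return problems

good = b'<?xml version="1.0" encoding="UTF-8"?><urlset><url><loc>https://yourdomain.com/?a=1&amp;b=2</loc></url></urlset>'
bad = b'<urlset><url><loc>https://yourdomain.com/?a=1&b=2</loc></url></urlset>'

print(check_sitemap_bytes(good))                    # []
print(check_sitemap_bytes(codecs.BOM_UTF8 + good))  # ['file starts with a UTF-8 BOM']
print(check_sitemap_bytes(bad))                     # one parse error caused by the raw '&'
```

This only proves the file is well-formed XML. It does not validate the document against the sitemap schema.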

The two files work as a pair: robots.txt defines what Google can crawl; sitemap.xml tells Google where the priority pages are. A correct configuration requires three levels of consistency between them.

[Flowchart: how robots.txt and sitemap.xml interact during crawling and indexing. Googlebot starts a crawl, robots.txt sets the rules, sitemap.xml supplies the URL list, and eligible pages reach the Google index.]

Three consistency rules:

  1. URLs in sitemap.xml must be open in robots.txt
  2. Sitemap.xml must not include pages tagged with noindex
  3. Robots.txt must include a Sitemap: line with the absolute URL

Where to reference your sitemap:

  1. In robots.txt — add Sitemap: https://yourdomain.com/sitemap.xml at the end. Google reads this automatically every time it fetches robots.txt
  2. In Google Search Console — Sitemaps → Submit → enter the sitemap URL. This gives direct feedback: you see the status, the number of URLs, and the last read date
  3. Standard location: some crawlers probe /sitemap.xml on their own, but Google does not document this as a discovery method, so treat it as a fallback and not a replacement for options 1 and 2

Critical: every URL in sitemap.xml must be crawlable under robots.txt. If a page is listed in the sitemap but blocked in robots.txt, Google will flag a conflict and may either skip the URL entirely or index it without reading the content (if external links exist).
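
This consistency check is easy to automate. The sketch below lists sitemap URLs that a given robots.txt blocks, using only the Python standard library. The sample rules and URLs are hypothetical, and urllib.robotparser uses first-match rule ordering, so files that rely on Allow overrides or wildcards may need a stricter matcher.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def blocked_sitemap_urls(robots_txt: str, sitemap_xml: bytes,
                         user_agent: str = "Googlebot") -> list[str]:
    """List sitemap URLs that the given robots.txt rules block for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    return [loc for loc in locs if not parser.can_fetch(user_agent, loc)]

# Hypothetical inputs: one listed URL falls under a Disallow rule.
ROBOTS = "User-agent: *\nDisallow: /listings/\n"
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/listings/flat-1/</loc></url>
  <url><loc>https://yourdomain.com/about/</loc></url>
</urlset>"""

conflicts = blocked_sitemap_urls(ROBOTS, SITEMAP)
print(conflicts)  # ['https://yourdomain.com/listings/flat-1/']
```

An empty list is the target state: every URL the sitemap recommends is one that bots are allowed to fetch.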

Google does not guarantee that every URL in your sitemap will be indexed: a sitemap is a recommendation, not a directive. Without a sitemap, however, bots discover pages only through links and can miss isolated or deeply nested URLs entirely.

How to test robots.txt and sitemap

There are several reliable testing methods. At SEO-Factory we use them in order of increasing depth — from a quick manual review to automated discovery through a full technical SEO audit.

Testing robots.txt:

  1. Manually in the browser: open yourdomain.com/robots.txt — check for critical Disallow rules and the Sitemap line
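  2. Google Search Console → Settings → robots.txt: the robots.txt report shows which robots.txt files Google found, when it last fetched them, and any parsing warnings or errors. To check whether Googlebot can access a specific page, run that URL through the URL Inspection tool
  3. Official documentation: Google's robots.txt reference describes how Googlebot interprets each directive; Google retired the standalone robots.txt Tester in late 2023 in favor of the report above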
  4. Screaming Frog: Configuration → Robots.txt — shows blocked pages and visualizes rule conflicts

Learn more about the tool's capabilities in our article on Google Search Console for SEO.

Testing sitemap.xml:

  1. Open the URL in the browser — Chrome renders XML in a formatted view; parse errors are immediately visible
  2. Google Search Console → Sitemaps: submit the URL and check "Discovered URLs" vs "Indexed" after 24–48 hours
  3. Ahrefs → Site Audit → Data Explorer → Sitemap: shows how many URLs are in the sitemap, how many are indexed, and whether there are any conflicts
Warning sign: if the number of submitted URLs in GSC significantly exceeds the number of indexed URLs (e.g. 5,000 submitted vs 800 indexed), you likely have duplicate pages in the sitemap, pages with noindex, or URLs blocked in robots.txt. Analyzing this gap is the first step in any indexing audit.

Practical checklist

After any significant site update — redesign, platform migration, new CMS — run through this checklist. According to Search Engine Land, most critical crawl issues surface immediately after technical changes.

Robots.txt — checks:

  1. File is accessible — opens at /robots.txt with HTTP 200
  2. No Disallow: / for User-agent: * — the single most critical error to check
  3. CSS and JS files are open — Google must be able to render pages
  4. Service URLs are blocked — /admin/, /cart/, /checkout/, /account/, /search/
  5. Parametric URLs are blocked: ?sort=, ?color=, ?page= (if they create duplicates)
  6. Sitemap: line is present — with an absolute URL, not a relative path
  7. No duplicate User-agent blocks — one block per bot
  8. Key URLs tested in GSC: 3–5 pages checked with URL Inspection
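
Check 2 in the list above is the easiest one to automate, for example as a step in a deployment pipeline. The sketch below scans robots.txt text for a bare Disallow: / inside a User-agent: * group. It is a simplified reader written for this article, not a full parser, and it does not weigh Allow overrides.

```python
def has_sitewide_block(robots_txt: str) -> bool:
    """Return True if a 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents: list[str] = []  # user agents of the group being read
    in_rules = False        # True once the current group has at least one rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False

staging_leftover = "User-agent: *\nDisallow: /\n"
normal = "User-agent: *\nDisallow: /admin/\nAllow: /\n\nSitemap: https://yourdomain.com/sitemap.xml\n"

print(has_sitewide_block(staging_leftover))  # True
print(has_sitewide_block(normal))            # False
```

Failing a release when this returns True for the production file would catch the staging leftover described in the furniture store case before it reaches Google.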

Sitemap.xml — checks:

  1. File opens without errors — valid XML, UTF-8 encoding without BOM
  2. URLs are absolute — include protocol and domain, not relative paths
  3. No noindex pages — cross-check against meta tags
  4. No URLs blocked in robots.txt — no conflicts between the two files
  5. lastmod dates are accurate — reflect the actual update date, not today's date
  6. 50,000 URL limit not exceeded — use Sitemap Index if the count is higher
  7. Submitted in GSC — status shows "Success"
  8. Auto-update is configured — a cron job or CMS plugin refreshes the sitemap when new pages are published

In Practice

A Kyiv real estate agency came to us with a sharp traffic drop — 55% organic decline over three weeks. The site runs on WordPress with around 12,000 property listings under /listings/, filtered by district and property type.

An initial GSC review showed a mass shift of pages into "Discovered — currently not indexed" status, even though the sitemap was returning HTTP 200 without errors. Screaming Frog identified the cause immediately: the entire /listings/ section was returning "Blocked by robots.txt".

The culprit was a routine plugin update. All in One SEO, upgrading to version 4.x, silently rewrote the robots.txt file and inserted Disallow: /listings/ into the global User-agent block — apparently due to a conflict with custom rules stored in the plugin settings. Over 21 days approximately 11,400 listing pages dropped out of the index.

After restoring the correct robots.txt, force-resubmitting the sitemap through GSC, and validating coverage with Ahrefs Site Audit, indexing recovered steadily: 9,800 pages were back in the index within four weeks and traffic returned to 78% of the original level.

If you run a WordPress SEO plugin, manually check robots.txt after every plugin update. All in One SEO, Rank Math, and Yoast can all overwrite the file silently during major version upgrades. Three minutes of checking in GSC can save three weeks of index recovery.
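
The manual check can be backed by a small script that compares the live file with a saved known-good copy. The sketch below uses Python's difflib to report Disallow lines that are new in the current version; the function name and sample files are hypothetical, and fetching the live file is left out.

```python
import difflib

def new_disallow_rules(known_good: str, current: str) -> list[str]:
    """Return Disallow lines present in `current` but absent from `known_good`."""
    diff = difflib.unified_diff(
        known_good.splitlines(), current.splitlines(), lineterm="", n=0
    )
    added = []
    for line in diff:
        # Added lines start with '+'; '+++' is the diff header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            rule = line[1:].strip()
            if rule.lower().startswith("disallow:"):
                added.append(rule)
    return added

KNOWN_GOOD = "User-agent: *\nDisallow: /wp-admin/\nAllow: /wp-admin/admin-ajax.php\n"
AFTER_UPDATE = (
    "User-agent: *\nDisallow: /wp-admin/\nDisallow: /listings/\n"
    "Allow: /wp-admin/admin-ajax.php\n"
)

print(new_disallow_rules(KNOWN_GOOD, AFTER_UPDATE))  # ['Disallow: /listings/']
```

Run on a schedule or after each plugin update, a non-empty result is a prompt to review the file before the next crawl.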

FAQ

Can robots.txt completely block a site from indexing?

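It can block crawling completely: the directive Disallow: / under User-agent: * tells all compliant bots not to crawl any page. It does not remove pages from the index, though. For that, use the noindex tag or the URL removal tool in Google Search Console, and keep the page crawlable until it drops out, because Googlebot cannot see a noindex tag on a URL it is not allowed to fetch.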

How many sitemap files can a website have?

For most sites there is no practical limit: a sitemap index lets you split URLs across dozens of sitemap files. A single sitemap can contain a maximum of 50,000 URLs and must not exceed 50 MB when uncompressed, and a single index file can list up to 50,000 sitemaps.

Is Crawl-delay required in robots.txt?

No. Googlebot ignores Crawl-delay and sets its own crawl rate based on server response time. The directive still matters for some other crawlers, such as Bingbot.

What should I do if sitemap.xml returns a 404 error?

Verify the path is correct — the sitemap is typically at yourdomain.com/sitemap.xml. If a CMS plugin generates it, check that the plugin is active and directory permissions are correct. After fixing the issue, resubmit the sitemap URL in Google Search Console.

Need a technical audit of your robots.txt and sitemap?

We will review the file configuration, find conflicts, and provide clear recommendations — included in our free technical site audit.


Seo Factory