Does blocking GPTBot remove my content from ChatGPT?

Blocking GPTBot stops future training runs from using your content, but it does not remove content already included in prior model snapshots. It also does not affect ChatGPT-User, the live-browse agent that cites pages in real time.

Is robots.txt enough to block AI crawlers?

Robots.txt is honored by the documented crawlers of the major AI vendors (OpenAI, Anthropic, Perplexity, Google, Meta, Apple), at least according to their published policies. It is not honored by rogue scrapers that spoof user agents, so pair robots.txt with server-level user-agent blocking and ASN rate limits for fuller coverage.

Should I block ChatGPT-User and Perplexity-User too?

Those are live-browse agents that fetch pages when a user asks a question. Blocking them removes your site from chat-based answers and the referral traffic that comes with citations. Many publishers block training bots (GPTBot, ClaudeBot, CCBot) but allow live-browse agents.

How do I confirm a blocked bot is actually blocked?

Run curl with the bot user agent (for example, curl -A "GPTBot" https://yourdomain.com). You should get a 403. Then tail your access logs and confirm the bot traffic receives 403 responses. Reverse DNS can also help confirm suspect requests came from the claimed vendor.

How to Block AI Scrapers: GPT, Claude and Perplexity Bots

AI crawlers from OpenAI, Anthropic, Perplexity, Google, and others now request millions of pages every day to train models and to answer live search queries. If you run a site, you may want to allow some bots, throttle others, or block the whole class. This guide covers the three layers that actually work: robots.txt, user-agent rules, and IP-level blocking at the edge.

The three layers of AI bot control

Blocking AI scrapers is not a single switch. Well-behaved crawlers respect robots.txt. Crawlers that ignore robots.txt but still announce themselves need user-agent filtering at the web server or CDN. Aggressive scrapers that rotate or spoof user agents can only be stopped with IP- or ASN-level rules combined with rate limiting.

Layer 1 - robots.txt: declarative, respected by the major AI bots that document their crawler.
Layer 2 - user-agent rules: enforced at Nginx, Apache, Cloudflare, or similar, for bots that ignore robots.txt.
Layer 3 - IP or ASN blocks: for scrapers that rotate user agents or hide behind residential proxies.

Known AI crawler user agents

These are the documented user agents used by the major AI companies as of 2026. Match on a substring rather than the full string, because vendors append versions and platform tokens.

OpenAI: GPTBot, OAI-SearchBot, ChatGPT-User
Anthropic: ClaudeBot, Claude-Web, anthropic-ai
Perplexity: PerplexityBot, Perplexity-User
Google AI: Google-Extended (training opt-out token), GoogleOther
Common Crawl: CCBot (feeds many model training datasets)
ByteDance: Bytespider
Amazon: Amazonbot
Meta: Meta-ExternalAgent, FacebookBot
Apple: Applebot-Extended
Other: cohere-ai, Diffbot, Omgilibot, YouBot
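As a rough illustration of substring matching, the sketch below classifies a User-Agent header by vendor. The token table is a subset copied from the list above; the function name classify_user_agent and the grouping are assumptions made for this example, not a vendor API.

```python
import re

# Subset of the tokens listed above, grouped by vendor (illustrative only).
AI_BOT_TOKENS = {
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "Anthropic": ["ClaudeBot", "Claude-Web", "anthropic-ai"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
    "Common Crawl": ["CCBot"],
    "ByteDance": ["Bytespider"],
}

# One case-insensitive pattern per vendor; re.escape keeps hyphens literal.
_PATTERNS = {
    vendor: re.compile("|".join(re.escape(t) for t in tokens), re.IGNORECASE)
    for vendor, tokens in AI_BOT_TOKENS.items()
}


def classify_user_agent(user_agent):
    """Return the vendor whose token appears anywhere in the header, else None."""
    for vendor, pattern in _PATTERNS.items():
        if pattern.search(user_agent or ""):
            return vendor
    return None
```

Because the match is a substring search, version suffixes and surrounding platform tokens do not affect the result. The same property means any client can claim a token, so treat the result as a label, not as proof of identity.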

Layer 1: robots.txt

Drop this block at the top of your robots.txt to opt out of the main AI training and answer bots. The vendors on this list document that their crawlers honor robots.txt, although compliance is voluntary and is not guaranteed for every bot.

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

A few notes. Google-Extended is not an actual crawler; it is a training opt-out token. Disallowing it opts your content out of Gemini training but does not affect regular Google Search indexing. ChatGPT-User and Perplexity-User are live-browse agents. Blocking them prevents your pages from being cited in chat answers, so weigh that against the referral traffic. The block above includes ChatGPT-User but not Perplexity-User; add or remove those entries to match your policy.
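Before deploying, it can help to sanity-check that the file says what you think it says. The sketch below parses a shortened copy of the rules with Python's standard-library robots.txt parser. The helper name is_allowed and the example.com URLs are placeholders, and each vendor's own parser may differ in edge cases, so treat this as a smoke test only.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above, for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Check whether the given rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/post"))
```

Googlebot has no matching group and there is no wildcard group, so it stays allowed, which is the behavior you want when opting out of training only.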

Layer 2: user-agent rules at the server or CDN

Honest bots respect robots.txt. Dishonest bots ignore it. The next layer is to return a 403 to matching user agents before your app sees the request.

Nginx example

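# Place the map block in the http context, outside any server block.
map $http_user_agent $blocked_ai_bot {
    default 0;
    # ~* makes the regex match case-insensitive.
    ~*(GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider|Amazonbot|cohere-ai|Diffbot) 1;
}

server {
    # Reject matching bots before the request reaches the app.
    if ($blocked_ai_bot) {
        return 403;
    }
}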

Cloudflare Workers example

export default {
  async fetch(request) {
    const ua = request.headers.get('user-agent') || '';
    const blocked = /GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider|Amazonbot/i;
    if (blocked.test(ua)) {
      return new Response('Forbidden', { status: 403 });
    }
    return fetch(request);
  },
};

If you are on Cloudflare without Workers, their WAF has a one-click Block AI Scrapers and Crawlers rule that covers most of this list and is updated as new bots appear.

Layer 3: IP and ASN level blocking

When a scraper rotates user agents or uses residential proxies, user agent matching stops working. At that point you need to look at where the traffic comes from. Start by pulling suspect IPs out of your access logs, then check the ASN with an ASN lookup.

If most of the high-volume, low-conversion traffic lands on a handful of cloud ASNs (AWS, GCP, Azure, Hetzner, DigitalOcean, OVH), you have two options:

Rate-limit by ASN: allow a low request budget per minute, since cloud ASNs rarely host real browsers.
Outright block specific ASNs or IP ranges that show abusive patterns.

Be careful with blanket cloud blocks. AWS and GCP also host legitimate services, health checkers, monitoring, and SEO tools. Start with rate limits before hard blocks.
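A minimal sketch of the first step, pulling client IPs out of an access log and grouping them by network, is shown below. It assumes a combined-log-style line that starts with the client IP, and it groups by /24 (IPv4) or /48 (IPv6) as a rough stand-in for an ASN lookup, which needs an external dataset. The function name and sample lines are made up for illustration.

```python
import ipaddress
import re
from collections import Counter

# Assumption: each log line begins with the client IP followed by a space.
LOG_IP = re.compile(r"^(\S+) ")

SAMPLE_LOG = [
    '203.0.113.5 - - [18/Apr/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.2"',
    '203.0.113.77 - - [18/Apr/2026:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.9 - - [18/Apr/2026:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]


def requests_per_network(log_lines, prefix_v4=24, prefix_v6=48):
    """Count requests per source network, busiest first."""
    counts = Counter()
    for line in log_lines:
        match = LOG_IP.match(line)
        if not match:
            continue
        try:
            ip = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # first field was not an IP address
        prefix = prefix_v4 if ip.version == 4 else prefix_v6
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts.most_common()
```

The busiest networks from this list are the ones worth running through an ASN lookup before you decide between a rate limit and a hard block.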

How to verify a block is working

After you roll out a rule, validate from both sides so you do not accidentally block real users.

Tail your access log and filter by the user agents listed above. You should see the bots hitting a 403 response, not a 200.
Run curl -A "GPTBot" https://yourdomain.com and confirm you get a 403.
Run the same curl with a real browser user agent and confirm a 200.
Use a reverse DNS lookup on suspect IPs. Some AI bots resolve to vendor-owned hostnames (for example, Common Crawl resolves to crawl-*.commoncrawl.org), and some vendors publish their crawler IP ranges instead, so you can sanity-check that the traffic really came from the claimed bot.
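The reverse DNS check is more trustworthy when it is forward-confirmed: resolve the IP to a hostname, check the hostname's suffix, then resolve the hostname back and confirm it returns the same IP. The sketch below assumes IPv4 and takes the allowed suffix list as input (the commoncrawl.org suffix in the usage line comes from the example above; verify the correct suffix for each vendor yourself). The reverse and forward parameters exist only so the resolver can be swapped out in tests.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IPv4 address."""
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # The hostname must equal an allowed suffix or be a subdomain of one.
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The PTR name must resolve back to the IP we started from.
    return ip in addresses


if __name__ == "__main__":
    print(verify_crawler_ip("192.0.2.10", ["commoncrawl.org"]))
```

The forward step matters because anyone who controls the reverse zone for their own IP range can publish a PTR record that names someone else's domain.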

What blocking AI bots actually changes

A realistic expectation matters. Blocking training bots like GPTBot, ClaudeBot, and CCBot means future model versions are less likely to train on your content. It does not remove your content from models already trained on past snapshots. Blocking live-browse bots like ChatGPT-User or Perplexity-User means chat assistants cannot cite your page in real-time answers, which removes a growing channel for referral traffic.

When to allow AI bots

Publishers are increasingly splitting the decision: block training crawlers to protect original content, but allow live-browse agents so AI search products still surface the site. The minimum effort version of that is:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

This pattern says: do not train on my content, but feel free to link to it when a user asks a question that references it.
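If you maintain the two lists in code or config, the file can be generated rather than hand-edited. The sketch below is one way to do that; build_robots_txt and its parameters are hypothetical names for this example, and the output format simply mirrors the block above.

```python
def build_robots_txt(blocked, allowed, sitemap_url=None):
    """Render a split-policy robots.txt: Disallow for `blocked`, Allow for `allowed`."""
    overlap = set(blocked) & set(allowed)
    if overlap:
        raise ValueError(f"agents listed as both blocked and allowed: {sorted(overlap)}")
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        blocked=["GPTBot", "ClaudeBot", "CCBot"],
        allowed=["ChatGPT-User", "Perplexity-User"],
    ))
```

Rejecting an agent that appears in both lists catches the most common editing mistake before it reaches production.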

Apache and Caddy configurations

Nginx is the most common example, but Apache and Caddy are still heavily used for publishers and self-hosted blogs. The patterns are equivalent but the syntax differs.

Apache (.htaccess or virtual host)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider|Amazonbot|cohere-ai|Diffbot|Applebot-Extended|Meta-ExternalAgent) [NC]
RewriteRule .* - [F,L]

The [F] flag returns 403 Forbidden, and [L] stops processing further rewrite rules. Test the rule by curling with one of the blocked user agents and verifying the 403, then curl with a real browser user agent to confirm normal traffic still gets through.

Caddy v2

example.com {
    @aibots {
        header_regexp User-Agent (?i)(GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider|Amazonbot)
    }
    respond @aibots 403

    root * /srv
    file_server
}

Caddy's matcher syntax is compact. The header_regexp matcher tests the User-Agent header against the regex, and the (?i) flag makes the match case-insensitive. Because the respond directive runs before the file server, blocked requests never touch disk, which keeps load low even if a scraper is hammering the origin.

HAProxy

frontend http-in
    bind *:443 ssl crt /etc/ssl/certs
    acl block_ai_bot hdr_sub(user-agent) -i GPTBot ChatGPT-User ClaudeBot anthropic-ai PerplexityBot CCBot Bytespider Amazonbot
    http-request deny if block_ai_bot
    default_backend app

Platform-specific instructions

Most site owners are not hand-editing Nginx. They run WordPress, Ghost, Squarespace, Webflow, Next.js on Vercel, or similar. Here are the concrete recipes for the most common platforms.

WordPress

The cleanest approach on WordPress is a plugin like Blackhole for Bad Bots or AI-Blocker for WordPress, which maintain a regularly updated user-agent list and expose a toggle. If you prefer zero plugins, drop the robots.txt block above into a static robots.txt file at your web root, and add the Apache .htaccess rewrite for user-agent enforcement. Managed hosts like Kinsta, WP Engine, and Pressable provide server-level toggles for AI scraper blocking on newer plans.

Ghost

Ghost serves a static robots.txt by default, so your block rules live in content/data/robots.txt (self-hosted) or the custom robots.txt field in Ghost(Pro). For user-agent enforcement, you need the Nginx or Caddy reverse proxy in front of Ghost, because Ghost itself does not block by user agent. If you host Ghost on DigitalOcean or Linode, put the Nginx block in the reverse proxy config.

Next.js on Vercel

Vercel lets you enforce user-agent rules two ways: in application code, or with a custom rule in the Vercel Firewall. The simplest is a middleware.ts file at the root of the project:

import { NextRequest, NextResponse } from 'next/server';

const AI_BOTS = /GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider|Amazonbot|cohere-ai/i;

export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') || '';
  if (AI_BOTS.test(ua)) {
    return new NextResponse('Forbidden', { status: 403 });
  }
  return NextResponse.next();
}

export const config = {
  matcher: '/((?!_next/static|_next/image|favicon.ico).*)',
};

For robots.txt, Next.js can serve it as a static file under public/robots.txt or generate it with an app/robots.ts file. Most modern Next sites use the app-router approach, which lets you programmatically build the list of disallowed user agents alongside your sitemap.

Static site generators (Hugo, Jekyll, Astro, Eleventy)

For static sites, robots.txt is easy - put the file in your source tree and the build system copies it to the root. User-agent enforcement lives in the hosting layer: Cloudflare Workers, Vercel middleware, Netlify edge functions, or the reverse proxy on your VPS. A common gotcha: GitHub Pages does not let you set custom headers or run middleware, so robots.txt is your only tool there.

Substack, Medium, and hosted platforms

Hosted platforms like Substack and Medium manage robots.txt centrally. Substack added per-publication AI opt-out toggles in 2024; Medium has not exposed a direct control and relies on its platform-wide robots.txt. Before moving your writing to a hosted platform, check the current status of its AI opt-out controls if this matters to you.

Rate limiting: the middle ground

Between full allow and full block, rate limiting is the pragmatic middle ground. The goal is to let polite AI bots crawl at a pace your infrastructure is happy with, while immediately cutting off abusive or rotating scrapers. Rate limiting at the edge keeps the origin unaffected.

Cloudflare rate-limiting rule

In Cloudflare, open Security > WAF > Rate limiting rules and create a rule with these parameters:

Expression: (http.user_agent contains "GPTBot") or (http.user_agent contains "CCBot") (extend with the full user-agent set).
When rate exceeds: 60 requests per minute per source IP.
Then: Block for 1 hour.

This lets legitimate crawling continue within polite limits and bans sources that exceed them. You can tighten or loosen the budget based on your own traffic patterns.

Nginx limit_req

http {
    map $http_user_agent $ai_bot_zone {
        default "";
        ~*(GPTBot|CCBot|ClaudeBot|PerplexityBot) "ai";
    }

    limit_req_zone $ai_bot_zone zone=ai_bots:10m rate=10r/m;

    server {
        location / {
            limit_req zone=ai_bots burst=5 nodelay;
            proxy_pass http://app_upstream;
        }
    }
}

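This example allows AI bot traffic up to 10 requests per minute, with a small burst buffer. Because every matching user agent maps to the same key ("ai"), that budget is shared by all matched bots combined rather than applied per vendor or per IP. Requests beyond the limit are rejected with a 503, the limit_req_status default. Real browsers continue unimpeded because their user agent does not match, so $ai_bot_zone is empty and Nginx does not count requests with an empty key.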
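To get a feel for what rate=10r/m with burst=5 nodelay does to a burst of requests, here is a simplified leaky-bucket model in Python. It is a sketch of the general technique for a single key, not Nginx's actual implementation; the class name and the use of plain seconds as timestamps are assumptions of this example.

```python
class LeakyBucket:
    """Simplified model of one rate-limit key (not Nginx's real code)."""

    def __init__(self, rate_per_minute, burst):
        self.rate_per_second = rate_per_minute / 60.0
        self.burst = burst
        self.excess = 0.0      # requests currently "queued" above the rate
        self.last_seen = None  # time of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at `now` (seconds) is accepted."""
        if self.last_seen is None:
            self.last_seen = now
            return True
        # The bucket drains at the configured rate while time passes.
        drained = (now - self.last_seen) * self.rate_per_second
        excess = max(0.0, self.excess - drained + 1.0)
        if excess > self.burst:
            return False  # over budget; state is left unchanged
        self.excess = excess
        self.last_seen = now
        return True


if __name__ == "__main__":
    bucket = LeakyBucket(rate_per_minute=10, burst=5)
    print([bucket.allow(0.0) for _ in range(10)])
```

In this model, ten simultaneous requests yield six acceptances (the first request plus five burst slots), and further requests are refused until enough time has passed for the bucket to drain.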

Measuring the impact of blocking

Blocking AI bots is not free. You may lose referral traffic from chat-based answers, long-tail SEO via AI-generated excerpts, or citations in AI search results. Measure the impact so you can make an informed decision rather than a reflexive one.

Before-and-after analytics

Two weeks before rolling out the block, record a baseline:

Organic search traffic by landing page (Google Analytics, Plausible, Fathom).
Referral traffic from AI assistants: look for referrers with chat.openai.com, chatgpt.com, perplexity.ai, claude.ai, gemini.google.com, copilot.microsoft.com, and similar.
Server-side log counts for each AI user agent over 14 days.

After the block, compare the same metrics over an equal window. The most common pattern is: training-bot traffic drops to zero, live-browse referral traffic drops or stays flat, and organic search is unaffected in the short term.
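The referrer comparison can be sketched in a few lines. The host set below is taken from the baseline list above and will need extending as products change domains; the function names and the idea of feeding in raw Referer header values are assumptions for this example.

```python
from collections import Counter
from urllib.parse import urlsplit

# Referrer hosts named in the baseline list above; extend as needed.
AI_REFERRER_HOSTS = {
    "chat.openai.com",
    "perplexity.ai",
    "claude.ai",
    "gemini.google.com",
    "copilot.microsoft.com",
}


def ai_referrer_host(referrer):
    """Return the known AI host a referrer URL belongs to, or None."""
    host = (urlsplit(referrer).hostname or "").lower()
    for known in AI_REFERRER_HOSTS:
        if host == known or host.endswith("." + known):
            return known
    return None


def count_ai_referrals(referrers):
    """Count referrals per AI host from an iterable of Referer values."""
    return Counter(h for h in map(ai_referrer_host, referrers) if h)


def percent_change(before, after):
    """Percentage change between two counts, or None if there is no baseline."""
    if before == 0:
        return None
    return round((after - before) / before * 100, 1)
```

Run count_ai_referrals over the baseline window and the post-block window, then compare host by host with percent_change.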

Long-term signal: has the training cutoff passed?

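AI models ship with training data cutoffs. If a new model was trained on data through March 2026 and you blocked GPTBot in February 2026, the block was in place before the cutoff, so pages published after the block were likely excluded from that release (pages crawled before February may still be in it). If you blocked in May 2026, the block came after the cutoff and did not affect that release. Model release notes usually publish the cutoff, so you can estimate exposure retroactively.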
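The same reasoning can be applied per page. The sketch below assumes a crawler that complies with robots.txt, a published cutoff date, and no third-party copies of the page; under those assumptions a page could only have been collected between its publication date and whichever came first, the block or the cutoff. The function name is hypothetical.

```python
from datetime import date


def could_be_in_training_set(published, blocked_from, training_cutoff):
    """True if a compliant crawler had any chance to fetch the page in time."""
    crawlable_until = min(blocked_from, training_cutoff)
    return published < crawlable_until


if __name__ == "__main__":
    cutoff = date(2026, 3, 31)
    block = date(2026, 2, 1)
    # Published before the block: may already have been collected.
    print(could_be_in_training_set(date(2026, 1, 10), block, cutoff))
    # Published after the block: a compliant crawler never saw it.
    print(could_be_in_training_set(date(2026, 2, 15), block, cutoff))
```

This is an upper bound on exposure, not a measurement: a True result means the page may have been crawled, not that it was.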

Legal and licensing considerations

Blocking AI bots is legally straightforward in most jurisdictions: your site, your terms. What is evolving is how AI training fits under copyright law. A few practical notes:

Terms of service matter. Including explicit "no AI training" language in your site terms creates a paper trail if you later need to act on a violation.
TDM (Text and Data Mining) opt-outs apply in the EU. Under Article 4 of the EU DSM Directive, rights holders can opt out of text and data mining for commercial purposes via a machine-readable signal. Robots.txt is commonly used as that signal. If your site has an EU audience, robots.txt is not just good hygiene; it is part of your rights reservation.
Licensing deals exist. Some publishers are cutting paid licensing deals with AI companies. If your content has commercial value, blocking first and then negotiating from a position of scarcity is a recognized playbook.

Real-world scenarios

Scenario 1: news publisher

A mid-sized news publisher rolled out the full block in early 2024 and saw these outcomes: GPTBot, ClaudeBot, and CCBot traffic fell to zero within 24 hours. Referral traffic from ChatGPT dropped by roughly half (because ChatGPT-User was blocked along with the training bots). Google search traffic remained stable. Six months later, the publisher signed a licensing deal with one of the major AI firms and selectively unblocked their bot in exchange for compensation. That progression is common.

Scenario 2: SaaS documentation site

A developer-tools SaaS runs a documentation site. Blocking AI scrapers would remove the site from AI code-generation contexts, which is a strong growth channel (users ask ChatGPT how to integrate, and ChatGPT cites the docs). This SaaS chose to allow all AI bots, because the customer-acquisition benefit outweighed the concern about content being used for training.

Scenario 3: personal blog

A personal blog with original research and essays blocked all AI training bots (GPTBot, ClaudeBot, CCBot, PerplexityBot, Bytespider) and kept ChatGPT-User and Perplexity-User allowed. Traffic from AI referrals stayed roughly constant. The author's preference was that AI models cite the work rather than absorb it, and the setup enforces that split.

Common pitfalls

Robots.txt not at the domain root. It must live at https://yourdomain.com/robots.txt exactly. Subdomain robots.txt does not cover parent domains and vice versa.
Case sensitivity on user-agent matching. Vendors change case over time. Always use a case-insensitive match or lowercase the input before comparing.
CDN caching robots.txt too aggressively. If you update the file, invalidate the CDN cache or you will spend a week wondering why bots are still hitting.
Blocking Googlebot by accident. Google-Extended is the AI training token; Googlebot is regular search. Never disallow Googlebot unless you genuinely want to leave Google Search.
Forgetting about Common Crawl. CCBot feeds training datasets for many vendors indirectly, so leaving it allowed undoes part of your block.

A recommended default configuration

If you want one concrete setup to start with, this is what most small-to-medium publishers end up with in 2026.

Publish a robots.txt that disallows all major training bots, allows live-browse agents, and includes explicit sitemap declarations.
Add a user-agent block at the CDN level (Cloudflare's built-in AI scraper rule, Vercel middleware, or Nginx regex). This catches bots that ignore robots.txt.
Add a rate-limiting rule at 60 requests per minute per AI user-agent source IP, as a safety net for burst crawling.
Put a short "AI training restrictions" clause in your terms of service.
Baseline analytics for two weeks, roll out the block, compare at 30 and 90 days.
Re-audit every three to six months: new AI bots appear, old ones rename, and vendors update their documented user agents.

Choosing what to block, allow, or throttle

The biggest strategic mistake site owners make is treating every AI crawler as the same thing. In practice there are at least four different classes of traffic, and each one has a different business value. A training crawler that wants to copy your articles into a future model release is not the same as a live-answer agent that cites your page when a user asks a question. A commercial SEO bot hitting your pricing pages every hour is not the same as a browser automation cluster scraping your entire archive with residential IPs.

Before you paste in a giant deny list, decide which goal matters most to your site. Some publishers want to preserve the option to license content later, so they block training bots aggressively. Some documentation sites want maximum visibility in chat answers, so they allow live-browse agents and even allow training on the theory that citations bring more developers into the funnel. Some membership or paywalled sites care primarily about infrastructure cost, so they rate limit any machine traffic that is not part of traditional search.

A useful way to think about policy is to split your site into content classes. Public blog posts might be allowed for live-browse bots but blocked for training. Pricing pages, legal pages, and changelogs might be allowed across the board because the downside is low. Search results, internal search, feed endpoints, and faceted archives should usually be blocked because they create the most waste and the least user value when scraped.

Allow: public reference pages you want cited in live answers.
Throttle: broad archives, category listings, changelogs, and documentation trees that can absorb crawling but should not be hammered.
Block: search results, user profiles, private dashboards, paywalled content, and any page family that creates load without meaningful referral value.
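One way to make the allow, throttle, and block split concrete is a longest-prefix lookup from path to policy. The prefixes below are invented for illustration; your own path families will differ, and the default for unlisted paths is a choice you should make deliberately.

```python
# Hypothetical path families; replace with your own site structure.
POLICY_BY_PREFIX = {
    "/blog/": "allow",
    "/docs/": "throttle",
    "/docs/internal/": "block",
    "/archive/": "throttle",
    "/search": "block",
    "/account/": "block",
}


def policy_for(path, default="allow"):
    """Return the policy of the longest matching prefix, else `default`."""
    best = ""
    for prefix in POLICY_BY_PREFIX:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return POLICY_BY_PREFIX.get(best, default)


if __name__ == "__main__":
    for path in ("/blog/my-post", "/docs/api/auth", "/docs/internal/x", "/search?q=a"):
        print(path, "->", policy_for(path))
```

Longest-prefix matching lets a narrow rule such as /docs/internal/ override the broader /docs/ rule without depending on the order of the table.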

How to map your site before you block anything

The safest blocking program starts with an inventory, not a firewall rule. Pull a month of access logs, group requests by path prefix, and then isolate the user agents and source networks associated with AI bots. You want to know which directories they actually hit, which assets they waste bandwidth on, and whether they are concentrated on a handful of pages or crawling the entire site shallowly.

On a typical content site, the highest-volume bot waste appears in one of five places: tag archives, internal search, feed endpoints, paginated category pages, and image-heavy post templates. AI bots often fetch the same URL variations repeatedly because the site exposes multiple sort orders or query-parameter combinations. If you only block the root article URLs but leave archives and query parameters open, you still pay most of the infrastructure cost.

A simple audit checklist is enough for most teams:

List every public path family under the site.
Mark whether each family is essential for discovery, useful for citation, or mostly operational noise.
Check whether the pages are static, cached HTML, or expensive dynamic renders.
Identify where AI bots spend time now versus where you actually want human users to land.
Apply a separate policy for articles, tools, archives, feeds, and private or authenticated areas.
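A first pass at the inventory can be sketched as follows. It assumes you have already parsed your logs into (user agent, request target) pairs for AI bot traffic, and it uses the first path segment as a stand-in for "path family". Both are simplifications, and the function names are made up for this example.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_family(request_target):
    """Reduce '/tag/python?page=3' to its first path segment, '/tag'."""
    path = urlsplit(request_target).path
    first = path.strip("/").split("/", 1)[0]
    return "/" + first


def audit(requests):
    """Summarize bot hits per path family.

    `requests` is an iterable of (user_agent, request_target) pairs.
    Returns (family, total_hits, distinct_urls) tuples, busiest first.
    """
    hits = Counter()
    distinct = {}
    for _user_agent, target in requests:
        family = path_family(target)
        hits[family] += 1
        distinct.setdefault(family, set()).add(target)
    return [(family, count, len(distinct[family]))
            for family, count in hits.most_common()]
```

A family with many hits but few distinct URLs points to repeated fetching of the same pages, while many distinct URLs that differ only in query strings points to parameter-driven crawl waste.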

This is also the right moment to check your traditional SEO posture. If a page family is already noindex, canonicalized away, or blocked from standard search, that is a strong hint that it should also be blocked or rate-limited for AI crawlers.

Cloudflare, Fastly, and AWS WAF patterns that scale

The best place to enforce AI bot policy is usually the edge. If you can reject or throttle a request before it touches the origin, you save compute, cache churn, and logging noise. Most teams do not need custom code for the first version. A short WAF rule set plus robots.txt gets you most of the way there.

Cloudflare WAF rule pattern

Cloudflare is currently the easiest platform for small and medium sites because it can combine user-agent matching, IP reputation, ASN context, and rate limiting in a single edge policy. A good production setup is not just "block AI scrapers." It is usually three rules:

Block the training bots you have explicitly opted out from.
Challenge or rate-limit suspicious "AI-like" traffic on cloud ASNs that spoofs normal browser user agents.
Block the most expensive dynamic paths for any machine traffic that is not explicitly allowed.

The third rule is underrated. If your origin has tools, search, or highly personalized pages, the difference between serving those routes to machines and cutting them off at the edge can be the difference between a quiet billing cycle and a surprise overage.

Fastly and VCL logic

On Fastly, the same strategy translates into VCL snippets or custom edge dictionaries. Mature teams usually maintain a dictionary for known crawler tokens and another for explicitly allowed partner bots. That separation matters because policy changes faster than code. A content team can update an allow or block entry without waiting for a full deploy.

The edge advantage of Fastly is that you can cheaply normalize request paths before evaluating the rule. If you collapse tracking parameters, duplicate query strings, and alternate path variants before policy evaluation, you reduce both bot waste and cache fragmentation at the same time.
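The normalization idea is independent of Fastly. The Python sketch below shows the logic you would port to VCL or another edge runtime: drop tracking parameters, collapse duplicate keys, sort what remains, and unify trailing slashes. The tracking parameter list and the trailing-slash rule are assumptions to adapt to your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; adjust to what your site actually receives.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def normalize(url):
    """Return a canonical form of `url` for policy checks and cache keys."""
    parts = urlsplit(url)
    kept = {}
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in TRACKING_PARAMS or key.startswith(TRACKING_PREFIXES):
            continue
        kept.setdefault(key, value)  # keep the first of any duplicate keys
    query = urlencode(sorted(kept.items()))
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it never reaches the server anyway.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))


if __name__ == "__main__":
    print(normalize("https://Example.com/blog/post/?utm_source=x&b=2&a=1&a=9#top"))
```

Two requests that normalize to the same string can share one cache entry and one policy decision, which is where the savings come from.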

AWS WAF and application load balancers

AWS WAF is better than many people expect for bot control, but it requires slightly more explicit design. The strongest pattern is to combine a label-based rule for user-agent matches with rate-based rules scoped to the same request set. That way polite crawlers can still operate slowly while obvious abuse gets blocked automatically.

If you are running behind an Application Load Balancer, attach the WAF there instead of in the app tier. This keeps rejected requests from waking application servers. Pair it with AWS IP sets only when the vendor publishes stable ranges; otherwise you will spend time managing churn instead of policy.

How to avoid blocking good traffic by accident

False positives are the reason many teams never move past robots.txt. They are worried that blocking a crawler substring or a cloud ASN will accidentally cut off search bots, partner monitors, preview services, or legitimate power users on corporate networks. That concern is justified. The solution is staged rollout, not inaction.

Start by logging or simulating the rule first. Most edge platforms let you count matches without enforcing them. Run the rule in monitor mode for at least a few days, then compare the matched requests against your analytics and known integrations. You want to catch things like uptime monitors, Slack unfurlers, developer documentation fetchers, or mobile preview services that happen to run from cloud ASNs.

Explicit allowlists matter here. If you use Bing Site Scan, Ahrefs, Semrush, Google Search Console URL inspection, or synthetic monitoring, list those agents and IP sources separately so they bypass the AI-bot block. A well-managed allowlist is much safer than assuming everything that is not a browser is disposable.

Keep search-engine verification tools out of broad cloud ASN blocks.
Exclude your own uptime monitors and performance probes from rate limiting.
Do not lump browser preview agents, social unfurlers, and AI bots into one rule.
Recheck partner integrations after every major firewall change.

Monitoring and reporting after rollout

Once the rules are live, treat them like any other production control. They need reporting, thresholds, and a review cadence. At minimum, set up a weekly view that answers four questions: how many AI-bot requests were blocked, how much origin traffic was avoided, which path families were hit most often, and whether any sudden changes appeared in AI referral traffic or regular search traffic.

If your stack supports it, tag blocked requests with a rule name and push them into your log pipeline. That gives you a clean time series by policy. "Blocked by AI training rule" is more useful than a pile of 403 responses. Over a month or two, you can identify whether the biggest savings come from one vendor, one archive section, or one abusive cluster that deserves a stronger IP-level response.
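As an illustration of the "time series by policy" idea, the sketch below aggregates JSON log lines into per-rule daily counts. The field names timestamp, action, and rule are hypothetical; map them to whatever your CDN or WAF actually emits.

```python
import json
from collections import defaultdict

# Hypothetical log lines; real field names depend on your log pipeline.
SAMPLE_EVENTS = [
    '{"timestamp": "2026-04-01T10:00:00Z", "action": "block", "rule": "ai-training"}',
    '{"timestamp": "2026-04-01T11:30:00Z", "action": "block", "rule": "ai-training"}',
    '{"timestamp": "2026-04-02T09:00:00Z", "action": "block", "rule": "cloud-asn-rate-limit"}',
    '{"timestamp": "2026-04-02T09:05:00Z", "action": "allow", "rule": "ai-training"}',
    'not json',
]


def blocked_by_rule_per_day(log_lines):
    """Return {rule: {YYYY-MM-DD: blocked_count}} from JSON log lines."""
    series = defaultdict(lambda: defaultdict(int))
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict) or event.get("action") != "block":
            continue
        day = str(event.get("timestamp", ""))[:10]  # date part of ISO 8601
        series[event.get("rule", "unlabeled")][day] += 1
    return {rule: dict(days) for rule, days in series.items()}
```

Requests blocked without a rule label end up under "unlabeled", which is itself a useful signal that some enforcement path is not tagging its decisions.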

It is also worth tracking the ratio of human sessions to blocked bot hits by landing page. Some pages attract far more machine interest than human interest. That is a sign the content may be overexposed through feeds, weak archives, or duplicated paths. Blocking bots fixes the cost symptom, but it can also highlight structural crawl waste on the site itself.

How blocking AI bots affects SEO and discoverability

The most common fear is that blocking AI bots will somehow damage Google rankings. In the short term, that is usually false. Google's regular search crawler is still Googlebot, and training opt-out is handled through Google-Extended. Blocking AI training crawlers does not tell Google Search to deindex you. The risk is not traditional SEO. The risk is visibility inside AI answer surfaces.

If live-browse agents cannot fetch your pages, they are less likely to cite you in conversational answers. That can reduce referral traffic from ChatGPT Search, Perplexity, and other AI-native discovery paths. Whether that matters depends on your niche. A news publisher may decide the licensing leverage is worth more than the AI referrals. A tool, docs, or software brand might decide the exact opposite.

Another nuance: blocking the bots does not fix content already copied into older model weights or older Common Crawl snapshots. The value of blocking is future control, not retroactive erasure. That makes the decision more strategic than emotional. You are deciding how your site participates in the next wave of datasets and answer systems, not trying to rewrite the past.

What a publisher policy can look like in practice

Many sites now use a split policy that sounds like this: "We allow AI systems to fetch public pages for real-time user-requested citation, but we do not permit training, indexing for synthetic content generation, or bulk archival crawling outside the terms of service." That policy is then expressed three ways:

Machine-readable: robots.txt and platform-specific crawler tokens.
Technical enforcement: user-agent, rate-limit, and IP rules at the edge.
Legal language: terms of service or licensing terms that explicitly reserve rights.

The main benefit of writing the policy down is internal consistency. If legal, editorial, and engineering all use the same definition of "allowed" and "blocked," your controls stay coherent as new vendors and bots appear.

FAQ: practical questions site owners keep asking

Should I block ChatGPT-User or leave it allowed?

If you want your pages to be cited when a user asks ChatGPT to browse the web, leave ChatGPT-User allowed. If you view AI answer products as competitors or you have no measurable benefit from those referrals, block it. The important point is that this is a distribution decision, not a security decision. Training bots and live-browse bots solve different problems and should be evaluated separately.

Is robots.txt enough on its own?

It is enough for documented, well-behaved crawlers. It is not enough for spoofed bots, browser automation, scraping clusters that reuse consumer user agents, or any bot that simply ignores robots.txt. Think of robots.txt as the policy declaration. Enforcement still belongs at the CDN, reverse proxy, or WAF if abuse is a real cost.

Can I block by ASN instead of by user agent?

You can, but it is a blunt tool. Blocking entire cloud ASNs can catch AI traffic, but it also catches many legitimate services and users. ASN filtering is best used for rate limiting, staged mitigation, or emergency response when one network is obviously abusive. It is rarely the cleanest long-term control by itself.

What if the bot keeps changing user agents?

Once a scraper rotates user agents, you stop treating it as a declared AI crawler and start treating it as generic abuse. Shift from robots.txt and substring matches to behavioral controls: rate limits, bot scores, suspicious ASN rules, challenge pages, and path-based protections on expensive endpoints.

Will blocking AI bots save money immediately?

Often yes, especially on sites with many archive pages or expensive dynamic routes. The savings show up fastest when machine traffic was bypassing cache and hitting the origin repeatedly. The effect is smaller on highly cached static sites, but even there it can reduce log noise, bandwidth, and cache churn.

How often should I revisit the rule set?

Every quarter is a reasonable default. New bots appear, vendors rename tokens, live-browse products get introduced, and your own business goal can change. A publisher who wanted maximum exclusion in January might want selective inclusion in October if AI referrals become material.

Cloudflare-specific blocking recipes

Cloudflare hosts an enormous share of the web and exposes first-class tooling for AI bot control. If you are already behind Cloudflare, you have three practical options.

Verified Bot category rules

Cloudflare maintains a Verified Bots directory that labels traffic from documented AI crawlers. The quickest block is a rule like (cf.verified_bot_category in {"AI Crawler"}) set to Block or Managed Challenge. This covers the majority of well-behaved AI bots without you maintaining a user-agent list by hand. The tradeoff is that it is an opaque list; you trust Cloudflare to keep it current.

User-agent WAF rule

For explicit control, a custom WAF rule with (http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "PerplexityBot") set to Block gives a transparent, auditable deny. Add the other bots from the user-agent list as you decide your policy. Use a challenge action instead of Block if you want to distinguish legitimate from spoofed traffic.

AI Labyrinth

Cloudflare's AI Labyrinth is an opt-in feature that serves generated-content mazes to AI scrapers to waste their compute budget. It is more aggressive than blocking: it keeps the bot engaged while ensuring it scrapes no useful training data. Cloudflare announced it as available on all plans, including Free; check your dashboard for current availability. It is useful for sites that want to actively degrade unauthorized training rather than just deny it.

Nginx rate-limiting patterns for persistent scrapers

User-agent rules stop obvious bots. Rate limits stop the aggressive ones that rotate identities. An effective pattern combines per-IP and per-ASN rate zones.

Example directive shape (not a drop-in config):

Define a per-IP zone: limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s.
Define a per-ASN zone using the GeoIP2 module with the ASN lookup: limit_req_zone $geoip2_asn_number zone=perasn:10m rate=50r/s.
Apply both in the location / block with limit_req zone=perip burst=20 nodelay; and limit_req zone=perasn burst=100 nodelay;.

This catches both single-IP hammering and distributed scraping from the same cloud ASN. Tune the rates to your site's legitimate traffic profile; the numbers above are illustrative.

Monitoring whether your blocks are working

A rule is not useful if you cannot verify it runs. Put a dashboard widget next to your WAF rule counts so you can see at a glance what is being blocked, by rule, over time.

Cloudflare: Analytics > Security > Events. Filter by service=firewall-rules, then by rule name. Look for a steady non-zero count on your AI blocks.
Fastly: Real-time analytics or the Security dashboard. Log entries with fastly_info.state showing BLOCKED are the ones to count.
AWS WAF: CloudWatch metrics in the AWS/WAFV2 namespace. Graph BlockedRequests per rule.
Self-hosted: Add a structured log field for the rule that matched, then graph it in Grafana or your log aggregator.

If a rule counts zero over a week, either the bot is not visiting you or the rule does not match what you think. Both cases warrant investigation. A silent WAF rule is almost always a broken WAF rule.
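The "silent rule" check is easy to automate once you have per-rule daily counts. The sketch below assumes a mapping from rule name to a list of daily block counts, newest last; the rule names and data shape are invented for this example.

```python
def silent_rules(daily_counts, expected_rules, window_days=7):
    """Return the expected rules with zero blocks in the last `window_days`.

    `daily_counts` maps rule name -> list of daily block counts, newest last.
    A rule missing from the mapping counts as silent.
    """
    silent = []
    for rule in expected_rules:
        recent = daily_counts.get(rule, [])[-window_days:]
        if sum(recent) == 0:
            silent.append(rule)
    return silent


if __name__ == "__main__":
    counts = {
        "ai-training": [4, 0, 3, 5, 1, 0, 2],
        "asn-limit": [0, 0, 0, 0, 0, 0, 0],
    }
    print(silent_rules(counts, ["ai-training", "asn-limit", "missing-rule"]))
```

Passing the list of rules you expect to exist, rather than iterating over whatever the logs contain, is what lets the check catch a rule that was never deployed or never matched.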

When legal opt-outs matter more than technical blocks

Technical blocks stop the current fetch. They do not remove your content from datasets already scraped. If your concern is training data in existing or future models, a legal signal may be more effective than a firewall rule.

EU TDM opt-out: the text-and-data-mining exception in the EU DSM Directive lets rights holders reserve their rights against commercial mining via a machine-readable signal. Send a tdm-reservation: 1 HTTP response header (as defined by the W3C community group's TDM Reservation Protocol), keep the opt-out in robots.txt, or publish a site-wide TDM reservation statement. The reservation is intended to have legal effect in EU jurisdictions even when the scraper ignores it.
Terms of service: an explicit clause that prohibits automated collection for model training gives you standing in contract claims even if technical blocks fail.
DMCA and safe-harbor requests: for content already ingested, a formal takedown or opt-out request to the model vendor is the correct channel. OpenAI, Anthropic, Google, and Perplexity all publish opt-out processes.

By Theodore Uzun

Founder and Senior Software Engineer, IP Trackers

Published April 18, 2026. Editorially reviewed.
