AI Company IP Ranges 2026: GPTBot, ClaudeBot, CCBot Verified

The major AI companies publishing crawler IP ranges in 2026 are OpenAI (GPTBot, OAI-SearchBot, ChatGPT-User), Anthropic (ClaudeBot), Perplexity (PerplexityBot, Perplexity-User), Google (Googlebot with the Google-Extended policy token), and Common Crawl (CCBot). Most other AI vendors rent compute from AWS, GCP, or Azure, so the IP address alone rarely proves the request. This guide walks through the published feeds, the ASN patterns, and the FCrDNS verification steps that actually hold up when you tune a WAF or audit logs.

Why AI company IPs are harder to pin down than search engine IPs

Google and Bing publish stable crawler IP ranges tied to their own ASNs. AI companies rarely do, because their crawlers run on AWS, GCP, Azure, and a handful of smaller clouds. A single IP can belong to OpenAI today and a completely unrelated startup next week. This means three things:

  • Do not rely on an IP list alone. Always combine IP with a reverse DNS check and, when possible, the documented user agent.
  • ASN filtering gets you most of the signal for training bots, because AI training runs on cloud compute at scale.
  • Live-browse agents (ChatGPT-User, Perplexity-User) often run from a small, published subset of IPs and are easier to verify.

OpenAI

OpenAI publishes three separate crawler lists so you can allow or block each one independently:

  • GPTBot - training crawler. Published at openai.com/gptbot.json.
  • OAI-SearchBot - search indexer for ChatGPT Search. Published at openai.com/searchbot.json.
  • ChatGPT-User - live browse agent. Published at openai.com/chatgpt-user.json.

These lists are small CIDR sets and OpenAI updates them periodically. Fetch the JSON directly rather than caching hardcoded ranges in your firewall.
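
Once a feed has been fetched, checking whether a source IP falls inside it takes only the standard `ipaddress` module. The sketch below assumes the feed uses the `prefixes` / `ipv4Prefix` / `ipv6Prefix` layout that Google's crawler range files use; confirm the actual schema of each vendor feed before relying on it. The function names and the sample ranges (reserved documentation addresses) are illustrative, not real vendor data.

```python
import ipaddress

def parse_prefixes(feed):
    """Turn a decoded feed dict into a list of ip_network objects.

    Assumed layout: {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    networks = []
    for entry in feed.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks

def ip_in_feed(ip, networks):
    """Return True if the address falls inside any network of the feed."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)

# Placeholder feed built from documentation ranges, not real crawler ranges.
SAMPLE_FEED = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}
SAMPLE_NETWORKS = parse_prefixes(SAMPLE_FEED)
```

A membership hit only says the address is in the published range at the time of the fetch, which is why the feed should be re-fetched rather than frozen.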

Anthropic (Claude)

Anthropic publishes its crawler IPs as well. ClaudeBot, Claude-Web, and anthropic-ai are the user agents to match. The published CIDR list lives at anthropic.com/robots-allowlist.json. Anthropic's traffic historically originates from Google Cloud ASNs, which means pure ASN filtering will catch legitimate GCP traffic too.

Perplexity

Perplexity runs both an indexer (PerplexityBot) and a live-browse agent (Perplexity-User). They publish their IP ranges at perplexity.ai/perplexitybot.json and perplexity.ai/perplexity-user.json. The live-browse ranges are small and stable enough to whitelist.

Google (Gemini and AI Overviews)

Google does not run Gemini crawlers as a separate IP list. Training opt-out is handled through the Google-Extended robots token rather than a distinct crawler. AI Overviews and Gemini live-browse requests come from the same ranges as regular Google crawling, published at developers.google.com/search/apis/ipranges/googlebot.json and developers.google.com/search/apis/ipranges/special-crawlers.json.

Common Crawl

Common Crawl feeds many model training datasets indirectly. Its crawler is CCBot, and its traffic originates primarily from Amazon Web Services. The best verification is reverse DNS: all legitimate CCBot requests should resolve to crawl-*.commoncrawl.org.

Other significant AI crawlers

  • ByteDance: Bytespider. Uses ByteDance and cloud ranges. No published list.
  • Amazon: Amazonbot. Runs from AWS ranges (AS16509).
  • Meta: Meta-ExternalAgent, FacebookBot. Runs from Meta ASNs, primarily AS32934.
  • Apple: Applebot-Extended. Runs from Apple ASNs, primarily AS714 and AS6185.
  • Cohere: cohere-ai. No published IP list, runs from cloud compute.

ASN patterns: where AI traffic actually originates

When you see high-volume crawling and want to know whether it is an AI company, ASN context is more useful than an IP match. Run the suspect address through our ASN lookup and look for these patterns:

  • AS16509, AS14618 (Amazon): Common Crawl, Amazonbot, and many smaller AI projects.
  • AS15169 (Google): Googlebot, Google-Extended training signal, Anthropic production.
  • AS8075 (Microsoft): Bingbot, OpenAI production compute, Azure-hosted crawlers.
  • AS32934 (Meta): FacebookBot, Meta-ExternalAgent.
  • AS13335 (Cloudflare): Many smaller AI startups run through Cloudflare Workers.

Seeing AI-related user agents on AS16509 is entirely normal. Seeing the same user agent on AS9009 or a residential ASN is a strong signal of a spoofed bot.

Verifying an AI bot actually came from the claimed vendor

User agents are trivial to spoof, and IP lists drift. The method that actually holds up is forward-confirmed reverse DNS (FCrDNS), which is the same method Google recommends for verifying Googlebot:

  1. Take the source IP. Run a reverse DNS lookup to get the PTR hostname.
  2. Confirm the hostname belongs to the claimed vendor domain (for example, *.commoncrawl.org for CCBot, or an OpenAI-documented hostname).
  3. Run a forward DNS lookup on that hostname and confirm it resolves back to the same source IP.

If any of these three checks fails, treat the request as unverified and most likely spoofed. This workflow matters because real abuse often comes from residential proxy networks that mimic AI crawler user agents to bypass basic filters.
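
The three steps above can be sketched in Python with the standard `socket` module. This is a minimal illustration, not a vendor-endorsed implementation: the function name `fcrdns`, the `allowed_suffixes` parameter, and the injectable `reverse` and `forward` resolvers are hypothetical names, added so the logic can be exercised without live DNS.

```python
import ipaddress
import socket

def _reverse(ip):
    """Step 1: PTR lookup. Raises an OSError subclass when no PTR exists."""
    return socket.gethostbyaddr(ip)[0]

def _forward(hostname):
    """Step 3: forward lookup. Returns the set of addresses for the hostname."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

def fcrdns(ip, allowed_suffixes, reverse=_reverse, forward=_forward):
    """Return True only if PTR, vendor domain, and forward lookup all agree."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record: cannot be confirmed

    # Step 2: the hostname must sit under a vendor domain. Forcing a leading
    # dot stops "evilcommoncrawl.org" from matching "commoncrawl.org".
    suffixes = ["." + s.lstrip(".").lower() for s in allowed_suffixes]
    if not any(hostname.endswith(s) for s in suffixes):
        return False

    try:
        addresses = forward(hostname)
    except OSError:
        return False  # hostname does not resolve

    # Compare parsed addresses so different IPv6 spellings still match.
    source = ipaddress.ip_address(ip)
    resolved = {ipaddress.ip_address(a.split("%")[0]) for a in addresses}
    return source in resolved
```

With live DNS, a call such as `fcrdns("203.0.113.42", [".commoncrawl.org"])` runs all three steps; any failure along the way returns `False` rather than raising.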

Practical log analysis recipe

The fastest way to understand AI traffic on your site is a short log audit. Aggregate your access logs over a week and group by ASN and user agent:

  1. Strip requests to static assets; focus on HTML hits.
  2. Group by source IP, then resolve each IP to an ASN with ASN lookup.
  3. Bucket by user agent substring: GPTBot, ClaudeBot, PerplexityBot, CCBot, Bytespider.
  4. Flag any bucket where the ASN does not match the expected vendor cloud. Those requests are likely spoofed.
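
The recipe above can be sketched as a small Python function. It assumes each log record has already been parsed into a dict with `path`, `ua`, and `asn` keys (the ASN lookup happens upstream). The expected-ASN map is copied from the reference table later in this guide and should be treated as a starting hypothesis rather than ground truth; the static-asset suffix list is likewise an assumption.

```python
from collections import Counter

# Expected ASNs per user agent token, taken from this guide's reference table.
EXPECTED_ASNS = {
    "GPTBot": {8075, 400645},
    "ClaudeBot": {15169, 32748},
    "PerplexityBot": {14061, 16509},
    "CCBot": {16509},
    "Bytespider": {55967, 396986},
}

# Assumed list of static-asset extensions to skip (step 1).
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".svg", ".ico", ".woff2")

def bucket_for(user_agent):
    """Step 3: map a user agent string to a known token, or None."""
    lowered = user_agent.lower()
    for token in EXPECTED_ASNS:
        if token.lower() in lowered:
            return token
    return None

def audit(records):
    """Return (counts, flagged), both keyed by (token, asn)."""
    counts, flagged = Counter(), Counter()
    for rec in records:
        path = rec["path"].split("?", 1)[0].lower()
        if path.endswith(STATIC_SUFFIXES):
            continue  # step 1: ignore static assets
        token = bucket_for(rec["ua"])
        if token is None:
            continue  # not claiming to be one of the tracked bots
        key = (token, rec["asn"])
        counts[key] += 1
        if rec["asn"] not in EXPECTED_ASNS[token]:
            flagged[key] += 1  # step 4: ASN does not match the vendor cloud
    return counts, flagged
```

A flagged bucket is a candidate for the reverse DNS check, not proof of spoofing by itself, since the expected-ASN map can lag behind vendor changes.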

Why published IP lists still are not enough on their own

The most important operational point in this topic is that an AI company's published IP list is a moving control, not a permanent identity. Security teams sometimes paste a JSON allowlist into a WAF, assume the problem is solved, and then discover a month later that the vendor added new ranges or shifted a service from one cloud to another. AI products move quickly, and infrastructure follows that pace.

There are three reasons the ranges change. First, vendors add or retire services: a training crawler, a live browsing agent, a search indexer, a synthetic evaluation worker, or an image fetcher can each end up with different egress. Second, they shift between clouds or regions to lower cost or increase capacity. Third, they split services by risk profile; a user-triggered browser agent is more likely to receive a stable and narrow range than a massive training pipeline that bursts across many regions.

This is why a mature workflow looks more like certificate management than static firewalling. You poll the published feed, compare it with the prior version, log the change, and then push the new data into the systems that need it. If you only copy ranges by hand into a document, the document goes stale almost immediately.

The reverse is also true: not every request from the published range is equally valuable. A live-browse range might deserve to be allowed while a training range is denied. The presence of a vendor-owned CIDR tells you who the request came from, but it does not tell you whether that request is strategically useful to your site.

How to interpret vendor JSON feeds correctly

When an AI company publishes a JSON list, read it like a machine interface, not a blog post. The most important fields are the network prefixes, the intended product, and any update timestamp or checksum that lets you detect drift. In practice, a good internal parser stores five pieces of information for each feed fetch:

  • The fetch time.
  • The feed URL and HTTP status.
  • The set of returned CIDRs.
  • The product label you map those CIDRs to internally.
  • Whether the set changed compared with the previous fetch.

This gives you version history, which matters when you are debugging a sudden surge in traffic or trying to explain why a previously blocked request is now allowed. If someone on the team asks, "Why did this OpenAI range bypass the firewall yesterday?" you want a concrete answer: the published feed changed at 02:14 UTC and the allowlist sync job updated the WAF at 02:30 UTC.

Do not assume the vendor will keep field names stable forever. Vendors are better at publishing a useful list than at freezing an API schema. Your internal importer should be resilient: it should fail closed on malformed data, alert on schema surprises, and keep the previous known good version if a fetch breaks.
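
A minimal sketch of such an importer follows. It records the five pieces of information listed above and falls back to the previous known good set when the response is malformed. The `FeedSnapshot` and `take_snapshot` names are hypothetical, and the `prefixes` / `ipv4Prefix` / `ipv6Prefix` schema is an assumption borrowed from Google's range files; adjust the parser to each vendor's documented format.

```python
import ipaddress
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Optional

class FeedError(Exception):
    """Raised when a feed response cannot be trusted."""

def parse_cidrs(body):
    """Parse a feed body strictly; any surprise raises FeedError."""
    cidrs = set()
    try:
        for entry in json.loads(body)["prefixes"]:
            value = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if value is None:
                raise FeedError(f"unrecognised entry: {entry!r}")
            cidrs.add(str(ipaddress.ip_network(value, strict=False)))
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        raise FeedError(f"malformed feed: {exc}") from exc
    if not cidrs:
        raise FeedError("feed contained no prefixes")
    return frozenset(cidrs)

@dataclass(frozen=True)
class FeedSnapshot:
    fetched_at: datetime        # the fetch time
    url: str                    # the feed URL
    http_status: int            # the HTTP status
    cidrs: FrozenSet[str]       # the set of returned CIDRs
    product: str                # internal product label
    changed: bool               # differs from the previous fetch?

def take_snapshot(url, product, http_status, body,
                  previous: Optional[FeedSnapshot] = None):
    """Build a snapshot, keeping the previous CIDRs if this fetch is unusable."""
    now = datetime.now(timezone.utc)
    try:
        if http_status != 200:
            raise FeedError(f"unexpected HTTP status {http_status}")
        cidrs = parse_cidrs(body)
    except FeedError as exc:
        logging.warning("feed %s rejected: %s", url, exc)  # alert hook goes here
        kept = previous.cidrs if previous else frozenset()
        return FeedSnapshot(now, url, http_status, kept, product, changed=False)
    changed = previous is None or cidrs != previous.cidrs
    return FeedSnapshot(now, url, http_status, cidrs, product, changed)
```

Storing every snapshot, including the rejected fetches, is what makes it possible to answer the "why did this range bypass the firewall yesterday" question from data.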

OpenAI: the cleanest operational model right now

OpenAI is one of the easier vendors to work with because it separates products clearly. GPTBot is the training crawler, OAI-SearchBot is the search/indexing crawler, and ChatGPT-User is the live browser used when a user asks the product to fetch a page. That split is operationally useful because you can make a nuanced policy. Many sites block GPTBot, review OAI-SearchBot, and allow ChatGPT-User for citation traffic.

The cleanest internal representation is to store OpenAI ranges by product, not just by company. If your WAF only knows "OpenAI" as a single label, you lose the ability to make a business distinction later. A publisher that wants to deny training but allow real-time answer traffic will need separate controls eventually.

When you verify an OpenAI request, stack the evidence in this order: documented user agent, source IP inside the published JSON feed, matching behavior for the product type, and expected request volume. For example, a single fetch from ChatGPT-User to one article after a user prompt looks normal. A flood of thousands of sequential article fetches from the same ChatGPT-User range does not. Product semantics matter.

OpenAI also illustrates why exact range management beats rough ASN logic. A lot of OpenAI infrastructure sits on Azure-adjacent or cloud-provider ranges. If you filter only on cloud ASN, you will mix OpenAI with unrelated services. The vendor feed is the safer authority here.

Anthropic: published ranges help, but cloud context still matters

Anthropic is similar to OpenAI in that it publishes usable crawler information, but operationally the cloud context matters more because Claude-related traffic has often been associated with large Google Cloud footprints. That means a naive ASN rule can create collateral damage quickly.

If you are trying to validate a Claude request, do not stop at "it came from GCP." That is too broad to mean much. Instead, match the Anthropic user-agent token, confirm the IP falls within the vendor feed you trust, then compare the request pattern with what the product claims to do. A live browsing fetch should not resemble a full-site mirror.

This is also a good example of why logging request paths matters. If a vendor claims the crawler is for documentation and answer citation but it is pulling every paginated archive on your site, you may want a tighter path-level control even if the identity is legitimate.

Perplexity: pay attention to the difference between index and user modes

Perplexity has become important because it sits closer to search than many other assistants. Its crawler model usually splits between an indexing component and a user-triggered browsing component. That split is strategically important for publishers because the referral value often comes from the user-triggered side, not the broad indexing side.

If you allow Perplexity at all, consider whether you want to allow the user mode but not the index mode, or vice versa. Some sites prefer to be discoverable in answer flows but do not want a wide crawl of the archive. Others want broad discoverability and accept the crawl cost. The right answer depends on whether the site's value lies in raw article reach, tool usage, subscription conversion, or licensing leverage.

Operationally, Perplexity is another vendor where published ranges are more useful than generic network heuristics. When a feed exists, use the feed. When it does not, treat the traffic as untrusted until user agent, reverse DNS, behavior, and volume all line up.

Google and the Gemini ecosystem: do not confuse crawler identity with policy tokens

Google remains the easiest place for teams to make mistakes because it has both traditional search crawling and AI-related controls. The most important distinction is this: Google-Extended is a policy token for training and generative use. It is not a standalone crawler identity in the way that Googlebot is. If you treat Google-Extended like a separate bot with separate IP ranges, you will misunderstand what your controls are doing.

Verification for Google traffic still follows the old and reliable pattern: reverse DNS, then forward-confirm the hostname, then compare against Google's published IP ranges for the relevant crawler class. This is why Google remains the benchmark for bot verification hygiene.

The practical implication is that you should not try to answer "Which Gemini IPs do I block?" as if Gemini were a single dedicated crawler estate. The more useful question is, "Which Google crawling uses do I allow, and how do I express my training preference?"

Common Crawl is still one of the most important sources to understand

Common Crawl deserves more attention than it usually gets because it is often an indirect source for model training. A site can block the most famous AI-company bots and still end up widely represented in downstream datasets if CCBot is left alone. For many publishers, blocking or tightly limiting Common Crawl is the single highest-impact move they can make after the headline AI bots.

Common Crawl is also a good lesson in verification discipline. The user agent alone is not enough. The hostname and reverse DNS pattern are a meaningful part of trust. If a request says CCBot but the source cannot be tied back to the expected domain pattern, treat it as spoofing until proven otherwise.

Microsoft, Meta, Apple, Amazon, ByteDance, and smaller vendors

Not every AI-adjacent request will come from the five names most commonly discussed. Copilot and Bing-related traffic can overlap with Microsoft search and Azure-hosted workloads. Meta and Apple often show up in content-fetching or preview-related contexts as well as AI contexts. Amazonbot may matter if you operate product or marketplace content. ByteDance, Cohere, Diffbot, and smaller vendors can appear in bursts depending on niche and region.

The operational pattern is the same for all of them: document what the user agent claims to be, tie the source back to either published ranges or trustworthy vendor-controlled DNS, and then compare the observed behavior with the stated purpose. A social unfurler that fetches one page when shared is different from a systematic crawler walking every page on the domain.

In other words, the problem is not only "Which company owns this IP?" The problem is "Which product is making this request, from which network, for what purpose, and at what volume?" That fuller question is what leads to sound firewall policy.

Building an internal source-of-truth for AI crawler identity

Once AI-related traffic matters to your site, move the knowledge out of individual heads and into a maintained internal inventory. That inventory can live in a simple JSON or YAML file, a spreadsheet, or a small internal table, but it should answer the same fields every time:

  • The company name.
  • The product or crawler identity.
  • The user-agent token(s).
  • The feed URL if published.
  • The expected reverse-DNS domain pattern.
  • The expected ASN or cloud family.
  • Your current policy: allow, throttle, block, or review.
  • The reason for that policy.

Once you have this, everything gets easier. WAF rules become generated output instead of hand-edited guesses. Support can answer why a request was blocked. Legal can review the policy in a human-readable form. Engineering can compare newly observed traffic against a stable baseline.
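
As an illustration of "generated output instead of hand-edited guesses", the sketch below models one inventory entry with the fields listed above and renders `allow` / `deny` lines for Nginx's access module. The `CrawlerEntry` and `render_nginx` names, the policy reasons, and the CIDRs (documentation ranges) are all hypothetical; the vendor details are the ones stated in this guide.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CrawlerEntry:
    company: str
    product: str
    user_agent_tokens: List[str]
    feed_url: Optional[str]          # None when the vendor publishes no feed
    ptr_pattern: Optional[str]       # expected reverse-DNS domain pattern
    expected_asns: List[str]
    policy: str                      # "allow", "throttle", "block", or "review"
    reason: str
    cidrs: List[str] = field(default_factory=list)  # filled by the feed sync

def render_nginx(entries):
    """Render allow/deny directives; throttle and review are only annotated."""
    lines = []
    for e in entries:
        lines.append(f"# {e.company} {e.product}: {e.policy} ({e.reason})")
        if e.policy == "allow":
            lines.extend(f"allow {cidr};" for cidr in e.cidrs)
        elif e.policy == "block":
            lines.extend(f"deny {cidr};" for cidr in e.cidrs)
    return "\n".join(lines) + "\n"

# Example inventory. CIDRs are placeholders, not real vendor ranges.
INVENTORY = [
    CrawlerEntry("OpenAI", "GPTBot", ["GPTBot"],
                 "https://openai.com/gptbot.json", "*.openai.com",
                 ["AS8075", "AS400645"], "block", "no training use",
                 ["192.0.2.0/24"]),
    CrawlerEntry("OpenAI", "ChatGPT-User", ["ChatGPT-User"],
                 "https://openai.com/chatgpt-user.json", "*.openai.com",
                 ["AS8075", "AS400645"], "allow", "citation traffic",
                 ["198.51.100.0/24"]),
]
```

Nginx evaluates `allow` and `deny` in order and stops at the first match, so a real generator would also need a deliberate ordering rule; this sketch simply follows inventory order.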

This is especially useful when someone wants to make an exception. "Please allow this vendor for the docs section only" becomes a trackable policy change rather than a one-off firewall edit that nobody remembers three months later.

When to use CIDR allowlists, ASN heuristics, or reverse DNS

Different controls are useful for different confidence levels. Exact CIDR allowlists are strongest when the vendor publishes them. ASN heuristics are useful when there is no clean feed, but they are much broader and riskier. Reverse DNS is an excellent verification layer, especially when the vendor has stable hostname patterns, but it is not always available or sufficient on its own.

A good rule of thumb:

  • Published CIDR list exists: use that as the primary identity control.
  • No CIDR list, but stable hostname pattern exists: verify with reverse DNS and forward-confirm.
  • No clean vendor signal exists: use ASN and behavior only as suspicion signals, not proof.

This keeps your policy from overclaiming certainty. The worst internal security artifacts are the ones that pretend weak evidence is strong evidence. "Likely AI-related cloud traffic" is an honest label. "Definitely vendor X" is not, unless you actually have the proof.
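
The rule of thumb can be written down as a small labelling function so that log tags never claim more than the evidence supports. This is a sketch under stated assumptions: the function name, parameter names, and label strings are hypothetical, and the inputs are expected to come from checks performed elsewhere.

```python
def attribution_label(in_published_cidr, fcrdns_passed,
                      asn_plausible, behavior_plausible):
    """Map available evidence to an honest confidence label.

    in_published_cidr: True if the IP is inside the vendor's published feed.
    fcrdns_passed:     True if reverse DNS matched the vendor domain and
                       forward-confirmed; False or None otherwise.
    asn_plausible / behavior_plausible: weak signals only.
    """
    if in_published_cidr:
        return "verified: published CIDR"
    if fcrdns_passed:
        return "verified: forward-confirmed reverse DNS"
    if asn_plausible and behavior_plausible:
        # Suspicion signal, not proof of a specific vendor.
        return "likely AI-related cloud traffic"
    return "unattributed"
```

Keeping the weak-evidence label distinct from the verified ones makes it harder for a dashboard or ticket to quietly upgrade a guess into an attribution.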

Operational playbooks for different teams

For publishers

Publishers care about three things: infrastructure cost, training control, and referral value from AI answer engines. The usual playbook is to block or throttle training and archive-heavy crawling, allow a narrow set of live-browse products, and review referrals monthly to see whether the allowed traffic is worth keeping.

For SaaS documentation teams

Docs teams often reach the opposite conclusion. They want content cited inside assistants because that exposure can drive adoption. The better workflow for them is to allow most verified live-browse and index bots, block obviously abusive mirrors, and keep path-specific rate limits on search, internal APIs, or very expensive rendered pages.

For fraud and abuse teams

Fraud teams usually care less about the business relationship and more about whether the request is genuinely from the claimed vendor. Their workflow centers on evidence quality: feed membership, FCrDNS, behavioral plausibility, request rate, session correlation, and whether the same actor is rotating identities to evade policy.

Common mistakes when interpreting AI-company traffic

  • Confusing cloud ownership with product ownership. An AWS IP is not automatically Amazonbot, and an Azure IP is not automatically OpenAI.
  • Assuming one policy fits every product. Training and live-browse products often deserve different treatment.
  • Not versioning published lists. Without a history, you cannot explain drift or debug unexpected matches.
  • Treating reverse DNS as optional. It is one of the best confidence boosters you have.
  • Ignoring path distribution. Identity is only half the story; what the bot is crawling matters just as much.

FAQ: the questions teams keep asking

Can one IP belong to more than one AI company over time?

Yes. On rented cloud infrastructure, an IP can be reassigned after a service changes region, scales down, or rotates pools. That is why historic screenshots of a reverse lookup are weak evidence. Use current feeds and current verification, not old notes.

Should I block every request from a cloud ASN to stop AI crawlers?

Usually no. That is a coarse emergency control, not a smart default. It will catch unrelated SaaS traffic, monitoring, integrations, and sometimes legitimate users behind enterprise egress. Use it only when you understand the collateral damage.

What is the strongest single proof that a request is genuine?

The strongest proof is convergence: published range membership, documented user agent, reverse DNS consistent with the vendor, forward DNS confirmation, and behavior that matches the product's stated purpose. One signal alone is rarely enough.

How often should I refresh AI-company range data?

Daily is a reasonable default for automated sync. Weekly can work for smaller teams if you also alert on fetch failures. Anything less frequent starts to feel stale for fast-moving products.

What if the vendor publishes no list at all?

Then you downgrade your certainty. Use user-agent, reverse DNS, path behavior, ASN context, and rate profile to decide whether the traffic is suspicious or useful, but do not present the outcome internally as hard attribution when it is not.

Is the goal always to block?

No. The real goal is to understand and control. For some sites the best outcome is a well-maintained allowlist. For others it is a selective deny policy. Good attribution supports both directions.

Implementation patterns that work in production

Once you stop treating AI-company IP data as an interesting research question and start treating it as an operational input, three patterns show up again and again. The first is the sync job that pulls published feeds and stores a versioned internal copy. The second is the policy engine that translates those identities into allow, block, or rate-limit controls. The third is the verification layer that checks whether observed traffic still matches the expected vendor behavior.

Sites that skip the first pattern usually end up with stale notes and one-off firewall edits. Sites that skip the second pattern collect good attribution data but never turn it into decisions. Sites that skip the third pattern trust spoofable user agents too much and get burned by fake bots. The operational sweet spot is to maintain all three.

Pattern 1: the daily feed sync

Run a small daily job that fetches each published vendor JSON, validates the response, normalizes the CIDRs, and writes the result into a stable internal format. If a vendor returns malformed JSON or times out, the job should alert and preserve the last known good set instead of deleting everything.

Pattern 2: policy at the edge

The internal feed becomes useful only when it powers the edge. That can mean Cloudflare IP lists, Fastly edge dictionaries, AWS WAF IP sets, or Nginx include files. The key is to let the policy engine target individual products, not just companies. That way you can allow OpenAI's user-triggered browser while denying GPTBot, or rate-limit one vendor without touching another.

Pattern 3: verification on log ingest

A mature logging pipeline tags incoming requests with the closest known match: vendor feed hit, expected ASN, reverse-DNS confidence, and user agent category. That lets analysts answer questions later without redoing every lookup manually. It also gives you a way to identify traffic that looks like a vendor on one dimension but not the others.

What to store for each suspected AI request

If you are doing log analysis or incident review, keep the evidence bundle together. The minimum useful record for a suspected AI request includes:

  • Timestamp
  • Source IP and port
  • User agent
  • Path and query
  • Matched vendor feed, if any
  • ASN and organization name
  • PTR hostname
  • Forward-confirmation result
  • Action taken: allowed, challenged, throttled, blocked

This sounds like overkill until you have to explain a bad block to a partner or debug why one product started behaving differently after a vendor rotated ranges. Good evidence storage turns a vague crawler problem into an auditable operational process.
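
One way to keep that evidence bundle together is a single record type serialized as one JSON line per request. The sketch below mirrors the list above; the class name, field names, and sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

ACTIONS = {"allowed", "challenged", "throttled", "blocked"}

@dataclass
class EvidenceRecord:
    timestamp: str                     # ISO 8601, UTC
    source_ip: str
    source_port: int
    user_agent: str
    path_and_query: str
    matched_feed: Optional[str]        # vendor feed that contained the IP, if any
    asn: Optional[int]
    asn_org: Optional[str]
    ptr_hostname: Optional[str]
    forward_confirmed: Optional[bool]  # None when no PTR was available
    action: str

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")

    def to_json_line(self):
        """Serialize to a single line suitable for append-only log storage."""
        return json.dumps(asdict(self), sort_keys=True)

# Example record with placeholder values.
EXAMPLE = EvidenceRecord(
    timestamp="2026-01-15T02:14:00Z",
    source_ip="203.0.113.42",
    source_port=44321,
    user_agent="CCBot/2.0",
    path_and_query="/archive?page=7",
    matched_feed=None,
    asn=16509,
    asn_org="Amazon",
    ptr_hostname="crawl-203-0-113-42.commoncrawl.org",
    forward_confirmed=True,
    action="throttled",
)
```

Using `None` rather than `False` for a missing PTR keeps "could not check" distinguishable from "checked and failed" during later review.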

Three real-world workflows

Workflow 1: publisher protecting original articles

A publisher usually wants to separate high-value editorial content from low-value crawl waste. The best workflow is to poll published feeds, block known training products, allow a narrower set of browse products, and then review whether the allowed products actually send referral traffic. If they do not, the publisher can tighten policy later without guessing.

Workflow 2: SaaS docs team that wants citations

A docs team often wants the opposite. They want answer engines to fetch their documentation because citations and code examples can generate adoption. Their workflow is to allow verified live-browse and index products, block abusive mirrors, and keep stronger rate limits on search and API-like paths that are expensive to serve.

Workflow 3: abuse or fraud analyst

A fraud team is less interested in content strategy and more interested in whether a claimed identity is genuine. Their workflow starts with the source IP and ends with a confidence statement: feed match, reverse DNS, forward confirmation, ASN plausibility, request rate, and whether the path pattern aligns with the claimed product. The output is not always a block. Sometimes it is just better confidence in the traffic label.

How to maintain range data without over-engineering it

Teams sometimes swing between two bad extremes: completely manual maintenance or an elaborate internal service no one can support. The middle ground is usually best. A scheduled job, a plain data format, a versioned history, and a simple push into your edge provider are enough for most organizations.

The key design choice is ownership. Decide who owns changes when a new vendor appears, who validates a suspected feed, and who can change the allow/block policy. If no one owns the workflow, it quietly rots.

Final rule of thumb

If a vendor publishes exact ranges, use them. If they do not, do not pretend rough cloud ownership is the same thing. Treat AI-company IP attribution as a layered confidence problem, not as a single magical lookup. The teams that do this well end up with fewer false positives, cleaner WAF policies, and much more confidence in what their logs are actually telling them.

A short checklist you can apply today

  1. List the AI vendors and products you actually care about.
  2. Pull published feeds where they exist and version them internally.
  3. Match feeds to user agents and expected reverse-DNS patterns.
  4. Separate allow, throttle, and block policies by product, not just company.
  5. Tag suspected AI traffic in logs with evidence quality, not guesswork.
  6. Revisit the inventory quarterly, because the network story changes fast.

That small process will usually do more for accuracy than chasing one more static IP list from a forum post or a social thread. Good AI crawler attribution is mostly disciplined operations, not secret data.

In other words, the winning habit is not memorizing more CIDRs. It is keeping a repeatable verification workflow that stays current as vendor infrastructure changes.

If you can consistently answer who the request likely came from, what product it belongs to, how confident you are, and what policy should apply, you already have the foundation most teams need.

That foundation is what turns AI-company IP analysis from a recurring argument into a routine operational control.

Reference: user agents and expected origins

Keep this table handy for log triage. A mismatch between user agent and origin is the fastest way to spot spoofing.

Vendor | User agent token | Expected PTR pattern | Typical ASN | Feed URL
OpenAI | GPTBot | *.openai.com | AS8075, AS400645 | openai.com/gptbot.json
OpenAI | OAI-SearchBot | *.openai.com | AS8075, AS400645 | openai.com/searchbot.json
OpenAI | ChatGPT-User | *.openai.com | AS8075, AS400645 | openai.com/chatgpt-user.json
Anthropic | ClaudeBot, anthropic-ai | (no stable pattern) | AS15169, AS32748 | anthropic.com/robots-allowlist.json
Perplexity | PerplexityBot | *.perplexity.ai | AS14061, AS16509 | perplexity.ai/perplexitybot.json
Perplexity | Perplexity-User | *.perplexity.ai | AS14061, AS16509 | perplexity.ai/perplexity-user.json
Google | Googlebot (with Google-Extended policy) | *.googlebot.com, *.google.com | AS15169 | developers.google.com/search/apis/ipranges/googlebot.json
Common Crawl | CCBot | crawl-*.commoncrawl.org | AS16509 | (no feed; use PTR)
ByteDance | Bytespider | (irregular) | AS55967, AS396986 | (no feed)
Amazon | Amazonbot | *.crawl.amazonbot.amazon | AS16509 | (no feed; use PTR)
Meta | Meta-ExternalAgent, FacebookBot | *.fbsbx.com, *.facebook.com | AS32934 | (no feed; use PTR)
Apple | Applebot-Extended | *.applebot.apple.com | AS714, AS6185 | (no feed; use PTR)

FCrDNS verification, the actual commands

The reverse-then-forward check is not complicated; most people simply have not run it before. Here is the exact sequence, on every major platform.

Linux and macOS

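dig -x 203.0.113.42 returns the PTR record. Note the hostname. Then dig +short crawl-203-0-113-42.commoncrawl.org returns the A record(s). If one of the A records matches the original IP, FCrDNS passes. If none matches, do not trust the PTR: whoever controls the reverse zone for an IP can point it at any hostname, and only the forward lookup proves the domain owner agrees.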

Windows

nslookup 203.0.113.42 returns the PTR. Then nslookup the-hostname-returned returns the A record. Compare.

Automated (bash one-liner)

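IP=203.0.113.42; PTR=$(dig +short -x "$IP" | head -n1); [ -n "$PTR" ] && dig +short "$PTR" | grep -qxF "$IP" && echo PASS || echo FAIL

This version fails cleanly when no PTR exists and passes when any one of several A records matches. It confirms only that the PTR and A records agree; you still need to check that $PTR ends in the vendor's domain.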

At scale in log ingest

For production, cache the FCrDNS result per IP for 24 hours. A fresh DNS query on every request adds latency to the request path and makes request handling depend on resolver availability and on reverse zones controlled by whoever owns the source IP. Most edge providers (Cloudflare, Fastly, Akamai) offer built-in verified-bot lists that handle this caching for you. Use those when available and fall back to manual FCrDNS only for vendors the edge provider does not cover.
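
Where no edge provider covers a vendor, a per-IP cache with a 24-hour lifetime can be sketched in a few lines. The `FcrdnsCache` name and its parameters are hypothetical; `verify` is whatever function performs the actual FCrDNS check, and the clock is injectable so expiry can be tested without waiting.

```python
import time

class FcrdnsCache:
    """Remember FCrDNS verdicts per IP for a fixed time-to-live."""

    def __init__(self, verify, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._verify = verify          # callable: ip -> bool
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # ip -> (stored_at, verdict)

    def check(self, ip):
        now = self._clock()
        hit = self._entries.get(ip)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]              # still fresh: no DNS traffic
        verdict = self._verify(ip)     # expired or unseen: verify again
        self._entries[ip] = (now, verdict)
        return verdict
```

This sketch never evicts entries, so a long-running process would want a size bound or periodic cleanup; negative verdicts are cached for the same period as positive ones, which may or may not suit your threat model.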

Sample sync job for the daily feed pull

A minimal implementation is maybe 30 lines of Python or 50 lines of Go. The shape is: fetch each vendor feed, parse the CIDRs, compare with the previous set, log the diff, push the new set into your edge provider via API.

Pseudocode outline:

  1. For each vendor in your inventory:
  2. Fetch the feed URL with a 10-second timeout.
  3. On failure, log and keep the previous known-good set.
  4. On success, parse the CIDRs from the documented JSON schema.
  5. Diff against the stored previous set; log added and removed.
  6. Store the new set with timestamp and source URL.
  7. Push the new set to your WAF via API (Cloudflare Rulesets, AWS WAF IP Sets, Fastly ACLs, etc).
  8. Alert on schema changes or unexpectedly large diffs (e.g. 50% of ranges changed in a single poll).

Run this hourly or daily depending on how quickly you need to react to vendor changes. Store the feed history indefinitely; it is inexpensive and invaluable for debugging.
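
The outline above might look like the following in Python. It is a sketch under stated assumptions: `fetch` and `push` are hypothetical callables you supply (one returns the parsed CIDR set for a vendor, the other calls your edge provider's API), and holding back a suspiciously large diff for review instead of pushing it is a design choice of this sketch, not something the outline mandates.

```python
import logging

log = logging.getLogger("feed-sync")

def diff_cidrs(previous, current):
    """Return (added, removed) as sorted lists."""
    return sorted(current - previous), sorted(previous - current)

def sync_vendor(name, previous, fetch, push, max_change_ratio=0.5):
    """Sync one vendor; returns the CIDR set that is now in force."""
    try:
        current = set(fetch())
    except Exception as exc:  # network error, timeout, parse failure
        log.warning("%s: fetch failed (%s); keeping previous set", name, exc)
        return previous

    added, removed = diff_cidrs(previous, current)
    if not added and not removed:
        return previous

    log.info("%s: added=%s removed=%s", name, added, removed)
    changed = len(added) + len(removed)
    if previous and changed / len(previous) >= max_change_ratio:
        # Assumption: large swings are held for human review, not auto-applied.
        log.error("%s: %d changes against %d known ranges; holding",
                  name, changed, len(previous))
        return previous

    push(current)
    return current

def sync_all(inventory, state, push):
    """inventory: {vendor_name: fetch_callable}; state: {vendor_name: set}."""
    for name, fetch in inventory.items():
        state[name] = sync_vendor(name, state.get(name, set()), fetch,
                                  lambda cidrs, n=name: push(n, cidrs))
    return state
```

Persisting `state` and the logged diffs between runs gives you the feed history described above; the first run for a vendor has no previous set, so it is pushed without the ratio check.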

Edge-provider comparison for AI-bot handling

Each major edge provider exposes a different level of built-in AI-bot intelligence. Before you build your own feed-sync pipeline, check what you get for free.

Cloudflare

Cloudflare maintains a Verified Bots directory that labels AI crawlers with categorized tags. WAF rules can match on cf.verified_bot_category for "AI Crawler" without maintaining your own list. For deeper control, Cloudflare also exposes per-vendor lists through Cloudflare Radar and offers AI Audit, which shows which AI products fetched your site and lets you toggle policy per vendor. This is the most mature built-in offering in 2026.

Fastly

Fastly's Next-Gen WAF (Signal Sciences) categorizes AI crawlers and exposes policy hooks. Fastly ACLs accept CIDR lists via API, so a feed-sync job can push vendor JSON into an ACL directly. No out-of-the-box AI Audit equivalent, but the primitives are flexible.

AWS WAF

AWS WAF does not ship an AI-bot category natively. The AWS Managed Bot Control rule group includes verified-bot tagging but lacks vendor-specific labels. The practical approach is to maintain your own IP Sets populated from vendor feeds via a Lambda cron, then reference those IP Sets in rate-based or match rules.

Akamai

Akamai's Bot Manager has categorized AI bots for several years. Its per-vendor visibility is strong and its challenge options are granular. The tradeoff is cost - Bot Manager is an enterprise product.

Self-hosted (Nginx, HAProxy, Caddy)

Self-hosted stacks have no built-in AI-bot intelligence; you build it yourself. The sync-job pattern described above is the reference architecture. Tools like ipset on Linux give you efficient CIDR-set matching in iptables or nftables without hitting the slow-path for every request.

When to use third-party bot intelligence feeds

If maintaining your own vendor-feed sync pipeline is more than your team wants to own, third-party bot intelligence feeds (DataDome, PerimeterX/HUMAN, Kasada, Arkose Labs) aggregate AI-bot signatures across many customer sites and sell a managed product. The tradeoff is cost and vendor lock-in; the upside is that their models adapt to emerging scrapers faster than an internal sync job does.

For mid-market publishers with serious scraping pain, a managed product often pays for itself in reduced operational burden. For smaller sites, the built-in capabilities of Cloudflare or Fastly usually suffice.
