Mitigating LLM Crawler Impact on SFMC CloudPages

Why Standard Solutions Fail:

Client-Side JS/Meta Tags: Standard advice often suggests using <meta name="robots" content="noindex, nofollow"> or client-side JavaScript redirects. Both take effect only after the response has been delivered, and many crawlers never execute JavaScript at all. By the time the bot has downloaded the page, the SFMC server has already registered the "Impression" (HTTP 200 OK), and the Super Message has been billed.
Standard robots.txt: Many modern LLM scrapers are aggressive and ignore standard robots.txt directives, which act merely as a "polite request" rather than a hard barrier.

The Defense-in-Depth Solution Architecture

To effectively stop surging artificial CloudPage views, we must implement a multi-layered approach that stops the bot as early in the request lifecycle as possible, using SFMC's specific architectural quirks.

Layer 1: Simulating a robots.txt file at the root domain to deter compliant bots before they request sub-pages.
Layer 2: Using Server-Side JavaScript (SSJS) to detect known bot User-Agents and kill the page load immediately with a 403 Forbidden status.
Layer 3: Migrating high-risk or data-processing endpoints from standard "Landing Pages" to "Code Resources" to bypass the Super Message impression billing mechanism entirely.

Technical Implementation Details

Layer 1: Simulating robots.txt via Code Resource

SFMC does not provide traditional FTP or file-system access to place a standard robots.txt file at the root of your domain. We must simulate this using a Web Studio Code Resource.

Implementation Steps:

Navigate to Web Studio > CloudPages.
Select your relevant Collection.
Click Add Content and select Code Resource.
Set the Type to Text.
CRITICAL: Set the URL routing to resolve as closely to yourdomain.com/robots.txt as your SAP (Sender Authentication Package) and domain configuration allow. (Note: If you cannot map this exactly to the root, Layer 2 becomes your primary defense.)

Content for robots.txt Code Resource:

Depending on the purpose of your CloudPages, you have two strategic options for structuring this file:

Option A: The Blocklist Approach (Safer for Marketing Pages)

Use this approach if your CloudPages are public marketing assets that users might share on social media. It blocks known AI bots but allows everything else (like Facebook/LinkedIn link preview generators).

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GrokBot
Disallow: /

User-agent: xAI-Grok
Disallow: /

User-agent: *
Disallow: /private-cloud-pages/
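
Because the blocklist has to be kept in step with the User-Agent list used in Layer 2, it can help to generate the file body from a single array rather than editing it by hand. The following is a minimal sketch in plain JavaScript; the function name buildBlocklistRobotsTxt and its parameters are hypothetical and not part of SFMC. The output would still need to be pasted into the Text Code Resource manually.

```javascript
// Hypothetical helper: builds a blocklist-style robots.txt body from an array
// of crawler tokens, so one array can drive both Layer 1 and Layer 2.
function buildBlocklistRobotsTxt(blockedAgents, defaultDisallowPaths) {
  var lines = [];
  for (var i = 0; i < blockedAgents.length; i++) {
    // One group per crawler; the rule follows its User-agent line directly.
    lines.push("User-agent: " + blockedAgents[i]);
    lines.push("Disallow: /");
    lines.push(""); // blank line separates groups
  }
  // Catch-all group for every crawler not named above.
  lines.push("User-agent: *");
  if (defaultDisallowPaths.length === 0) {
    lines.push("Disallow:"); // an empty Disallow means "no restriction"
  }
  for (var j = 0; j < defaultDisallowPaths.length; j++) {
    lines.push("Disallow: " + defaultDisallowPaths[j]);
  }
  return lines.join("\n") + "\n";
}

var robotsTxtBody = buildBlocklistRobotsTxt(
  ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai", "ClaudeBot", "GrokBot", "xAI-Grok"],
  ["/private-cloud-pages/"]
);
console.log(robotsTxtBody);
```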

Option B: The Allowlist Approach (Maximum Privacy)

Use this approach for data-processing pages, API endpoints, or preference centers where you want zero maintenance and maximum protection. WARNING: Because this blocks * (all bots), it will break rich social media link previews (Facebook, Slack, LinkedIn, etc.) unless you explicitly add their bots to the 'Allow' list.

# Disallow ALL bots by default to eliminate whack-a-mole maintenance
User-agent: *
Disallow: /

# Explicitly allow standard search engines for basic SEO
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
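
To see why Option B still lets Googlebot and Bingbot through even though the first group disallows everything, it helps to recall how a compliant crawler reads the file: it uses the group that names it and only falls back to the * group when no named group exists. The sketch below is a simplified evaluator written for illustration, assuming exact product-token matching and plain path prefixes (no * or $ wildcards). The function names and the /preference-center path are hypothetical.

```javascript
// Simplified robots.txt evaluation: named group beats "*", longest path wins,
// and Allow wins a tie. Wildcards inside paths are NOT handled.
function parseRobotsTxt(text) {
  var groups = [];          // each group: { agents: [...], rules: [{ allow, path }] }
  var current = null;
  var lastWasAgent = false;
  var lines = text.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].replace(/#.*$/, "").replace(/^\s+|\s+$/g, "");
    var sep = line.indexOf(":");
    if (line === "" || sep === -1) { continue; } // skip blanks, comments, junk
    var key = line.substring(0, sep).replace(/\s+$/, "").toLowerCase();
    var value = line.substring(sep + 1).replace(/^\s+/, "");
    if (key === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) { current = { agents: [], rules: [] }; groups.push(current); }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((key === "allow" || key === "disallow") && current) {
      current.rules.push({ allow: key === "allow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(robotsText, productToken, path) {
  var groups = parseRobotsTxt(robotsText);
  var token = productToken.toLowerCase();
  var named = [], wildcard = [], hasNamedGroup = false;
  for (var i = 0; i < groups.length; i++) {
    for (var j = 0; j < groups[i].agents.length; j++) {
      if (groups[i].agents[j] === token) { hasNamedGroup = true; named = named.concat(groups[i].rules); }
      if (groups[i].agents[j] === "*") { wildcard = wildcard.concat(groups[i].rules); }
    }
  }
  var rules = hasNamedGroup ? named : wildcard;
  var best = null;
  for (var k = 0; k < rules.length; k++) {
    var r = rules[k];
    if (r.path === "" || path.indexOf(r.path) !== 0) { continue; } // rule does not apply
    if (best === null || r.path.length > best.path.length ||
        (r.path.length === best.path.length && r.allow)) {
      best = r;
    }
  }
  return best === null ? true : best.allow; // no matching rule means allowed
}

var optionB = [
  "User-agent: *", "Disallow: /", "",
  "User-agent: Googlebot", "Allow: /", "",
  "User-agent: Bingbot", "Allow: /"
].join("\n");

console.log(isAllowed(optionB, "Googlebot", "/preference-center")); // true
console.log(isAllowed(optionB, "GPTBot", "/preference-center"));    // false
```

The same evaluator shows the social-preview side effect described in the warning above: any crawler without its own group falls through to the * group and is refused.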

Layer 2 & 3: The Enforcer Script & The "Bunker" Migration

This is the most critical step for cost-saving.

The Architectural Shift (Layer 3):

Standard CloudPages ("Landing Pages") charge 1 Super Message per impression. Code Resources (designed for hosting CSS/JS/JSON) do not consume Super Messages for impressions.

Action: If your CloudPage is primarily processing form data, acting as an API endpoint, or doesn't require complex visual rendering for human consumption, migrate its logic into an HTML or JSON Code Resource.

The SSJS Enforcer Script (Layer 2):

Whether you keep the page as a standard Landing Page or move it to a Code Resource, you must protect the compute resources and data using this Server-Side JavaScript block.

Implementation Rules:

This code must be placed on Line 1 of your CloudPage/Code Resource.
There must be absolutely no whitespace, line breaks, or HTML tags before <script runat="server">. Any prior output will cause the HTTP headers to be sent, defeating the purpose of the script.
Alignment with Layer 1: If you chose "Option B" (Allowlist) for your robots.txt, you may also want to drastically shorten this SSJS list to simply block everything that doesn't explicitly declare itself as a browser, though that requires advanced RegEx not covered here. The list below acts as a strong safety net for "Option A".

<script runat="server">
Platform.Load("Core", "1");

// Holds the matched token when the User-Agent is on the block list.
var blockedMatch = null;

try {
    var userAgent = Platform.Request.GetRequestHeader("User-Agent");

    // COMPREHENSIVE LLM & SCRAPER BLOCK LIST
    var blockedAgents = [
        // OpenAI
        "GPTBot", "ChatGPT-User", "OAI-SearchBot", "chatgpt-operator",
        // Anthropic
        "anthropic-ai", "Claude-Web", "ClaudeBot", "Claude-User", "claude-code",
        // Google (AI Specific - DO NOT BLOCK 'Googlebot' to preserve standard SEO)
        "Google-Extended", "GoogleOther", "Google-Vertex-AI", "Gemini-Deep-Research",
        // xAI / Grok
        "GrokBot", "xAI-Grok", "Grok-DeepSearch",
        // Perplexity AI
        "PerplexityBot", "Perplexity-User",
        // Aggressive Training Data Scrapers
        "CCBot", "Bytespider", "Bytedance", "Diffbot", "ImagesiftBot",
        "cohere-ai", "Omgilibot", "Omgili",
        // Social & Other LLMs (Note: FacebookBot blocks social link previews)
        "FacebookBot", "Meta-ExternalAgent", "Applebot-Extended",
        "Amazonbot", "Scrapy", "Go-http-client"
    ];

    if (userAgent && userAgent.length > 0) {
        var uaLower = userAgent.toLowerCase();
        for (var i = 0; i < blockedAgents.length; i++) {
            if (uaLower.indexOf(blockedAgents[i].toLowerCase()) > -1) {
                // 1. Set 403 Forbidden Header
                Platform.Response.SetResponseHeader("HTTP/1.1", "403 Forbidden");
                // 2. Write minimal response to save bandwidth
                Write("Access Denied: Automated scraping is not permitted.");
                blockedMatch = blockedAgents[i];
                break;
            }
        }
    }
} catch (e) {
    // Fail open: If SSJS evaluation fails, allow the page to load
    // to prevent accidental outages for legitimate human users.
}

if (blockedMatch !== null) {
    // 3. HARD STOP - Kills page pipeline immediately.
    // Raised outside the try block so the fail-open catch above cannot intercept it.
    Platform.Function.RaiseError("Blocked AI Crawler: " + blockedMatch, false, "statusCode", "403");
}
</script>
<!DOCTYPE html>
<head>
...
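
The implementation rules mention a stricter variant for "Option B" that admits only clients declaring themselves as browsers. A rough sketch of that idea follows, written as plain ES3-style JavaScript so the function could in principle be pasted into an SSJS block. Everything here is an assumption for illustration: the name looksLikeBrowser, the two regular expressions, and the decision to reject a missing User-Agent (which differs from the fail-open behavior of the block list script). The pattern lists are heuristics, not a vetted or SFMC-supplied list, and words such as "bot" can appear in legitimate device names, so false positives are possible.

```javascript
// Heuristic only: "admit self-declared browsers, reject everything else".
// Both patterns are illustrative assumptions and will need tuning.
var BOT_HINTS = /(bot|crawl|spider|scrape|slurp|fetch|http-client|python|curl|wget)/i;
var BROWSER_ENGINES = /(Chrome\/|Firefox\/|Safari\/|Edg\/|OPR\/)/;

function looksLikeBrowser(userAgent) {
  if (!userAgent) { return false; }                              // missing UA: treat as non-browser
  if (userAgent.indexOf("Mozilla/5.0") !== 0) { return false; }  // browsers lead with this token
  if (BOT_HINTS.test(userAgent)) { return false; }               // self-declared automation
  return BROWSER_ENGINES.test(userAgent);                        // must name a browser engine
}

// Illustrative User-Agent strings.
var chromeDesktop = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";
var declaredCrawler = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)";

console.log(looksLikeBrowser(chromeDesktop));        // true
console.log(looksLikeBrowser(declaredCrawler));      // false
console.log(looksLikeBrowser("Go-http-client/1.1")); // false
```

In an SFMC page, the result would replace the array loop: a false return would lead to the same 403 and RaiseError path. A client that spoofs a full browser User-Agent still passes this check.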

Deep Dive: Technical Nuances of the SSJS Script

The RaiseError Function: Platform.Function.RaiseError is the "nuclear option" in SFMC. It instantly halts the execution pipeline. By stopping the pipeline before your AMPscript personalization strings evaluate, you save database processing power and ensure no sensitive CRM data is accidentally rendered to the bot.
SEO Preservation: You will note that Googlebot and Bingbot are omitted from the block list. Blocking these will de-index your CloudPages from standard search engines. We explicitly target Google-Extended (Google's AI training crawler) instead.
Link Previews: The inclusion of FacebookBot and similar social scrapers means that if a user pastes the CloudPage link into iMessage, WhatsApp, or Facebook, the rich "preview card" (title and image) will not generate. If your marketing strategy relies on social sharing of these specific pages, you must remove social bots from the array. Note, however, that each preview fetch is itself a page request, so social sharing also consumes Super Messages.
The Value of SSJS vs. robots.txt (The Spoofing Caveat): Perfectly compliant bots respect robots.txt. Conversely, truly malicious bots will simply spoof their User-Agent to look like a standard human browser (e.g., Chrome on iOS), rendering this SSJS check ineffective against them. However, this SSJS layer is still highly valuable for three reasons:
- Page-Level Granularity: It allows you to enforce bot controls on a specific page without editing the global robots.txt configuration.
- "Grey Area" Scrapers & Live Agents: Many aggressive training scrapers (like Bytespider) or real-time AI agents responding to a human prompt (like ChatGPT-User) may bypass robots.txt rules but will still declare their true User-Agent to avoid triggering IP bans from Edge WAFs. SSJS stops these successfully.
- SFMC Routing Failsafe: Depending on your SAP (Sender Authentication Package) setup, getting a Code Resource to resolve perfectly at domain.com/robots.txt can be difficult. If the bot receives a 404 because of SFMC routing quirks, this SSJS script acts as your ultimate failsafe to prevent artificial CloudPage views.
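
The spoofing caveat is easy to demonstrate by isolating the script's substring test as a pure function that runs outside SFMC. This is a sketch for local experimentation only; the name findBlockedAgent is hypothetical and the two User-Agent strings are illustrative examples of a crawler that declares itself and a client presenting as Chrome on iOS.

```javascript
// The same case-insensitive substring test the SSJS script performs,
// isolated for testing. Returns the matching token, or null when none match.
function findBlockedAgent(userAgent, blockedAgents) {
  if (!userAgent) { return null; } // fail open on a missing header
  var uaLower = userAgent.toLowerCase();
  for (var i = 0; i < blockedAgents.length; i++) {
    if (uaLower.indexOf(blockedAgents[i].toLowerCase()) > -1) {
      return blockedAgents[i];
    }
  }
  return null;
}

var sampleList = ["GPTBot", "ClaudeBot", "Bytespider", "ChatGPT-User"];

// A crawler that declares itself honestly (illustrative string).
var declaredUa = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)";
// A client presenting itself as Chrome on iOS (illustrative string).
var spoofedUa = "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/120.0.0.0 Mobile/15E148 Safari/604.1";

console.log(findBlockedAgent(declaredUa, sampleList)); // "GPTBot"
console.log(findBlockedAgent(spoofedUa, sampleList));  // null
```

The second result is the limitation described above: nothing in a spoofed header distinguishes the bot from a human visitor, so only network-level signals can catch it.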

Alternative Architectural Consideration: Edge WAF

For customers who manage their own DNS for the domain used by the CloudPages (e.g., via Cloudflare, AWS WAF, or Akamai) rather than delegating it entirely to Salesforce via SAP, the ultimate zero-cost solution is Edge Blocking.

By applying WAF (Web Application Firewall) rules at the edge, in front of Salesforce, to drop requests based on the User-Agent string or known AI crawler IP addresses before they are ever routed to the Salesforce data center, you substantially reduce artificial CloudPage views.
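
Each WAF vendor has its own rule syntax, so the following is only a vendor-neutral sketch of the decision such a rule encodes: block on a User-Agent substring or on an IPv4 source address inside a listed CIDR range, otherwise forward to the origin. The function names, the shape of the request and rules objects, and the 192.0.2.0/24 range (a reserved documentation range, not a real crawler range) are all assumptions for illustration.

```javascript
// Convert a dotted IPv4 address to an unsigned integer; null if malformed.
function ipv4ToInt(ip) {
  var parts = String(ip).split(".");
  if (parts.length !== 4) { return null; }
  var n = 0;
  for (var i = 0; i < 4; i++) {
    if (!/^\d{1,3}$/.test(parts[i])) { return null; }
    var octet = parseInt(parts[i], 10);
    if (octet > 255) { return null; }
    n = n * 256 + octet;
  }
  return n;
}

// True when the IPv4 address falls inside the CIDR block (e.g. "192.0.2.0/24").
function inCidr(ip, cidr) {
  var pieces = cidr.split("/");
  var base = ipv4ToInt(pieces[0]);
  var addr = ipv4ToInt(ip);
  var bits = parseInt(pieces[1], 10);
  if (base === null || addr === null || isNaN(bits) || bits < 0 || bits > 32) { return false; }
  var size = Math.pow(2, 32 - bits); // number of addresses in the block
  return Math.floor(addr / size) === Math.floor(base / size);
}

// Hypothetical edge rule: returns a block decision or tells the edge to forward.
function edgeDecision(request, rules) {
  var ua = (request.userAgent || "").toLowerCase();
  for (var i = 0; i < rules.userAgents.length; i++) {
    if (ua.indexOf(rules.userAgents[i].toLowerCase()) > -1) {
      return { action: "block", status: 403, reason: "user-agent:" + rules.userAgents[i] };
    }
  }
  for (var j = 0; j < rules.cidrs.length; j++) {
    if (inCidr(request.ip, rules.cidrs[j])) {
      return { action: "block", status: 403, reason: "ip:" + rules.cidrs[j] };
    }
  }
  return { action: "forward" }; // request continues to the Salesforce origin
}

var edgeRules = {
  userAgents: ["GPTBot", "CCBot", "Bytespider"],
  cidrs: ["192.0.2.0/24"] // placeholder documentation range, NOT a real crawler range
};

console.log(edgeDecision({ userAgent: "CCBot/2.0", ip: "198.51.100.7" }, edgeRules).action);   // "block"
console.log(edgeDecision({ userAgent: "Mozilla/5.0", ip: "192.0.2.44" }, edgeRules).action);   // "block"
console.log(edgeDecision({ userAgent: "Mozilla/5.0", ip: "198.51.100.7" }, edgeRules).action); // "forward"
```

The IP check is what the SSJS layer cannot offer: it still applies when the User-Agent has been spoofed, provided the crawler operator publishes its address ranges.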
