The five infrastructure gates behind crawl, render, and index

The DSCRI-ARGDW pipeline maps 10 gates between your content and an AI recommendation across two phases: infrastructure and competitive. Because confidence multiplies across the pipeline, the weakest gate is always your biggest opportunity. Here, we focus on the first five gates.
The infrastructure phase (discovery through indexing) is a sequence of absolute tests: the system either has your content, or it doesn’t. But a pass isn’t always clean: as content moves through the gates, signal degrades.
A page that can’t be rendered doesn’t get “partially indexed” as a halfway status; it may still get indexed, but with degraded information, and every competitive gate downstream operates on whatever survived the infrastructure phase.

If the raw material is degraded, the competition in the ARGDW phase starts with a handicap that no amount of content quality can overcome.
The industry compressed these five distinct DSCRI gates into two words: “crawl and index.” That compression hides five separate failure modes behind a single checkbox. This piece breaks the simplistic “crawl and index” into five clear gates so you can optimize far more effectively for the bots.
If you’re a technical SEO, you might feel you can skip this. Don’t.
You’re probably doing 80% of what follows and missing the other 20%. The gates below provide measurable proof that your content reached the index with maximum confidence, giving it the best possible chance in the competitive ARGDW phase that follows.
Sequential dependency: Fix the earliest failure first
The infrastructure gates are sequential dependencies: each gate’s output is the next gate’s input, and failure at any gate blocks everything downstream.
If your content isn’t being discovered, fixing your rendering is wasted effort, and if your content is crawled but renders poorly, every annotation downstream inherits that degradation. Better to be a straight-C student than four As and an F, because the F is the gate that kills your pipeline.
The audit starts with discovery and moves forward. The temptation to jump to the gate you understand best (and for many technical SEOs, that’s crawling) is the temptation that wastes the most money.
Discovery, selection, crawling: The three gates the industry already knows
Discovery and crawling are well-understood, while selection is often overlooked.
Discovery is an active signal. Three mechanisms feed it:
- XML sitemaps (the census).
- IndexNow (the telegraph).
- Internal linking (the road network).
The entity home website is the primary anchor for pull discovery, and confidence is key. The system asks not just “does this URL exist?” but “does this URL belong to an entity I already trust?” Content without entity association arrives as an orphan, and orphans wait at the back of the queue.
The push layer (IndexNow, MCP, structured feeds) changes the economics of this gate entirely, and I’ll explain what changes when you stop waiting to be found and start pushing.
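IndexNow’s push mechanics are simple enough to sketch: the protocol accepts a JSON POST containing your host, your verification key, and the URLs you want re-fetched, with the key also served as a text file on your domain for ownership verification. A minimal sketch in Python (the key and URLs are placeholders):

```python
import json
from urllib import request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def build_indexnow_payload(host: str, key: str, urls: list[str]) -> dict:
    """Build the JSON body the IndexNow protocol expects.

    The key must also be served at https://<host>/<key>.txt so the
    engine can verify you control the site.
    """
    return {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }

def submit(payload: dict) -> int:
    """POST the payload; a 200/202 response means it was accepted."""
    req = request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with request.urlopen(req) as resp:
        return resp.status

payload = build_indexnow_payload(
    "example.com",
    "your-indexnow-key",  # placeholder key
    ["https://example.com/new-page", "https://example.com/updated-page"],
)
# submit(payload)  # uncomment to actually push
```

One submission notifies all participating engines (the api.indexnow.org endpoint relays to the others), which is exactly the economic shift: you announce once instead of waiting to be found.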
Selection is the system’s opinion of you, expressed as crawl budget. As Microsoft Bing’s Fabrice Canel says, “Less is more for SEO. Never forget that. Less URLs to crawl, better for SEO.”
The industry spent two decades believing more pages equals more traffic. In the pipeline model, the opposite is true: fewer, higher-confidence pages get crawled faster, rendered more reliably, and indexed more completely. Every low-value URL you ask the system to crawl is a vote of no confidence in your own content, and the system notices.
Not every page that’s discovered in the pull model is selected. Canel states that the bot assesses the expected value of the destination page and will not crawl the URL if the value falls below a threshold.
Crawling is the most mature gate and the least differentiating. Server response time, robots.txt, redirect chains: solved problems with excellent tooling, and not where the wins are because you and most of your competition have been doing this for years.
What most practitioners miss, and what’s worth thinking about: Canel confirmed that context from the referring page carries forward during crawling.
Your internal linking architecture isn’t just a crawl pathway (getting the bot to the page) but a context pipeline (telling the bot what to expect when it arrives), and that context influences selection and then interpretation at rendering before the rendering engine even starts.
Rendering fidelity: The gate that determines what the bot sees
Rendering fidelity is where the infrastructure story diverges from what the industry has been measuring.
After crawling, the bot attempts to build the full page. It sometimes executes JavaScript (don’t count on this because the bot doesn’t always invest the resources to do so), constructs the document object model (DOM), and produces the rendered DOM.
I coined the term rendering fidelity to name this variable: how much of your published content the bot actually sees after building the page. Content behind client-side rendering that the bot never executes isn’t degraded, it’s gone, and information the bot never sees can’t be recovered at any downstream gate.
Every annotation, every grounding decision, every display outcome depends on what survived rendering. If rendering is your weakest gate, it’s your F on the report card, and remember: everything downstream inherits that grade.
The friction hierarchy: Why the bot renders some sites more carefully than others
The bot’s willingness to invest in rendering your page isn’t uniform. Canel confirmed that the more common a pattern is, the less friction the bot encounters.
I’ve reconstructed the following hierarchy from his observations. The ranking is my model. The underlying principle (pattern familiarity reduces selection, crawl, rendering, and indexing friction and processing cost) is confirmed:
| Approach | Friction level | Why |
| --- | --- | --- |
| WordPress + Gutenberg + clean theme | Lowest | 30%+ of the web. Most common pattern. Bot has highest confidence in its own parsing. |
| Established platforms (Wix, Duda, Squarespace) | Low | Known patterns, predictable structure. Bot has learned these templates. |
| WordPress + page builders (Elementor, Divi) | Medium | Adds markup noise. Downstream processing has to work harder to find core content. |
| Bespoke code, perfect HTML5 | Medium-High | Bot does not know your code is perfect. It has to infer structure without a pattern library to validate against. |
| Bespoke code, imperfect HTML5 | High | Guessing with degraded signals. |
The critical implication, also from Canel, is that if the site isn’t important enough (low publisher entity authority), the bot may never reach rendering because the cost of parsing unfamiliar code exceeds the estimated benefit of obtaining the content. Publisher entity confidence has a huge influence on whether you get crawled and also how carefully you get rendered (and everything else downstream).
JavaScript is the most common rendering obstacle, but it isn’t the only one: missing CSS, proprietary elements, and complex third-party dependencies can all produce the same result — a bot that sees a degraded version of what a human sees, or can’t render the page at all.
JavaScript was a favor, not a standard
Google and Bing render JavaScript. Most AI agent bots don’t. They fetch the initial HTML and work with that. The industry built on Google and Bing’s favor and assumed it was a standard.
Perplexity’s grounding fetches work primarily with server-rendered content. Smaller AI agent bots have no rendering infrastructure.
The practical consequence: a page that loads a product comparison table via JavaScript displays perfectly in a browser but renders as an empty container for a bot that doesn’t execute JS. The human sees a detailed comparison. The bot sees a div with a loading spinner.
The annotation system classifies the page based on an empty space where the content should be. I’ve seen this pattern repeatedly in our database: different systems see different versions of the same page because rendering fidelity varies by bot.
Three rendering pathways that bypass the JavaScript problem
The traditional rendering model assumes one pathway: HTML to DOM construction. You now have two alternatives.

WebMCP, built by Google and Microsoft, gives agents direct DOM access, bypassing the traditional rendering pipeline entirely. Instead of fetching your HTML and building the page, the agent accesses a structured representation of your DOM through a protocol connection.
With WebMCP, you give yourself a huge advantage because the bot doesn’t need to execute JavaScript or guess at your layout, because the structured DOM is served directly.
Markdown for Agents uses HTTP content negotiation to serve pre-simplified content. When the bot identifies itself, the server delivers a clean markdown version instead of the full HTML page.
The semantic content arrives pre-stripped of everything the bot would have to remove anyway (navigation, sidebars, JavaScript widgets), which means the rendering gate is effectively skipped with zero information loss. If you’re using Cloudflare, you have an easy implementation that they launched in early 2026.
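The serving logic behind this pattern is ordinary HTTP content negotiation. A minimal sketch, with the caveat that the bot names and the `text/markdown` check are illustrative assumptions, not a published standard:

```python
# Decide which representation to serve for a request. The bot list and
# the text/markdown negotiation below are illustrative assumptions.
AGENT_HINTS = ("gptbot", "claudebot", "perplexitybot", "ccbot")

def pick_representation(accept: str, user_agent: str) -> str:
    """Return 'markdown' when the client asks for or implies it, else 'html'."""
    accept = (accept or "").lower()
    ua = (user_agent or "").lower()
    if "text/markdown" in accept:
        return "markdown"
    if any(bot in ua for bot in AGENT_HINTS):
        return "markdown"
    return "html"

print(pick_representation("text/html", "Mozilla/5.0"))   # a browser gets HTML
print(pick_representation("text/markdown", "curl/8.0"))  # explicit request gets markdown
print(pick_representation("*/*", "GPTBot/1.1"))          # a known agent gets markdown
```

Human visitors see the full page; identified agents get the pre-stripped semantic content from the same URL.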
Both alternatives change the economics of rendering fidelity in the same way that structured feeds change discovery: they replace a lossy process with a clean one.
For non-Google bots, try this: disable JavaScript in your browser and look at your page, because what you see is what most AI agent bots see. You can fix the JavaScript issue with server-side rendering (SSR) or static site generation (SSG), so the initial HTML contains the complete semantic content regardless of whether the bot executes JavaScript.
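The same check can be automated: fetch the raw server response the way a non-rendering bot does and verify your key content is already in the initial HTML. A sketch (the URL and the phrases to check are placeholders):

```python
from urllib import request

def fetch_initial_html(url: str) -> str:
    """Fetch the server response without executing any JavaScript;
    this is roughly what a non-rendering bot works with."""
    with request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def missing_content(html: str, must_contain: list[str]) -> list[str]:
    """Return the phrases that are absent from the initial HTML."""
    lower = html.lower()
    return [p for p in must_contain if p.lower() not in lower]

# Example with an inline snippet; in practice pass fetch_initial_html(url).
snippet = "<html><body><div id='app'><span class='spinner'></span></div></body></html>"
print(missing_content(snippet, ["Product comparison", "Price"]))
```

If the function reports your comparison table or pricing as missing, that content only exists after JavaScript runs, and most AI agent bots will never see it.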
But the real opportunity lies in new pathways: one architectural investment in WebMCP or Markdown for Agents, and every bot benefits regardless of its rendering capabilities.
Conversion fidelity: Where HTML stops being HTML
Rendering produces a DOM. Indexing transforms that DOM into the system’s proprietary internal format and stores it. Two things happen here that the industry has collapsed into one word.
Rendering fidelity (Gate 3) measures whether the bot saw your content. Conversion fidelity (Gate 4) measures whether the system preserved it accurately when filing it away. Both losses are irreversible, but they fail differently and require different fixes.
The strip, chunk, convert, and store sequence
What follows is a mechanical model I’ve reconstructed from confirmed statements by Canel and Gary Illyes.
Strip: The system removes repeating elements: navigation, header, footer, and sidebar. Canel confirmed directly that these aren’t stored per page.
The system’s primary goal is to find the core content. This is why semantic HTML5 matters at a mechanical level. <nav>, <header>, <footer>, <aside>, <main>, and <article> tags tell the system where to cut. Without semantic markup, it has to guess.
Illyes confirmed at BrightonSEO in 2017 that finding core content at scale was one of the hardest problems they faced.
Chunk: The core content is broken into segments: text blocks, images with associated text, video, and audio. Illyes described the result as something like a folder with subfolders, each containing a typed chunk (he probably used the term “passage” — potato, potarto, tomato, tomarto). The page becomes a hierarchical structure of typed content blocks.
Convert: Each chunk is transformed into the system’s proprietary internal format. This is where semantic relationships between elements are most vulnerable to loss.
The internal format preserves what the conversion process recognizes, and everything else is discarded.
Store: The converted chunks are stored in a hierarchical structure.
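To see why semantic tags matter mechanically, here is a toy version of the strip step using Python’s stdlib html.parser. Real systems are vastly more sophisticated, but the principle is the same: the boilerplate tags give the parser unambiguous cut points.

```python
from html.parser import HTMLParser

# Toy "strip" step: keep text outside boilerplate elements, drop the rest.
BOILERPLATE = {"nav", "header", "footer", "aside", "script", "style"}

class CoreContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

page = """
<html><body>
  <header>Site name</header>
  <nav>Home | Blog</nav>
  <main><article><h1>Rendering fidelity</h1><p>Core content.</p></article></main>
  <footer>Copyright Example</footer>
</body></html>
"""
extractor = CoreContentExtractor()
extractor.feed(page)
print(extractor.chunks)  # ['Rendering fidelity', 'Core content.']
```

Remove the semantic tags and the parser has nothing to cut against; it has to guess, which is exactly the friction the gate punishes.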

The individual steps are confirmed. The specific sequence and the wrapper hierarchy model are my reconstruction of how those confirmed pieces fit together.
In this model, the repeating elements stripped in the first step are not discarded but stored at the appropriate wrapper level: navigation at site level, category elements at category level. The system avoids redundancy by storing shared elements once at the highest applicable level.
Like my “Darwinism in search” piece from 2019, this is a well-informed, educated guess. And I’m confident it will prove to be substantively correct.
The wrapper hierarchy changes three things you already do:
URL structure and categorization: Because each page inherits context from its parent category wrapper, URL structure determines what topical context every child page receives during annotation (the first gate in the phase I’ll cover in the next article: ARGDW).
A page at /seo/technical/rendering/ inherits three layers of topical context before the annotation system reads a single word. A page at /blog/post-47/ inherits one generic layer. Flat URL structures and miscategorized pages create annotation problems that might appear to be content problems.
Breadcrumbs validate that the page’s position in the wrapper hierarchy matches the physical URL structure (i.e., match = confidence, mismatch = friction). Breadcrumbs matter even when users ignore them because they’re a structural integrity signal for the wrapper hierarchy.
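One way to guarantee that match is to generate the BreadcrumbList markup from the URL path itself, so the declared hierarchy can never diverge from the physical one. A sketch (the label mapping from path segment to display name is a placeholder; real sites would look names up):

```python
import json

def breadcrumbs_from_path(origin: str, path: str) -> dict:
    """Build schema.org BreadcrumbList markup from a URL path, so the
    declared hierarchy always matches the physical URL structure."""
    segments = [s for s in path.strip("/").split("/") if s]
    items = []
    for i, seg in enumerate(segments, start=1):
        items.append({
            "@type": "ListItem",
            "position": i,
            "name": seg.replace("-", " ").title(),  # placeholder labeling
            "item": origin + "/" + "/".join(segments[:i]) + "/",
        })
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": items,
    }

markup = breadcrumbs_from_path("https://example.com", "/seo/technical/rendering/")
print(json.dumps(markup, indent=2))
```

Each ListItem position corresponds to one wrapper level, which is precisely the structural integrity signal the hierarchy model predicts.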
Meta descriptions: Google’s Martin Splitt suggested in a webinar with me that the meta description is compared to the system’s own LLM-generated summary of the page. If they match, a slight confidence boost. If they diverge, no penalty, but a missed validation opportunity.
Where conversion fidelity fails
Conversion fidelity fails when the system can’t figure out which parts of your page are core content, when your structure doesn’t chunk cleanly, or when semantic relationships fail to survive format conversion.
The critical downstream consequence that I believe almost everyone is missing: indexing and annotation are separate processes.
A page can be indexed but poorly annotated (stored but semantically misclassified). I’ve watched it happen in our database: a page is indexed, it’s recruited by the algorithmic trinity, and yet the entity still gets misrepresented in AI responses because the annotation was wrong.
The page was there. The system read it. But it read a degraded version (rendering fidelity loss at Gate 3, conversion fidelity loss at Gate 4) and filed it in the wrong drawer (annotation failure at Gate 5).
Processing investment: Crawl budget was only the beginning
The industry built an entire sub-discipline around crawl budget. That’s important, but once you break the pipeline into its five DSCRI gates, you see that it’s just one piece of a larger set of parameters: every gate consumes computational resources, and the system allocates those resources based on expected return. This is my generalization of a principle Canel confirmed at the crawl level.
| Gate | Budget type | What the system asks |
| --- | --- | --- |
| 1 (Selected) | Crawl budget | “Is this URL a candidate for fetching?” |
| 2 (Crawled) | Fetch budget | “Is this URL worth fetching?” |
| 3 (Rendered) | Render budget | “Is this page a candidate for rendering?” |
| 4 (Indexed) | Chunking/conversion budget | “Is this content worth carefully decomposing?” |
| 5 (Annotated) | Annotation budget | “Is this content worth classifying across all dimensions?” |
Each budget is governed by multiple factors:
- Publisher entity authority (overall trust).
- Topical authority (trust in the specific topic the content addresses).
- Technical complexity.
- The system’s own ROI calculation against everything else competing for the same resource.
The system isn’t just deciding whether to process but how much to invest. The bot may crawl you but render cheaply, render fully but chunk lazily, or chunk carefully but annotate shallowly (fewer dimensions). Degradation can occur at any gate, and the crawl budget is just one example of a general principle.
Structured data: The native language of the infrastructure gates
The SEO industry’s misconceptions about structured data run the full spectrum:
- The magic bullet camp that treats schema as the only thing they need.
- The sticky plaster camp that applies markup to broken pages, hoping it compensates for what the content fails to communicate.
- The ignore-it-entirely camp that finds it too complicated or simply doesn’t believe it moves the needle.
None of those positions is quite right.
Structured data isn’t necessary. The system can — and does — classify content without it. But it’s helpful in the same way the meta description is: it confirms what the system already suspects, reduces ambiguity, and builds confidence.
The catch, also like the meta description, is that it only works if it’s consistent with the page. Schema that contradicts the content doesn’t just fail to help: it introduces a conflict the system has to resolve, and the resolution rarely favors the markup.
When the bot crawls your page, structured data requires no rendering, interpretation, or language model to extract meaning. It arrives in the format the system already speaks: explicit entity declarations, typed relationships, and canonical identifiers.
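What “explicit entity declarations, typed relationships, and canonical identifiers” look like in practice: a minimal Organization block, sketched here as a Python dict that serializes to JSON-LD. All names, URLs, and the Wikidata ID are illustrative placeholders.

```python
import json

# Illustrative entity declaration; every value below is a placeholder.
entity = {
    "@context": "https://schema.org",
    "@type": "Organization",                     # explicit type declaration
    "@id": "https://example.com/#organization",  # canonical identifier
    "name": "Example Co",
    "url": "https://example.com/",
    "sameAs": [                                  # links to corroborating sources
        "https://www.wikidata.org/wiki/Q0000000",          # placeholder ID
        "https://www.linkedin.com/company/example-co",     # placeholder profile
    ],
    "founder": {"@type": "Person", "name": "Jane Example"},  # typed relationship
}
print(json.dumps(entity, indent=2))
```

No rendering, no inference, no language model required: the types, identifiers, and relationships arrive pre-declared.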
In my model, this makes structured data the lowest-friction input the system processes, and I believe it’s processed before unstructured content because it’s machine-readable by design. Semantic HTML tells the system which parts carry the primary semantic load, and semantic structure is what survives the strip-and-chunk process best because it maps directly to the internal representation.
Schema at indexing works the same way: instead of requiring the annotation system to infer entity associations and content types from unstructured text, schema declares them explicitly, like a meta description confirming what the page summary already suggested.
The system compares, finds consistency, and confidence rises. The entire pipeline is a confidence preservation exercise: pass each gate and carry as much confidence forward as possible. Schema is one of the cleaner tools for protecting that confidence through the infrastructure phase.
That said, Canel noted that Microsoft has reduced its reliance on schema. The reasons are worth understanding:
- Schema is often poorly written.
- It has attracted spam at a scale reminiscent of keyword stuffing 25 years ago.
- Small language models are increasingly reliable at inferring what schema once had to declare explicitly.
Schema’s value isn’t disappearing, but it’s shifting: the signal matters most where the system’s own inference is weakest, and least where the content is already clean, well-structured, and unambiguous.
Schema and HTML5 have been part of my work since 2015, and I’ve written extensively about them over the years. But I’ve always seen structured data as one tool among many for educating the algorithms, not the answer in itself. That distinction matters enormously.
Brand is the key, and for me, always has been.
Without brand, all the structured data in the world won’t save you. The system needs to know who you are before it can make sense of what you’re telling it about yourself.
Schema describes the entity and brand establishes that the entity is worth describing. Get that order wrong, and you’re decorating a house the system hasn’t decided to visit yet.
The practical reframe: structured data implementation belongs in the infrastructure audit, and it’s the format that makes feeds and agent data possible in the first place. But it’s a confirmation layer, not a foundation, and the system will trust its own reading over yours if the two diverge.
Why improve the infrastructure gates when you can skip them entirely?
The multiplicative nature of the pipeline means the same logic that makes your weakest gate your biggest problem also makes gate-skipping your biggest opportunity.
If every gate attenuates confidence, removing a gate entirely doesn’t just save you from one failure mode: it removes that gate’s attenuation from the equation permanently.
To make that concrete, here’s what the math looks like across seven approaches. The base case assumes 70% confidence at every gate, producing a 16.8% surviving signal across all five in DSCRI. Where an approach improves a gate, I’ve used 75% as the illustrative uplift.
These are invented numbers, not measurements. The point is the relative improvement, not the figures themselves.

| Approach | What changes | Entering ARGDW with |
| --- | --- | --- |
| Pull (crawl) | Nothing | 16.8% |
| Schema markup | I → 75% | 18.0% |
| WebMCP | R skipped | 24.0% |
| IndexNow | D skipped, S → 75% | 25.7% |
| IndexNow + WebMCP | D skipped, S → 75%, R skipped | 36.8% |
| Feed (Merchant Center, Product Feed) | D, S, C, R skipped | 70.0% |
| MCP (direct agent data) | D, S, C, R, I skipped | 100% |
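The arithmetic behind the table is just a product of per-gate confidences, with a skipped gate contributing no attenuating factor at all. A quick reproduction of the illustrative numbers (shown at two decimals, slightly more precision than the table’s rounding):

```python
from math import prod

def surviving(confidences: list[float]) -> float:
    """Multiply per-gate confidence; skipped gates simply aren't in the list."""
    return prod(confidences)  # prod of an empty list is 1.0 (nothing attenuates)

BASE, UPLIFT = 0.70, 0.75  # invented illustrative figures, as in the table

scenarios = {
    "Pull (crawl)":      [BASE] * 5,             # D, S, C, R, I all at base
    "Schema markup":     [BASE] * 4 + [UPLIFT],  # I uplifted
    "WebMCP":            [BASE] * 4,             # R skipped
    "IndexNow":          [UPLIFT] + [BASE] * 3,  # D skipped, S uplifted
    "IndexNow + WebMCP": [UPLIFT] + [BASE] * 2,  # D and R skipped, S uplifted
    "Feed":              [BASE],                 # only I remains
    "MCP":               [],                     # every infrastructure gate skipped
}
for name, gates in scenarios.items():
    print(f"{name:<20} {surviving(gates):.2%}")
```

The multiplicative structure is the whole argument: removing a factor from the product always beats improving it.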
The infrastructure phase is pre-competitive. The annotation, recruited, grounded, displayed, and won (ARGDW) gates are where your content competes against every alternative the system has indexed. Competition is multiplicative too, so what you carry into annotation is what gets multiplied.
A brand that navigated all five DSCRI gates with 70% enters the competitive phase with 16.8% confidence intact. A brand on a feed enters with 70%. A brand on MCP enters with 100%. The competitive phase hasn’t started yet, and the gap is already that wide.
There’s an asymmetry worth naming here. Getting through a DSCRI gate with a strong score is largely within your control: the thresholds are technical, the failure modes are known, and the fixes have playbooks.
Getting through an ARGDW gate with a strong score depends on how you compare to all the alternatives in the system. The playbooks are less well developed, some don’t exist at all (annotation, for example), and you can’t control the comparison directly — you can only influence it.
Which means the confidence you carry into annotation is the only part of the competitive phase you can fully engineer in advance.
Optimizing your crawl path with schema, WebMCP, IndexNow, or combinations of all three will move the needle, and the table above shows by how much. But a feed or MCP connection changes what game you’re playing.
Every content type benefits from skipping gates, but the benefit scales with the business stakes at the end of the pipeline, and nothing has more at stake than content where the end goal is a commercial transaction.
The MCP figure represents the best case for the DSCRI phase: direct data availability bypasses all five infrastructure gates. In practice, the number of gates skipped depends on what the MCP connection provides and how the specific platform processes it. The principle holds: every gate skipped is an exclusion risk avoided and potential attenuation removed before competition starts.
A product feed is only the first rung. Andrea Volpini walked me through the full capability ladder for agent readiness:
- A feed gives the system inventory presence (it knows what exists).
- A search tool gives the agent catalog operability (it can search and filter without visiting the website).
- An action endpoint tips the model from assistive to agentic — the agent doesn’t just recommend the transaction, it closes it.

That distinction is what I built AI assistive agent optimization (AAO) around: engineering the conditions for an agent to act on your behalf, not just mention you.
Volpini’s ladder makes the mechanic concrete: each rung skips more gates, removes more exclusion risk, and eliminates more potential attenuation before competition starts. A brand with all three is playing a different game from a brand that’s still waiting for a bot to crawl its product pages.
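Volpini’s three rungs map naturally onto agent tool declarations. The sketch below follows the general {name, description, inputSchema} shape of MCP tool definitions; the tool names, fields, and descriptions are hypothetical.

```python
# Hypothetical tool declarations for the three rungs of agent readiness.
# The {name, description, inputSchema} shape follows MCP's tool format;
# the specific tools and fields are illustrative.
feed_rung = {
    "name": "get_product_feed",
    "description": "Inventory presence: the full catalog as structured data.",
    "inputSchema": {"type": "object", "properties": {}},
}
search_rung = {
    "name": "search_catalog",
    "description": "Catalog operability: search and filter without visiting the site.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_price": {"type": "number"},
        },
        "required": ["query"],
    },
}
action_rung = {
    "name": "create_order",
    "description": "Agentic action: the agent closes the transaction.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "sku": {"type": "string"},
            "quantity": {"type": "integer", "minimum": 1},
        },
        "required": ["sku", "quantity"],
    },
}
ladder = [feed_rung, search_rung, action_rung]
print([tool["name"] for tool in ladder])
```

Each successive declaration hands the agent more capability: the first says what exists, the second lets it query, the third lets it act.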
Note: Always keep this in mind when optimizing your site and content — make your content friction-free for bots and tasty for algorithms.
DSCRI are absolute tests, ARGDW are competitive tests. The pivot is annotation.
Five gates. Five absolute tests. Pass or fail (and a degrading signal even on pass).
The solutions are well documented:
- Discovery failures fixed with sitemaps and IndexNow.
- Selection failures with pruning and entity signal clarity.
- Crawling failures with server configuration.
- Rendering failures with server-side rendering or the new pathways that bypass the problem entirely.
- Indexing failures with semantic HTML, canonical management, and structured data.
The infrastructure phase is the only phase with a playbook, and an infrastructure failure is the cheapest failure pattern to fix.
But DSCRI is only half the pipeline, and it’s the easiest to deal with.
After indexing, the scoreboard turns on. The five competitive gates (ARGDW) are competitive tests: your content doesn’t just need to pass, it needs to beat the competition. What your content carries into the kickoff stage of those competitive gates is what survived DSCRI. And the entry gate to ARGDW is annotation.
The next piece opens annotation: the gate the industry has barely begun to address. It’s where the system attaches sticky notes to your indexed content across 24+ dimensions, and every algorithm in the ARGDW phase uses those notes to decide what your content means, who it’s for, and whether it deserves to be recruited, grounded, displayed, and recommended.
Those sticky notes are the be-all and end-all of your competitive position, and almost nobody knows they exist.
In “How the Bing Q&A / Featured Snippet Algorithm Works,” in a section I titled “Annotations are key,” I explained what Ali Alvi told me on my podcast: “Fabrice and his team do some really amazing work that we actually absolutely rely on.”
He went further: without Canel’s annotations, Bing couldn’t build the algos to generate Q&A at all. A senior Microsoft engineer, on the record, in plain language.
The evidence trail has been there for six years. That, for me, makes annotation the biggest untapped opportunity in search, assistive, and agential optimization right now.
This is the third piece in my AI authority series.
- The first, “Rand Fishkin proved AI recommendations are inconsistent – here’s why and how to fix it,” introduced cascading confidence.
- The second, “AAO: Why assistive agent optimization is the next evolution of SEO,” named the discipline.
- The third, “The AI engine pipeline: 10 gates that decide whether you win the recommendation,” mapped the full pipeline.
- Up next: “How AI decides what your content means (and why it gets you wrong).”
