How Google Discover qualifies, ranks, and filters content: Research

Google Discover pipeline

Google Discover runs on a structured, multi-stage pipeline with hard publisher blocks, strict image requirements, freshness decay, and heavy experimentation shaping what users see, according to new SDK-level research by Metehan Yesilyurt.

Why we care. Google Discover can drive massive traffic, but it often feels unpredictable. This research gives you a clearer view of how your content qualifies, gets ranked, or gets blocked — and where things can break before ranking even begins.

The details. Yesilyurt analyzed observable signals in Google’s Discover app framework and mapped a nine-stage flow. Google:

  • Crawls and understands your content.
  • Reads key meta tags like your image and title.
  • Classifies your content type (e.g., breaking news or evergreen).
  • Checks whether you’re blocked.
  • Matches your content to user interests.
  • Applies a server-side click-through rate prediction model.
  • Builds the feed layout.
  • Delivers your content.
  • Records user feedback.

One key finding. The publisher-level block happens before interest matching and ranking. If a user blocks you, your content never reaches the ranking stage.

  • Publisher blocking is powerful. One “Don’t show content from this site” action can suppress your entire domain. There’s no similar sitewide “boost” mechanism.
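The ordering matters. As a rough illustration, the nine stages and the early publisher-block short-circuit described above can be sketched like this (function and field names are hypothetical, not any real Google API):

```python
# Hypothetical sketch of the nine-stage flow mapped in the research.
# Stage names and ordering follow the article; everything else is illustrative.
def run_discover_pipeline(content: dict, user: dict) -> str:
    # Stages 1-3: crawl, read meta tags, classify content type (stubbed)
    content["type"] = "breaking" if content.get("is_news") else "evergreen"

    # Stage 4: publisher-level block check -- a hard stop BEFORE ranking
    if content["domain"] in user.get("blocked_domains", set()):
        return "suppressed"  # never reaches interest matching or pCTR

    # Stage 5: interest matching
    if not set(content.get("topics", [])) & set(user.get("interests", [])):
        return "not matched"

    # Stage 6: server-side pCTR prediction (stubbed)
    # Stages 7-9: feed layout, delivery, feedback logging (stubbed)
    return "delivered"
```

The key point the sketch captures: a blocked publisher exits at stage 4, so no title, image, or engagement signal downstream can compensate.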

The ranking model. Your title, image quality, and engagement history are part of the evaluation process. The system uses a predicted click-through rate (pCTR) model on Google’s servers to estimate how likely someone is to click. The model isn’t visible, but the app shows which signals are sent to Google before ranking decisions, including:

  • Your page title (from og:title).
  • Your image size and quality.
  • How new your content is.
  • Past click and impression data for your URL.
  • Whether your images load successfully.

Freshness matters. Google Discover groups content into time windows:

  • 1 to 7 days old: strongest boost.
  • 8 to 14 days: moderate visibility.
  • 15 to 30 days: limited visibility.
  • 30+ days: gradual decline.

There’s a separate classification for strong evergreen content, but by default, newer content has an advantage.
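The windows above map cleanly to tiers. A minimal sketch (the tier labels are illustrative, taken from the descriptions above, not from any Google API):

```python
# Hypothetical helper mirroring the freshness windows described in the research.
def freshness_tier(age_days: int) -> str:
    if age_days <= 7:
        return "strongest boost"
    if age_days <= 14:
        return "moderate visibility"
    if age_days <= 30:
        return "limited visibility"
    return "gradual decline"
```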

Image and meta tag requirements. Google Discover reads six key page-level tags, including og:image and og:title. No image means no card.

  • To qualify for large, prominent cards, your images must be at least 1200px wide. Smaller images typically appear as thumbnails and often earn fewer clicks.
  • If certain tags are missing, Google Discover looks for backups — for example, it will try the Twitter title tag or the HTML title if og:title isn’t present.
  • Two specific meta tags — “nopagereadaloud” and “notranslate” — can stop your page from entering Google Discover entirely.

Personalization layers. Google Discover personalizes content using:

  • Google’s broader interest data tied to user behavior.
  • Publisher signals, including Publisher Center registration.
  • Individual actions like follows, saves, and dismissals.
  • Engagement signals, such as time spent reading.

If a user dismisses your story, the system permanently stores that action for the specific URL, and the story won’t resurface for that user.

Experiments everywhere. During one observed session, about 150 server-side experiments were running simultaneously. Another 50+ feature controls affected how cards were displayed.

  • That means two similar users could see noticeably different feeds simply because they’re in different experiment groups.

Real-time feed updates. Google Discover isn’t static. The system can add, remove, or reorder content while someone is browsing, without a refresh.

The big takeaways. Success in Google Discover depends less on tricks and more on eligibility, trust, strong visuals, and sustained engagement — in a system that can filter you out before ranking even starts.

  • Publisher blocks happen before ranking.
  • Freshness is built into the system.
  • Strong images and clear titles are essential.
  • User dismissals are permanent.
  • Heavy experimentation makes volatility normal.

The research. Google Discover Architecture: Clusters, Classifiers, OG Tags, NAIADES – What SDK Telemetry Reveals

Anthropic clarifies how Claude bots crawl sites and how to block them

Anthropic bots

Anthropic updated its crawler documentation this week, clarifying how its Claude bots access websites and how you can block them.

  • Anthropic’s document explains what each bot does, how it affects AI training and search visibility, and how to opt out through robots.txt.

Why we care. If you publish or own content, you want control over how AI systems use it. Anthropic separates training crawlers, user-triggered fetches, and search indexing. Blocking one bot doesn’t block the others. Each choice carries different visibility and training trade-offs.

The robots. Anthropic uses three separate user agents:

  • ClaudeBot collects public web content that may be used to train and improve Anthropic’s generative AI models. If you block ClaudeBot in robots.txt, Anthropic said it will exclude your site’s future content from AI training datasets.
  • Claude-User retrieves content when a user asks Claude a question that requires access to a webpage. If you block Claude-User, Anthropic can’t fetch your pages in response to user queries. The company said this may reduce your visibility in user-directed search responses.
  • Claude-SearchBot crawls content to improve the quality and relevance of Claude’s search results. If you block Claude-SearchBot, Anthropic won’t index your content for search optimization, which may reduce visibility and accuracy in Claude-powered search answers.

How to block them. The bots respect standard robots.txt directives, including “Disallow” rules and the non-standard “Crawl-delay” extension, Anthropic said. To block a bot across your entire site:

User-agent: ClaudeBot
Disallow: /

  • You must add directives for each bot and each subdomain you want to restrict.
  • Anthropic said IP blocking may not work reliably because its bots use public cloud provider IP addresses. Blocking those ranges could also prevent the bots from accessing robots.txt. The company doesn’t publish its IP ranges.
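Following the documentation above, blocking all three bots sitewide means repeating the directive pair for each user agent:

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

Remember that this file only covers one host; each subdomain you want to restrict needs its own robots.txt with the same entries.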

The document. Does Anthropic crawl data from the web, and how can site owners block the crawler?

SerpApi moves to dismiss Google scraping lawsuit

Bot detection maze

SerpApi is asking a federal court to dismiss Google’s lawsuit, arguing the company is misusing copyright law to restrict access to public search results.

  • The motion was filed Feb. 20, according to a blog post by SerpApi CEO and founder Julien Khaleghy.
  • Google sued SerpApi in December, alleging it bypassed technical protections to scrape and resell content from Google Search.

The details. SerpApi argues Google is improperly invoking the Digital Millennium Copyright Act (DMCA). According to Khaleghy:

  • The DMCA protects copyrighted works, not websites or ad businesses.
  • Google doesn’t own the underlying content displayed in search results.
  • Accessing publicly visible pages isn’t “circumvention” under the statute.

Google’s complaint alleged SerpApi:

  • Circumvented bot-detection and crawling controls.
  • Used rotating bot identities and large bot networks.
  • Scraped licensed content from Search features, including images and real-time data.

SerpApi said it doesn’t decrypt systems, disable authentication, or access private data. Khaleghy said SerpApi retrieves the same information available to any user in a browser, without requiring a login.

Khaleghy also argued Google admitted its anti-bot systems protect its advertising business — not specific copyrighted works — which he said undermines the DMCA claim.

SerpApi cites the Ninth Circuit’s hiQ v. LinkedIn decision warning against “information monopolies” over public data. It also cites the Sixth Circuit’s Lexmark v. Static Control ruling to argue that public-facing content can’t be shielded by technical measures alone.

Catch up quick. The lawsuit follows months of escalating legal fights over scraping and AI data use.

  • Oct. 22: Reddit sued SerpApi, Perplexity, Oxylabs, and AWMProxy in federal court, alleging they scraped Reddit content indirectly from Google Search and reused or resold it. Reddit claimed the companies hid their identities and scraped at “industrial scale.” Reddit said it set a “trap” post visible only to Google’s crawler that later appeared in Perplexity results. Reddit is seeking damages and a ban on further use of previously scraped data.
  • Oct. 29: SerpApi said it would “vigorously defend” itself, calling Reddit’s language “inflammatory” and arguing public search data should remain accessible.
  • Dec. 19: Google sued SerpApi, alleging it bypassed security protections, ignored crawling directives, and scraped licensed Search content for resale. SerpApi responded that it operates lawfully and that accessing public search data is protected by the First Amendment.

By the numbers. SerpApi claims that, under Google’s interpretation of the DMCA, statutory damages could theoretically total $7.06 trillion — a figure it said exceeds U.S. GDP. The number reflects SerpApi’s calculation of potential per-violation penalties, not an actual damages demand.

What’s next. The court will now decide whether Google’s claims can proceed.

Why we care. The outcome could reshape how SEO platforms, AI tools, and competitive intelligence software access SERP data. A win for Google could make third-party search data harder or riskier to obtain. A win for SerpApi could strengthen arguments that publicly accessible search results can be scraped and collected.

The blog post. Google v. SerpApi: We’re filing a Motion to Dismiss. Here’s why we’re in the right.

Dig deeper. Inside SearchGuard: How Google detects bots and what the SerpAPI lawsuit reveals
