Somewhere in a data centre you will never visit, a machine is making a decision about your content. Not whether to rank it. Not whether to index it. Whether to trust it enough to say it out loud.

This is the question that has replaced "how do I rank on Google?" - and almost nobody is asking it yet. The old question was about position. This question is about selection. Position meant appearing on a list. Selection means being chosen as the answer. The difference between the two is the difference between being on a bookshelf and being quoted in a conversation.

In the first article of this series, we explored the tectonic shift: AI answer engines are replacing traditional search, and the traffic patterns of the open web are fundamentally changing. But understanding that the shift is happening only gets you so far. The question that matters now - the question worth 2,500 words and your full attention - is how do these machines decide which content to cite and which to ignore?

Because they do decide. Every time Google's AI Overview generates an answer, every time Perplexity synthesises a response, every time ChatGPT draws on web sources to answer a query, a selection process is occurring. Millions of pages are considered. A handful are cited. Most are invisible.

What separates the cited from the invisible is not what most people think.

A Tale of Two Bakeries

Let me tell you about two websites that sell artisan sourdough. This is a real pattern, a composite drawn from dozens of actual cases, and it illustrates the citation selection problem better than any abstract explanation could.

The first site - call it Breadcraft - has been online since 2016. It has 140 blog posts about sourdough techniques, fermentation science, flour sourcing, and starter maintenance. The writing is excellent. The photography is beautiful. The site has accumulated over 3,000 backlinks from food bloggers, recipe aggregators, and a few major publications. By every traditional SEO metric, Breadcraft is a strong domain.

The second site - call it Levain Lab - launched in 2023. It has 40 articles. Fewer backlinks. Less domain authority. By the old rules, Levain Lab should be invisible next to Breadcraft.

But when someone asks Google's AI Overview "what is the ideal hydration for a beginner sourdough loaf?" - it is Levain Lab that gets cited. Breadcraft is nowhere in the generated answer.

Why?

The answer is not about who wrote better content. Breadcraft's article on hydration is arguably more comprehensive. The answer is about how that content is structured, marked up, and made legible to a machine that is not browsing - it is reading.

How an AI Engine Actually Reads

To understand citation selection, you need to understand something that sounds simple but has profound implications: AI answer engines do not experience your website the way a human does.

A human visitor arrives at your page and sees a headline, a hero image, some body text, maybe a sidebar with related articles. They scan, they scroll, they absorb the layout and the aesthetic and the tone. The experience is holistic and visual.

An AI engine arrives at your page and sees something entirely different. It sees the HTML document object model - the raw structure beneath the presentation. It sees heading tags and paragraph tags and list items and metadata. If you have implemented structured data markup, it sees a layer of semantic annotation that tells it, in machine-readable language, what each piece of content means. Not just what the words say, but what the entities are, how they relate to each other, and what claims are being made.

This is the critical distinction. A human reads for meaning and infers structure. A machine reads structure and infers meaning.

When Levain Lab published its article on sourdough hydration, the founder - a former software engineer who baked as a hobby - did something that most content creators never think to do. She marked up her content with structured data. Not just the basic Article schema that an SEO plugin auto-generates, but detailed, attribute-rich markup that identified the specific entities in her content: hydration ratios as quantitative values, flour types as defined materials, fermentation times as durations, and her own authorship as a verifiable person entity with credentials linked to other published work.
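The kind of markup described above might look something like the following sketch, expressed here as a Python dictionary serialised to JSON-LD. The schema.org types used (`Article`, `Person`, `QuantitativeValue`) are real, but the specific property choices, names, and values are illustrative assumptions, not Levain Lab's actual markup.

```python
import json

# Illustrative JSON-LD for an article on sourdough hydration.
# The types come from schema.org; the property choices are one
# reasonable way to expose entities, not a prescribed standard.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Ideal Hydration for a Beginner Sourdough Loaf",
    "author": {
        "@type": "Person",
        "name": "Jane Doe",                       # hypothetical author
        "sameAs": ["https://example.com/about"],  # identity-verifying links
    },
    "about": {
        # The hydration ratio exposed as a measurable entity,
        # not just a number buried in prose.
        "@type": "QuantitativeValue",
        "name": "dough hydration",
        "minValue": 65,
        "maxValue": 70,
        "unitText": "percent",
    },
    "dateModified": "2024-05-01",
}

# The serialised form is what would sit inside a
# <script type="application/ld+json"> tag on the page.
json_ld = json.dumps(article, indent=2)
print(json_ld)
```

The point is not this exact shape, it is that every claim a machine might want to quote is exposed as a typed value rather than inferred from prose.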

Breadcraft's article, by contrast, was beautifully written prose with basic metadata. To a human reader, it was superior. To the AI engine parsing both pages for citation-worthy content, Levain Lab's article was a structured knowledge object, and Breadcraft's was a wall of text with a title tag.

The data bears this out at scale. Pages with attribute-rich schema markup - structured data that goes beyond the minimum and actually describes the entities and relationships within the content - achieve a citation rate of approximately 61.7% in AI-generated answers. Pages with generic or minimal schema perform dramatically worse.

That single number - 61.7% - should rewrite how every content creator thinks about their work.

The Citation Funnel

Imagine the process an AI engine goes through when it needs to answer a question. The mechanics vary between systems - Google's AI Overview operates differently from Perplexity, which operates differently from ChatGPT's browsing mode - but the broad architecture follows a consistent pattern.

Stage one: Candidate retrieval. The engine identifies a pool of potentially relevant pages. This is the closest step to traditional search, and traditional signals like topical relevance and domain reputation play a role here. If your page is not in the candidate pool, nothing else matters. This is table stakes.

Stage two: Content parsing. The engine reads the content of each candidate page. This is where the shift from retrieval to synthesis begins. The machine is not just checking whether the page mentions the query terms - it is parsing the actual content, extracting claims, identifying entities, and mapping the relationships between them. Pages with clear structure and semantic markup are dramatically easier for the engine to parse. Pages that bury their answers in long narrative preambles, or that scatter relevant information across multiple sections without clear connections, are harder to parse and therefore less likely to be cited.

Stage three: Trust evaluation. The engine assesses whether it trusts the content enough to base its answer on it. This is the least understood and most consequential stage. Trust evaluation involves multiple signals: the consistency of the content with other trusted sources, the verifiability of the author or publishing entity, the presence of structured claims that the engine can cross-reference, and the recency and specificity of the information. An engine will not cite a page that contradicts the consensus of its other sources unless that page provides compelling, structured evidence for the divergence.

Stage four: Citation selection. From the pages that survived parsing and trust evaluation, the engine selects which to cite in its generated answer. This is not always the "best" page in an absolute sense - it is the page whose content is most directly responsive to the specific query, most parseable, and most trustworthy in context. A mediocre page with excellent structure can beat an excellent page with mediocre structure, because the engine needs to be confident in what it is quoting.

This four-stage funnel explains why Levain Lab beats Breadcraft. Breadcraft passes stage one easily - it is a strong, relevant domain. But in stages two and three, its unstructured content becomes a liability. The engine has to work harder to extract the answer, and it has fewer structured signals to evaluate trustworthiness. Levain Lab's content, by contrast, flows through the funnel with minimal friction. The entities are marked up. The answer is clearly structured. The author is a verifiable entity. The engine can parse, evaluate, and cite with confidence.
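The four stages can be sketched as a simple pipeline. To be clear, this is an illustrative toy model, not any engine's real implementation: the scores and thresholds are placeholders standing in for proprietary systems.

```python
# A toy model of the four-stage citation funnel. Pages carry crude
# signal scores; every threshold here is an illustrative placeholder.

def citation_funnel(query_terms, pages):
    # Stage 1: candidate retrieval on topical overlap.
    candidates = [p for p in pages if query_terms & set(p["terms"])]
    # Stage 2: content parsing - pages too hard to parse fall out.
    parsed = [p for p in candidates if p["parse_score"] >= 0.5]
    # Stage 3: trust evaluation - verifiable, consistent sources survive.
    trusted = [p for p in parsed if p["trust_score"] >= 0.5]
    # Stage 4: citation selection - the most parseable, trusted page wins.
    trusted.sort(key=lambda p: p["parse_score"] + p["trust_score"],
                 reverse=True)
    return trusted[:1]  # the handful that get cited

pages = [
    {"name": "breadcraft", "terms": {"sourdough", "hydration"},
     "parse_score": 0.4, "trust_score": 0.7},  # strong domain, weak structure
    {"name": "levain-lab", "terms": {"sourdough", "hydration"},
     "parse_score": 0.9, "trust_score": 0.8},  # structured, verifiable
]

cited = citation_funnel({"sourdough", "hydration"}, pages)
print([p["name"] for p in cited])
```

In this toy run the unstructured page clears stage one but falls out at parsing, exactly the pattern described above.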

The Entity Web

There is a concept in information science called entity resolution - the process of determining that two references point to the same real-world thing. When you write "sourdough starter" and a food science paper writes "levain culture" and a baking textbook writes "natural yeast preferment," an AI engine needs to understand that all three refer to the same entity.

The sites that earn the most citations are the ones that make this resolution effortless. They use consistent entity naming. They link their entities to established knowledge bases. They create what amounts to a web of interconnected concepts within their own content - a local knowledge graph that the AI engine can traverse.
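One lightweight way to make entity resolution effortless is to map every surface name onto a single canonical identifier, for instance a shared knowledge-base ID. The alias table below is a toy illustration of that idea; the identifier is a made-up placeholder, not a real knowledge-base entry.

```python
# Toy entity resolution: map surface forms to one canonical entity.
# The alias list and the identifier "Q-SOURDOUGH-STARTER" are illustrative.

CANONICAL = {
    "sourdough starter": "Q-SOURDOUGH-STARTER",
    "levain culture": "Q-SOURDOUGH-STARTER",
    "natural yeast preferment": "Q-SOURDOUGH-STARTER",
}

def resolve(mention):
    """Return the canonical entity ID for a surface mention, if known."""
    return CANONICAL.get(mention.lower())

# All three phrasings resolve to the same entity.
print(resolve("Sourdough starter"), resolve("levain culture"))
```

On a real site the same effect is achieved in markup, for example with schema.org `sameAs` links pointing each mention at the same external reference.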

Think of it this way. If your website is a room full of loose papers, the AI engine has to read every page, figure out what each one is about, and hope it can piece together a coherent answer. If your website is a well-organised filing cabinet with labelled drawers, indexed folders, and cross-references between related documents, the engine can find what it needs in seconds and trust that the organisation reflects genuine expertise.

The entity web is not about keyword stuffing. It is not about cramming schema markup onto every page. It is about creating genuine semantic connections between the concepts in your content - connections that reflect how an expert actually thinks about the subject.

Levain Lab's founder understood this intuitively, because her engineering background taught her to think in systems. Every article on her site referenced related articles not just with hyperlinks but with semantic markup that told the AI engine why those articles were related. Her page on hydration ratios was semantically connected to her page on flour protein content, which was semantically connected to her page on gluten development, which was semantically connected to her page on crumb structure. The engine could traverse this web and build a comprehensive understanding of the domain - and that comprehensive understanding made it more confident in citing any individual page.

Breadcraft's articles, by contrast, were isolated pieces. Excellent individually. But disconnected in the way that mattered to the machine. There was no entity web. No semantic connective tissue. Just 140 good articles floating in the same domain without a structured relationship between them.

What the Machine Cannot See

Here is something that should unsettle you: the visual design of your website is almost entirely invisible to an AI citation engine.

The hero image you spent $2,000 on. The custom typeface. The animated scroll effects. The carefully chosen colour palette. The AI engine does not see any of it. It sees the document structure beneath. If that structure is clean, well-marked, and semantically rich, your page is a strong citation candidate regardless of how it looks. If that structure is messy, generic, or unmarked, your page is a weak citation candidate regardless of how beautiful it is.

This is a hard truth for designers and brand-focused marketers. Visual design matters enormously for human experience. It matters very little for AI citation. The investment in structure - in markup, in entity identification, in answer-first formatting - is the investment that determines whether your content gets cited or ignored.

This does not mean design is irrelevant. When a human does click through an AI citation, the experience they find on your site matters immensely for conversion. The 14.2% conversion rate and 10-minute session duration we discussed in Article 1 are partly attributable to the quality of the destination experience. But the selection decision - the decision about whether your content appears in the AI answer at all - is made on structure, not aesthetics.

The Answer-First Imperative

There is a pattern in the content that gets cited most frequently, and it is almost embarrassingly simple: the answer appears near the top.

In traditional web content, there is an entire industry built around "hook" structures - opening with a story, building tension, creating curiosity, and eventually delivering the payoff. This works for human readers. It is terrible for AI citation.

An AI engine parsing your page for a specific answer is not reading for pleasure. It is scanning for the most direct, most clearly stated, most parseable answer to the query it is trying to address. If your answer to "what is the ideal hydration for a beginner sourdough loaf?" appears in the seventh paragraph after an anecdote about your grandmother's kitchen, the engine has to do significant work to extract it. If the answer appears in a clearly marked section near the top - "A beginner sourdough loaf typically uses 65-70% hydration" - the engine can extract, verify, and cite it with minimal friction.

This is the answer-first imperative. It does not mean your content cannot have narrative, depth, or personality. It means the core answer should be structurally accessible - marked up, positioned prominently, and stated clearly - so the machine can find it without excavation.

The best AI-citable content does both: it leads with a clear, structured answer and then provides the depth, context, and narrative that makes a human reader stay for ten minutes. Structure for the machine. Substance for the human. Both, not either-or.
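An answer-first structure can also be made explicit in markup. The schema.org `FAQPage`, `Question`, and `Answer` types used below are real; the wording is a sketch of how the hydration example might be exposed.

```python
import json

# Answer-first markup: the direct answer is stated up front and also
# exposed as a structured question/answer pair. Wording is illustrative.
faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What is the ideal hydration for a beginner sourdough loaf?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "A beginner sourdough loaf typically uses 65-70% hydration.",
        },
    }],
}

print(json.dumps(faq, indent=2))
```

The prose above the fold carries the same sentence for human readers; the markup simply removes any excavation work for the machine.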

The Recency Signal

One signal that is gaining importance in citation selection - and that most content strategies underestimate - is recency. Not just the publication date, but the last updated date and the freshness of the data within the content.

AI engines are increasingly sensitive to temporal relevance. A page about sourdough hydration that was last updated in 2024 with references to current flour milling standards will be preferred over a page from 2019 that contains the same information but has never been refreshed. The content may be identical in substance, but the signal of active maintenance tells the engine that someone is still standing behind this information.

This has implications for content strategy. The old SEO model of "publish and rank" is giving way to a model of "publish, maintain, and stay current." Pages that are regularly updated - not rewritten, but genuinely updated with current data and references - accumulate a freshness signal that strengthens their citation candidacy over time.

It also creates an asymmetric advantage for smaller publishers. A niche site with 40 articles that are all actively maintained can outperform a large site with 4,000 articles that were published once and forgotten. The AI engine does not care about your archive depth. It cares about whether the content it is considering citing reflects current knowledge.

The Authorship Question

There is one more signal that deserves attention, because it sits at the intersection of everything we have discussed and points toward where this entire landscape is heading.

AI engines are getting better at evaluating authorship. Not just "does this page have an author byline?" but "is this author a real, verifiable entity with a track record on this topic?"

Google's own documentation on content quality has emphasised expertise, experience, authoritativeness, and trustworthiness - what the industry calls E-E-A-T - for years. But in the traditional search context, these were largely proxy signals. Backlinks from authoritative sites served as a rough indicator of authority. Author bylines were more decorative than functional.

In the AI citation context, authorship is becoming a first-class signal. Engines are beginning to connect content to authors, authors to credentials, credentials to institutions, and institutions to trust networks. A page about cardiac health written by a verified cardiologist with published research is treated differently from the same information written by an anonymous content mill. The content may be identical. The trust signal is not.
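Making authorship machine-verifiable starts with exposing the author as a structured entity. The schema.org properties used below (`hasCredential`, `sameAs`, `jobTitle`) are real; every name, credential, and URL in the sketch is hypothetical.

```python
import json

# Verifiable authorship: a Person entity linked to credentials and to
# external identity pages. All names and URLs here are hypothetical.
author = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Dr. A. Example",
    "jobTitle": "Cardiologist",
    "hasCredential": {
        "@type": "EducationalOccupationalCredential",
        "credentialCategory": "MD",
    },
    "sameAs": [
        "https://example.org/staff/a-example",  # institutional profile
        "https://example.com/publications",     # track record on the topic
    ],
}

print(json.dumps(author, indent=2))
```

The `sameAs` links are the load-bearing part: they give the engine independent pages against which the claimed identity and credentials can be cross-checked.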

This is where the future gets interesting - and where the structural foundations being laid today will matter enormously tomorrow. The ability to verify who created a piece of content, what their credentials are, and whether those credentials are genuine is becoming a prerequisite for earning AI citations in high-stakes domains. Today, this verification is rough and incomplete. Tomorrow, it will need to be cryptographically provable.

We are not there yet. But the direction is unmistakable.

What You Can Do Now

If Article 1 was about awareness - understanding that the ground has shifted - this article is about the first layer of response. Here is what the citation selection mechanism tells us about how to build content that AI engines want to cite.

Structure your content for machine parsing, not just human reading. This means semantic HTML, heading hierarchies that reflect the actual information architecture, and structured data markup that goes beyond the minimum. Mark up your entities - the people, organisations, products, and concepts in your content - with enough specificity that an AI engine can identify them unambiguously.

Build your entity web. Do not publish isolated articles. Create semantic connections between your content pieces that reflect genuine topical expertise. Make it easy for an engine to traverse your site and build a comprehensive understanding of your domain.

Lead with answers. Place your clearest, most direct response to the query near the top of your content, in a structurally accessible format. Then go deep. Provide the narrative, the context, the nuance - all the things that make a human reader stay and engage. But give the machine what it needs first.

Maintain your content. Update publication dates when you refresh content. Keep your data current. An actively maintained page with current references signals to the engine that someone is still responsible for this information.

Establish verifiable authorship. Connect your content to real, identifiable authors with credentials that can be confirmed. This signal is growing in importance, and the sites that build it early will have a structural advantage as verification standards mature.

None of this requires a massive budget. None of it requires a team of engineers. What it requires is a shift in how you think about your content - from something designed to be found by a search engine to something designed to be trusted by an answer engine.

The difference is everything.


Next in the series: "Can You Actually Measure If AI Trusts Your Content?" - exploring the emerging science of AI trust scoring and what it means for the future of content credibility.