GUIDE · 11 MIN READ

Schema Markup for AI: The 2026 Field Guide

Which JSON-LD schemas move the model — and which are noise. A working playbook, not a reference dump.

Structured data is one of the highest-leverage things you can ship for AI visibility. The catch: most teams either skip it because “schema is dead” (it isn't — Google retired presentation rich results, not parsing) or carpet-bomb their site with every schema they can find (which is worse than none, because contradictory data confuses the model).

This guide focuses on the five schemas that consistently move citation rates in 2026, the order to ship them, and the validation loop that catches drift before it tanks your score.

TL;DR

  • Implement Organization first. Without it the model doesn't reliably resolve “who you are.”
  • Add FAQPage to the 5–10 pages with the highest commercial intent. Single biggest unlock for citation slots.
  • Product + Offer if you sell SKUs. HowTo for tutorial content. Skip schemas that don't describe what's on the page.
  • Validate weekly. A single syntax error makes the JSON-LD block it sits in unparseable, so every fact in that block disappears at once.
  • Do not stack “decorative” schemas (e.g. WebSite with hand-coded SearchAction) if they don't reflect on-page reality.

Why Schema Still Matters

The argument that “Google deprecated rich snippets, so schema is dead” is only half-true. Google deprecated visual rich results for some categories (FAQ rich results in regular search, for example). The parsing layer that consumes JSON-LD is still very much alive — and the LLM ecosystem now leans on it harder than traditional search ever did.

When an LLM crawls a page to ground an answer, it has two extraction paths: parse the rendered HTML (slow, noisy, prone to ad and navigation chrome contamination) or read the JSON-LD block (clean, machine-shaped, ambiguity-resistant). If your page provides good JSON-LD, the model takes that path. If it doesn't, the model falls back to HTML extraction — which is where attribution drift and hallucinated citations come from.

We've seen pages double their citation rate inside a single retraining cycle just from adding a clean Organization + FAQPage pair. The mechanism is unglamorous: the model now has unambiguous facts to anchor its sentences to, so it cites confidently instead of paraphrasing a third party who already has those facts in their schema.

The Five Must-Have Schemas

In rough order of leverage for a B2B SaaS site:

  1. Organization — site-wide, in your root layout.
  2. FAQPage — on every page that already has Q&A.
  3. Product + Offer — if you have SKUs or pricing tiers.
  4. HowTo — on step-by-step tutorial content.
  5. Article / NewsArticle — on every blog post.

That's it. Five schemas cover 95% of real-world citation mechanics. The hundreds of other types in schema.org are either too narrow to matter or too generic to do anything beyond what the five above already provide.

Organization

Ship this once, site-wide, in your root document. The minimum viable version:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "CiteGEO",
  "url": "https://citegeo.ai",
  "logo": "https://citegeo.ai/logo.png",
  "sameAs": [
    "https://twitter.com/citegeo",
    "https://linkedin.com/company/citegeo",
    "https://github.com/citegeo"
  ],
  "description": "AI visibility intelligence platform...",
  "foundingDate": "2025-01-01"
}
</script>

sameAs is the unsung hero. It tells the model “these social/profile URLs all describe the same entity,” which is exactly the disambiguation signal the retrieval layer needs. Without sameAs, the model may resolve your brand name to a different company that happens to share it.

Add contactPoint if you have a support email, and founder if your founder is a notable entity in their own right (e.g. has a Wikipedia page or has founded other indexed companies). Skip address unless you're a local business — for SaaS it just adds noise.
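If your templates build the Organization block in code rather than by hand, the optional contactPoint can be added conditionally. The sketch below is one way to do that in Python; the function names and the support address in the usage note are hypothetical, not part of any CiteGEO tooling.

```python
import json


def build_organization(name, url, same_as, support_email=None):
    """Return an Organization JSON-LD dict; contactPoint is added only when an email is given."""
    org = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
    }
    if support_email:
        org["contactPoint"] = {
            "@type": "ContactPoint",
            "contactType": "customer support",
            "email": support_email,
        }
    return org


def to_script_tag(data):
    """Serialize to a JSON-LD script tag, escaping '<' so no string value can close the tag early."""
    payload = json.dumps(data, indent=2, ensure_ascii=False).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

Calling to_script_tag(build_organization("CiteGEO", "https://citegeo.ai", ["https://github.com/citegeo"], support_email="support@example.com")) yields a block you can drop into the root layout. The address here is a placeholder.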

FAQPage

This is the single highest-leverage schema for AI visibility. The mechanism is simple: LLMs extract Q&A pairs verbatim when generating answers, and your branded answer becomes the citation. Pages with clean FAQPage markup get cited 2–4× more often than equivalent pages without it.

Two rules. First: only mark up Q&A that already exists on the rendered page. Schema that describes content the user can't see violates Google's structured data guidelines and trains the model to distrust your markup. Second: write the answers as if they'll be quoted in isolation — because they will be. No “as mentioned above”, no pronouns that refer to off-page context.

A minimum-viable FAQPage block:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does CiteGEO score visibility?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "CiteGEO runs a weighted bundle of prompts across all five major engines and rolls citation, mention, and sentiment data into a 0–100 score."
      }
    }
  ]
}
</script>

One small tip that makes a disproportionate difference: keep each answer under 60 words. Shorter answers travel further — they fit inside model context windows during answer composition without getting truncated, and they're more likely to be quoted intact.
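Both FAQ rules that can be checked mechanically (the 60-word ceiling and the ban on context-dependent phrasing) fit in a small linter. This is a minimal sketch that assumes mainEntity is a list of Question objects, as in the block above; the phrase list is a hypothetical starting point, not an exhaustive one.

```python
MAX_WORDS = 60  # ceiling taken from the tip above: answers should stay under 60 words

# Hypothetical starter list of phrases that only make sense with surrounding context.
CONTEXT_PHRASES = ("as mentioned above", "as noted earlier", "see below")


def lint_faq(faq):
    """Return (question, problem) tuples for a parsed FAQPage dict."""
    problems = []
    for item in faq.get("mainEntity", []):
        question = item.get("name", "")
        answer = item.get("acceptedAnswer", {}).get("text", "")
        word_count = len(answer.split())
        if word_count >= MAX_WORDS:
            problems.append((question, f"answer has {word_count} words; keep it under {MAX_WORDS}"))
        lowered = answer.lower()
        for phrase in CONTEXT_PHRASES:
            if phrase in lowered:
                problems.append((question, f"answer contains '{phrase}'"))
    return problems
```

An empty list means every answer passed. Pronoun problems still need a human read; a substring check cannot tell whether "it" refers to something off-page.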

Product & Offer

If you have pricing tiers or SKUs, ship Product with a nested Offer per tier. The model uses this to answer “how much does X cost” and “does X have a free plan” queries — both extremely high commercial intent.

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "CiteGEO Pro",
  "description": "...",
  "brand": { "@type": "Brand", "name": "CiteGEO" },
  "offers": [
    {
      "@type": "Offer",
      "price": "53",
      "priceCurrency": "USD",
      "availability": "https://schema.org/InStock"
    }
  ]
}

Critical: price must match what your pricing page says, every day. If your pricing changes during a promotion and the schema doesn't, the model will cite the stale number for weeks — and your prospects will quote it back at you on sales calls. Tie this to your pricing CMS or hard-code with care.
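One way to catch price drift is to compare every Offer against whatever you treat as the pricing source of truth. The sketch below assumes that source can be exported as a dict mapping product name to a (price, currency) pair; that shape and the function name are assumptions for illustration.

```python
from decimal import Decimal, InvalidOperation


def stale_offers(product, pricing_source):
    """Return messages for offers that disagree with the pricing source of truth."""
    name = product.get("name")
    if name not in pricing_source:
        return [f"{name!r} is missing from the pricing source"]
    expected_price, expected_currency = pricing_source[name]

    offers = product.get("offers", [])
    if isinstance(offers, dict):  # schema.org also allows a single Offer object
        offers = [offers]

    problems = []
    for offer in offers:
        raw_price = offer.get("price")
        try:
            price = Decimal(str(raw_price))
        except InvalidOperation:
            problems.append(f"{name}: price {raw_price!r} is not a number")
            continue
        # Decimal comparison treats "53" and "53.00" as equal.
        if price != Decimal(str(expected_price)):
            problems.append(f"{name}: schema says {raw_price}, pricing source says {expected_price}")
        if offer.get("priceCurrency") != expected_currency:
            problems.append(f"{name}: currency {offer.get('priceCurrency')!r} != {expected_currency!r}")
    return problems
```

Run it in CI or on a schedule and fail loudly on any non-empty result, so a promotion that changes the pricing page without touching the schema is caught the same day.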

Skip aggregateRating unless you have real, verifiable ratings tied to a review platform. Fake or hand-counted ratings get flagged eventually and damage trust scores across the entire domain.

HowTo

For tutorial content. The model loves HowTo markup because it can lift the step list directly into an answer. Pages with marked-up HowTo steps consistently appear in “how do I X” answers — even on engines that don't officially surface HowTo rich results.

{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "How to set up llms.txt",
  "step": [
    {
      "@type": "HowToStep",
      "name": "Create the file",
      "text": "In your site root, create a file named llms.txt."
    },
    {
      "@type": "HowToStep",
      "name": "Add your site overview",
      "text": "Start with an H1 containing your site name..."
    }
  ]
}

One tactical note: each HowToStep name should read well in isolation. Models will quote a single step out of order when answering “what's step 3 of X” — so step names like “Continue” or “Next” are useless. Use descriptive verbs.
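A quick check for filler step names can run alongside your other template tests. This is a rough sketch; the list of vague names is a hypothetical starting point that you would extend for your own content.

```python
# Hypothetical list of names that carry no meaning when quoted alone.
VAGUE_STEP_NAMES = {"continue", "next", "step", "then", "finish", "done"}


def vague_steps(howto):
    """Return (position, name) for steps whose name says nothing on its own."""
    flagged = []
    for position, step in enumerate(howto.get("step", []), start=1):
        name = step.get("name", "").strip()
        normalized = name.lower().rstrip(".:!")
        if not normalized or normalized in VAGUE_STEP_NAMES or normalized.startswith("step "):
            flagged.append((position, name))
    return flagged
```

Names such as "Create the file" pass; "Next" and "Step 3" are flagged along with their position in the list.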

Article & NewsArticle

Every blog post gets Article. News pieces get NewsArticle. Treat four fields as required: headline, author, datePublished, and publisher. The optional but high-value fields are dateModified (keep it current whenever the content changes, because the model uses it for recency weighting), image (give a high-res hero image even if your design doesn't render one, since the model uses it as a relevance signal), and articleSection (helps with topic clustering across the site).
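For completeness, here is a sketch of an Article block assembled from the fields listed above. The builder function is hypothetical, and the author name and date in the usage lines are placeholders, not real values.

```python
import json


def build_article(headline, author_name, publisher_name, date_published,
                  date_modified=None, section=None, is_news=False):
    """Return an Article (or NewsArticle) JSON-LD dict with the four core fields."""
    article = {
        "@context": "https://schema.org",
        "@type": "NewsArticle" if is_news else "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
        "datePublished": date_published,
    }
    # Optional fields are emitted only when a value is supplied.
    if date_modified:
        article["dateModified"] = date_modified
    if section:
        article["articleSection"] = section
    return article


# Placeholder author and date, for illustration only.
example = build_article(
    "Schema Markup for AI: The 2026 Field Guide",
    "Jane Doe",
    "CiteGEO",
    "2026-01-15",
)
print(json.dumps(example, indent=2))
```

Leaving dateModified out when nothing has changed is deliberate: an absent field is safer than one that claims an edit that never happened.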

One common bug: setting datePublished in the future, or updating dateModified on every deploy regardless of whether the content changed. Models flag both as suspicious and may downweight the page's citation eligibility. We covered this in our citation mechanics breakdown.
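Both date bugs can be guarded against at build time. The sketch below moves dateModified only when a hash of the article body changes, and flags a datePublished that lies in the future. It assumes your build step can store the previous hash and date per post; the function names are hypothetical.

```python
import hashlib
from datetime import date


def next_date_modified(body, previous_hash, previous_modified, today):
    """Return (dateModified, body_hash); the date moves only when the body text changed."""
    current_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if current_hash == previous_hash:
        return previous_modified, current_hash
    return today.isoformat(), current_hash


def published_in_future(article, today):
    """True when datePublished (ISO 8601 date or datetime) is later than today."""
    published = date.fromisoformat(article["datePublished"][:10])
    return published > today
```

A deploy that leaves the body untouched returns the stored date unchanged, so dateModified stops tracking your release schedule.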

AI-Specific Extensions

A handful of lesser-used properties matter for LLM-targeted optimization in 2026:

  • creativeWorkStatus — set to "Published" for canonical content, "Draft" for work-in-progress. Helps the model avoid citing pages you're still iterating on.
  • copyrightHolder — for licensed or third-party content. Models increasingly respect this and avoid quoting work they shouldn't.
  • inLanguage — explicit language tag. Critical for multi-language sites; without it the model often confuses your translated pages.

What we don't recommend yet: speculative properties from extension proposals that haven't been formally adopted into schema.org. The model isn't parsing them and you may inadvertently invalidate the whole block.

Validation & Monitoring

Schema breaks silently. A trailing comma, an unescaped quote inside a string, or a structured-data update from your CMS that drops a required field — and your entire JSON-LD block becomes invisible. Worse: there's no error in your console, no warning in Lighthouse, no email from the model. The citation rate just drifts down.

The validation loop we recommend:

  1. Pre-deploy: run the Schema.org validator (or Google's Rich Results Test) on every PR that touches a template or layout that emits JSON-LD.
  2. Post-deploy: hit Google's URL Inspection on your top 5 commercial-intent pages weekly. Look at the structured data tab — if a schema disappears that was there last week, fix it.
  3. Continuous: CiteGEO's RAG-readiness audit checks JSON-LD coverage on every page it scans and grades the domain end-to-end. See the full 8-axis grading rubric here.
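The pre-deploy step can also include a local syntax check that needs no external service. The sketch below uses only the Python standard library to pull every JSON-LD script block out of rendered HTML and parse each one separately. It catches syntax breakage such as trailing commas and unescaped quotes; it does not validate schema.org vocabulary, so it complements the validators above rather than replacing them. The class and function names are our own.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # None means "not inside a JSON-LD script"

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def check_json_ld(html):
    """Return (parsed_blocks, errors); each block is parsed on its own."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    parsed, errors = [], []
    for index, raw in enumerate(extractor.blocks, start=1):
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            errors.append(f"block {index}: {exc.msg} at line {exc.lineno}, column {exc.colno}")
    return parsed, errors
```

Feed it the rendered HTML of each template in CI and fail the build when errors is non-empty or when an expected @type is missing from parsed.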

Anti-Patterns to Avoid

We see the same five mistakes across audits every week. None require expertise to fix — just discipline.

1. Marking up content the page doesn't actually show

The classic: a FAQPage block with eight questions that exist only in the schema, not on the rendered page. This is a policy violation. Models trained after late-2024 explicitly downweight domains that do this.
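A rough automated check for this anti-pattern is to confirm that every question in the schema also appears in the page's text. The sketch below is an approximation under stated assumptions: it works on static HTML, ignores text inside script, style, template, and noscript elements, and cannot see content hidden with CSS or injected by JavaScript. The names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside non-rendered elements."""

    HIDDEN = {"script", "style", "template", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.parts.append(data)


def normalize(text):
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return " ".join(text.split()).lower()


def questions_missing_from_page(faq, html):
    """Return question names from a FAQPage dict that do not appear in the page text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    page_text = normalize(" ".join(parser.parts))
    missing = []
    for item in faq.get("mainEntity", []):
        name = item.get("name", "")
        if normalize(name) not in page_text:
            missing.append(name)
    return missing
```

Any question it returns exists only in the markup and should either be added to the page or removed from the schema.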

2. Hand-counted aggregate ratings

“5 stars from 12 customer interviews” is not an aggregateRating. Use a verifiable third-party review platform (G2, Capterra, TrustRadius) or omit the property.

3. Stale datePublished

We've seen sites with every blog post claiming datePublished: 2018-01-01 because the CMS template was never updated. Models heavily weight recency for commercial-intent queries — old dates push your content out of the citation set even if the content is current.

4. WebSite + fake SearchAction

A SearchAction that points to a search endpoint your site doesn't actually have, or to a result page that returns 404 for most queries. This used to be a Google sitelinks search box trick — it's now a credibility downgrade. Remove unless your site truly supports the action.

5. Schemas that contradict on-page facts

Pricing in schema that doesn't match the pricing page. Product name in Product that doesn't match the H1. Founder name in Organization that doesn't match your About page. The model triangulates and discards both signals when they disagree.

Ship the five schemas above, validate weekly, and let the model do its job. Want CiteGEO to audit your schema coverage automatically? Create a free account — the RAG-readiness grade includes a JSON-LD coverage check on every URL we crawl.