
Methodology · v1.0

How we measure brand visibility in AI search.

The mapou Visibility Index (MVI) is a proprietary 0-100 score per brand, measured monthly across the five leading AI assistants. Below is the framework: what we measure, how the tiers work, what we disclose. The exact scoring weights, prompt set, and replication harness are reserved for paying clients.

  • Brands benchmarked: 740
  • Segments covered: 36
  • AI engines tested: 5
  • Atomic verdicts/month: 74,000

The framework

For each brand, we run a fixed set of buyer-intent prompts against ChatGPT, Perplexity, Gemini, Claude, and Grok in parallel. Every (prompt × engine) verdict is classified in one of three ways:

  • Cited. The brand was named as a recommended option.
  • Mentioned. The brand appears but is not recommended as a top choice.
  • Invisible. The brand does not appear in the response.

Verdicts are aggregated through a weighted formula across four buyer-intent dimensions to produce a single MVI score (0-100) per brand. The full leaderboard, per-engine breakdowns, and category top-cited brands are published openly at /research.
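For concreteness, here is a minimal sketch of that aggregation step in Python. The equal phase weights, the partial credit given to a mention, and the field names are placeholder assumptions of ours; the real weights and prompt set are proprietary.

```python
from collections import defaultdict

# Placeholder phase weights -- the real MVI weights are proprietary.
PHASE_WEIGHTS = {"discovery": 0.25, "filtered_discovery": 0.25,
                 "comparison": 0.25, "evaluation": 0.25}

# Illustrative verdict values: a citation counts fully, a mention partially.
VERDICT_VALUES = {"cited": 1.0, "mentioned": 0.5, "invisible": 0.0}

def mvi_score(verdicts):
    """verdicts: list of dicts like
    {"phase": "discovery", "engine": "gemini", "verdict": "cited"}.
    Returns an illustrative 0-100 score: the mean verdict value per phase,
    combined with the phase weights."""
    by_phase = defaultdict(list)
    for v in verdicts:
        by_phase[v["phase"]].append(VERDICT_VALUES[v["verdict"]])
    score = sum(PHASE_WEIGHTS[phase] * (sum(vals) / len(vals))
                for phase, vals in by_phase.items())
    return round(100 * score, 1)
```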

Engines tested

We test the five AI assistants real shoppers use. Engines are weighted equally in the index regardless of market share, to avoid the methodological gymnastics of weighting by share-of-traffic (which itself shifts month to month).

  • ChatGPT (OpenAI gpt-4o-mini), monthly
  • Perplexity (sonar, web-grounded by default), quarterly
  • Gemini (gemini-2.5-flash, Google), monthly
  • Claude (claude-haiku-4-5, Anthropic), monthly
  • Grok (grok-3-mini, xAI, web-grounded), monthly

Cadence policy. Four engines (ChatGPT, Gemini, Claude, Grok) are tested every month with the full canonical prompt set. Perplexity is tested quarterly because sonar's per-search pricing dominates the run cost at monthly cadence, and the engine is web-grounded enough that a quarterly sample captures meaningful change without chasing every weekly drift. Quarterly Perplexity runs are date-stamped, archived separately, and the resulting MVI score for each segment carries a freshness indicator showing the most recent run that includes each engine. We disclose this trade-off explicitly rather than running the full 5-engine panel monthly and absorbing the cost into the line items, because transparency about cadence is more useful than smoothing it over.

Microsoft Copilot and Google AI Overviews are not yet included; both lack clean public APIs. We're evaluating SerpAPI integration as a future addition.

The four dimensions of AI visibility

The MVI is composed across four buyer-intent phases. Each phase tests a different relationship to brand visibility. The exact weighting is part of mapou's proprietary methodology.

Discovery

How shoppers find brands when they have not yet filtered by anything. Open category questions, top-brands lists, current-year recommendations, popular brands.

Why it matters: Rewards household-name recall. Where most established brand equity lives. The hardest phase to move into for brands that are not yet cited.

Filtered discovery

How shoppers find brands when they have already filtered: by budget, persona, use case, or values. Premium options, beginner-friendly, sustainable, professional-grade, and similar.

Why it matters: Rewards specialization. Niche brands often outperform leaders here. Where indie and challenger brands win share.

Comparison

How shoppers evaluate between named options. Head-to-head comparisons, alternatives to default-choice brands, best-for-an-attribute prompts, premium-vs-value tradeoffs.

Why it matters: Rewards brand authority in evaluative contexts. Where being cited as a viable alternative matters as much as being the leader.

Evaluation

How shoppers decide. Decision criteria, reliability and longevity, what to avoid, general buying advice.

Why it matters: Rewards being cited as a category authority. AI engines lean heavily on brands they associate with category expertise here.

Behavioral stage framework

mapou describes AI visibility as a sequence of three behavioral stages plus a pre-encounter state. The MVI score maps to where a brand sits in this sequence right now.

Stage 3, Default choice

MVI 75+

The brand is the go-to recommendation in AI answers within its segment, often appearing first or most consistently. The risk now is competitor catch-up; the next step is AI commerce integration (ChatGPT Shopping, Perplexity Shopping).

Stage 2, Repeat use

MVI 50–74

The brand is cited regularly enough across prompts and engines that it feels familiar and reliably present, but not yet the default. Focus here is targeted citation strategy and attribution that ties AI visibility to revenue.

Stage 1, First encounter

MVI 25–49

The brand is discovered and cited occasionally in AI answers, but not consistently enough to feel reliably present. Most fixable stage; gaps usually cluster around catalog architecture and third-party citation density.

Not yet cited

MVI 0–24

AI does not yet cite the brand for buyer-intent queries in this segment. Almost every brand starts here. The work to move into first encounter is well-defined and starts with an AI Visibility Audit.
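Expressed as code, the bands above are a simple threshold lookup; a trivial sketch (the function name is ours):

```python
def behavioral_stage(mvi: float) -> str:
    """Map an MVI score (0-100) to the behavioral stage bands described above."""
    if mvi >= 75:
        return "Stage 3: Default choice"
    if mvi >= 50:
        return "Stage 2: Repeat use"
    if mvi >= 25:
        return "Stage 1: First encounter"
    return "Not yet cited"
```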

Derived insights, beyond the score

The MVI score answers “how visible is this brand?”, but most strategic questions sit a layer up. Five derived metrics are computed from the same data, surfaced on every per-segment report, and aggregated across all segments on the State of AI Search page:

  • Cross-engine divergence (0–1). One minus the mean pairwise Spearman rank correlation across the five engines. Near 0 = all engines rank brands similarly; near 1 = the engines act as independent channels. Tells you whether AI search is one channel or five for your category.
  • Effective number of brands. The headline concentration metric, computed as 1 / Σ(MVI share²), the inverse Simpson index (equivalently a Hill number of order 2; equivalently 1 / HHI with shares on a 0–1 basis, or 10,000 / HHI on the conventional percentage-point scale). Reads as “AI effectively recommends X brands out of N tracked.” Self-normalizing for sample size, so a 10-brand segment is directly comparable to a 20-brand one. Reported alongside the top-2 share for narrative clarity. Tells challenger brands whether the segment is dislodge-incumbents or show-up-at-all territory. (Gini coefficient is also computed and stored for back-compat; not surfaced in the UI because it carries an income-inequality analogy that doesn't map cleanly to AI-search visibility.)
  • Discovery → Evaluation leakage (per brand, −1 to +1). Citation rate in Discovery prompts minus citation rate in Evaluation prompts. Positive value = the brand is awareness-rich and conversion-poor in AI answers. Surfaces brands AI knows but won't recommend at the buying moment.
  • Mention-but-not-cited gap (per brand, 0–1). Share of total visibility that came as mention only. AI uses the brand as context but rarely surfaces it as the answer. Recognition without recommendation.
  • Kingmaker engine per phase. For each segment and funnel phase, the engine where the citation-rate gap between most-cited and least-cited brand is widest. That's where positioning matters most. Win that engine, win that phase.

All five run at render time on the published research data; the numbers move when AI behavior moves, with no recompute step needed.
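The definitions above are concrete enough to sketch directly. Here is a minimal Python sketch of four of the five (the data shapes, argument names, and use of SciPy's spearmanr are our assumptions, not mapou's implementation):

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def cross_engine_divergence(scores_by_engine):
    """scores_by_engine: dict engine -> list of per-brand scores, all lists
    in the same brand order. Returns 1 minus the mean pairwise Spearman
    rank correlation: near 0, engines rank brands alike; near 1, they act
    as independent channels."""
    rhos = [spearmanr(scores_by_engine[a], scores_by_engine[b])[0]
            for a, b in combinations(scores_by_engine, 2)]
    return 1.0 - float(np.mean(rhos))

def effective_number_of_brands(mvi_scores):
    """Inverse Simpson index on MVI shares: 1 / sum(share^2). Reads as
    'AI effectively recommends X brands out of N tracked'."""
    shares = np.asarray(mvi_scores, dtype=float)
    shares = shares / shares.sum()
    return 1.0 / float(np.sum(shares**2))

def discovery_evaluation_leakage(discovery_rate, evaluation_rate):
    """Per-brand: Discovery citation rate minus Evaluation citation rate.
    Positive = awareness-rich but conversion-poor in AI answers."""
    return discovery_rate - evaluation_rate

def kingmaker_engine(citation_rates):
    """citation_rates: dict engine -> dict brand -> citation rate for one
    segment and funnel phase. Returns the engine with the widest gap
    between its most-cited and least-cited brand."""
    return max(citation_rates,
               key=lambda e: max(citation_rates[e].values())
                             - min(citation_rates[e].values()))
```

The mention-but-not-cited gap follows the same pattern: a brand's mention-only visibility divided by its total visibility.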

Statistical rigor

Each brand has 100 atomic observations per monthly run. With sample sizes this large, we can compute meaningful confidence intervals on the underlying citation rate.

We use Wilson 95% confidence intervals, which are narrower and more accurate than normal-approximation intervals for binary outcomes. Every brand on the leaderboard shows MVI ± CI bounds. Brands with similar MVI but non-overlapping CIs are statistically distinguishable.

Engine agreement: we report what fraction of (brand, prompt) pairs all engines agreed on. Higher agreement = higher confidence in the verdict.
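Both quantities are easy to reproduce from the verdict data. A minimal sketch, assuming verdicts are keyed by (brand, prompt) per engine (the data shape and names are our assumption):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson 95% confidence interval for a binomial citation rate.
    Narrower and better behaved than the normal approximation near 0 or 1."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def engine_agreement(verdicts_by_engine):
    """verdicts_by_engine: dict engine -> dict mapping (brand, prompt) to a
    verdict string. Returns the fraction of (brand, prompt) pairs on which
    every engine returned the same verdict."""
    keys = set.intersection(*(set(v) for v in verdicts_by_engine.values()))
    agreed = sum(
        len({v[k] for v in verdicts_by_engine.values()}) == 1 for k in keys
    )
    return agreed / len(keys) if keys else 0.0
```

At n=100 verdicts and a 50% citation rate the Wilson interval is roughly 40-60%; at n=20 prompts per cell it widens to roughly 30-70%, the per-cell precision caveat noted under Statistical caveats below.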

Replicability

The MVI is built on a fixed prompt set generated once per category and reused every monthly run. The same brands are tested every month. Methodology version is locked (currently v1.0); changing the prompt set, weights, or thresholds requires a version bump.

All results are persisted with run-ID timestamps and immutable archives. Brand-level MVI history is queryable from the day this methodology was first run.

Re-running the same prompts every month means MVI deltas are paired comparisons, not noise. A brand moving from MVI 42 → 51 month-over-month reflects real signal change, not random LLM variability.
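Because the prompt keys are identical from run to run, a month-over-month delta is computed on matched pairs rather than two independent samples; a minimal sketch (the verdict values and data shape are illustrative assumptions, consistent with the aggregation sketch above):

```python
def paired_mvi_delta(run_prev, run_curr):
    """run_prev, run_curr: dicts mapping (prompt_id, engine) -> verdict value
    (e.g. 1.0 cited / 0.5 mentioned / 0.0 invisible) for one brand in two
    consecutive monthly runs. Identical keys make the delta a paired
    comparison: each prompt is compared with itself a month later."""
    keys = run_prev.keys() & run_curr.keys()
    diffs = [run_curr[k] - run_prev[k] for k in keys]
    return 100 * sum(diffs) / len(diffs)  # signed delta, in MVI-style points
```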

Pre-registered claims

Every monthly report ships with a small set of falsifiable, dated predictions about what the next run will show. Each prediction names a metric, a threshold, and a direction, and is graded green or red against the data when that next run lands. Predictions are committed before the data exists rather than chosen post hoc; this is the same discipline empirical science papers use to prevent narrative-fitting.
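The shape of a claim is simple enough to write down. A hypothetical sketch of the record and its grading rule (field names and the green/red rule are our reading of the description above, not mapou's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """One pre-registered claim, committed and dated before the target run."""
    metric: str        # e.g. "effective_number_of_brands"
    segment: str
    direction: str     # "above" or "below"
    threshold: float
    committed_on: str  # ISO date, earlier than the target run

    def grade(self, observed: float) -> str:
        """Graded against the next run's data: green if the claim held, red if not."""
        held = (observed > self.threshold if self.direction == "above"
                else observed < self.threshold)
        return "green" if held else "red"
```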

The current cycle's active claims and the running track record are public on the State of AI Search page. Segment-scoped predictions also surface on the relevant per-segment report.

Why this matters: a methodology that can only be confirmed retroactively is not a methodology; it is a storytelling apparatus. Pre-registration is the difference between an analysis that could have come out differently and one that could not.

What we don't claim

Honest disclosure of what the measurement does not measure:

  • Snapshot, not continuous. Each run is a moment in time. AI assistants are stochastic; results vary minute-to-minute.
  • Region-specific. All API calls originate from US East. Brand visibility likely differs in other markets.
  • Coverage gaps. Microsoft Copilot, Google AI Overviews, and voice assistants are not yet tested.
  • Conflict of interest disclosed. mapou is a consultancy that benefits from AI search adoption. Scores are not adjusted for client or non-client status.
  • One snapshot is not a trend. Single-month scores are diagnostic. Time-series gets meaningful at month 3+.
  • Calibrated to high-intent queries. All 20 canonical prompts target commerce decisions across Discovery, Filtered Discovery, Comparison, and Evaluation. Findings should not be generalized to navigational or broad informational search behavior, which still rewards traditional SEO.
  • Upstream influence, not downstream performance. MVI measures presence in AI-mediated discovery, analogous to impression share in early paid search. It is not a proxy for conversion lift, branded-search uplift, or assisted revenue. Tying citation rate to outcomes requires first-party clickstream and incrementality testing, which we ship as paid engagements.
  • Hallucination not yet scored. Every (brand × prompt × engine) verdict is currently classified as cited / mentioned / invisible. We do not yet flag when an engine misattributes a product, confuses categories, or invents a brand. A failure-mode classification is on the v1.2 roadmap.
  • API surface, not personalized chat. The canonical MVI run uses stateless engine APIs: no account, no chat history, no Memory, no custom instructions, no browsing personalization. Same prompt and same model produce the same distribution of answers regardless of who runs them, within sampling error. We do not capture how chatgpt.com, claude.ai, gemini.google.com, or perplexity.ai behave for a logged-in user with personalization signals enabled. Those consumer surfaces can vary per user. We measured the size of that effect via API system-prompt variants (no consumer-app scraping): leaders flip in 5 of 5 segments tested under at least one persona vs baseline, mean top-3 overlap drops to 67%. See Finding 08 on the macro page.
  • Shopping-assistant framing, by design. The MVI baseline uses a shopping-assistant system prompt ("answer the user's question naturally and recommend specific brands and products where relevant"). This is the right scope for measuring commerce signals, but it is not zero-frame neutral. We isolated the framing effect by comparing it to a minimal "answer the user's question concisely" prompt across 5 segments: shopping framing inflates citation rates by 20-40pp on top brands and changes the leader in 4 of 5 segments. Baseline measures "what AI shopping assistants recommend," not "what AI engines say in raw form." Both are valid; the MVI deliberately measures the former.

Statistical caveats

Some claims in this methodology and on the research pages are well-powered inferential statements; others are exact descriptive observations of our specific panel. Disclosing the difference, layer by layer:

  • Per-brand MVI scores have Wilson 95% confidence intervals. Computed and displayed on every leaderboard. With n=100 verdicts per brand per monthly run (20 prompts × 5 engines), differences of 5-10 percentage points are well outside sampling noise.
  • Per-engine personalization sensitivity (Finding 09) has bootstrap 95% CIs. Each engine has n=270 (segment × persona) overlap observations across the panel. The 8-percentage-point spread between the most stable engine (Claude) and the most sensitive (Gemini) clears every pairwise confidence interval. The ranking is statistically real, not artifact.
  • Per-cell citation rates (single segment × single engine × single persona) are much less precise. At n=20 prompts per cell, a 50% citation rate has a Wilson 95% CI of roughly 30-70%. Anywhere we cite specific small differences between cells (for example, "ChatGPT 70% vs Grok 70%"), readers should treat the comparison as suggestive within ±10pp of noise. Large effects (citation rate 0% baseline → 90% under one persona) survive this noise.
  • Aggregate counts on n=27 categories are exact for our panel, descriptive not inferential. When we say "6 of 27 categories have 3+ different baseline leaders across the 5 engines," that is an exact count of our specific 27-category panel. As a generalization to all consumer categories, the binomial 95% CI on 6/27 is roughly 11-41%. The directional finding holds; precise extrapolation does not.
  • "Leader held" / "unmovable leader" claims are descriptive at n=10 personas. When a brand holds #1 across all 10 buyer personas tested, that is literally true at n=10 — but you cannot statistically distinguish "always" from "9 times out of 10" with this sample size. The accurate phrasing is "held #1 across all 10 buyer personas tested," not "is unmovable." We are tightening any "unmovable" language in marketing copy to match this distinction.
  • Cross-engine baseline disagreement is descriptive of the prompts we ran, not the population. The 6 categories with 3+ distinct baseline leaders are exact for our 20 canonical prompts per segment. A different (also legitimate) prompt set could produce different counts. We ship the same 20 prompts every month so time-series readings are comparable to themselves.

In short: directional claims and large-effect findings are well-supported. Precise small-effect claims and aggregate proportions on n=27 should be read as descriptive observations of the panel, not population parameters. Where we use phrases like "always" or "every category" we mean "in every case we tested," with the sample size implied by the surrounding context.

Author

Methodology designed by Arvin Nundloll, formerly Director of Strategy and Business Development at Comcast Advertising, with prior roles at NBCUniversal, Amazon, and DIRECTV. MBA, William & Mary. Based in New York City. About →

For clients

The full methodology, prompt set, and custom benchmarks.

The exact phase weights, the full canonical prompt set, the replication harness, and a custom MVI benchmark for your specific brand and competitive set are part of mapou's engagement deliverables. We also build private dashboards that track your MVI weekly with stability subsampling, run targeted audits at the prompt-template level, and ship a punch list of fixes.

  • Custom prompt set tailored to your specific category and competitive set
  • Exact MVI scoring weights + your brand's phase-by-phase decomposition
  • Weekly stability subsampling for variance-aware reporting
  • Audit-ready methodology documentation for board / investor reporting
  • Private dashboard with month-over-month deltas and prompt-level diagnostics