How we measure brand visibility in AI search.
mapou measures how AI assistants alter commercial decisions inside high-intent consumer categories. The Visibility Index is the score we publish for that measurement. The monthly research reports are the proof.
The Visibility Index is a proprietary 0-100 score per brand, measured monthly across the five leading AI assistants. Below is the framework: what we measure, how the tiers work, what we disclose. The exact scoring weights, prompt set, and replication harness are reserved for paying clients.
Brands benchmarked
500
Segments covered
25
AI engines tested
5
Atomic verdicts/month
50,000
The framework
For each brand, we run a fixed set of buyer-intent prompts against ChatGPT, Perplexity, Gemini, Claude, and Grok in parallel. Every (prompt × engine) verdict is classified one of three ways:
- Cited. The brand was named as a recommended option.
- Mentioned. The brand appears but is not recommended as a top choice.
- Invisible. The brand does not appear in the response.
Verdicts are aggregated through a weighted formula across four buyer-intent dimensions to produce a single MVI score (0-100) per brand. The full leaderboard, per-engine breakdowns, and category top-cited brands are published openly at /research.
Definitions
Four terms recur throughout the research and on this page. Each links to the underlying evidence from the panel. The terms are not aspirational; they map to measured behavior.
GEO, Generative Engine Optimization
GEO is the practice of optimizing brand content, entity data, and citation sources so generative AI assistants like ChatGPT, Perplexity, Gemini, Claude, and Grok recommend the brand in their answers to buyer-intent prompts. It overlaps with SEO and AEO but is built around the synthesized-answer format AI assistants return, not the ranked-list format Google returns.
Generative AI assistants do not return ranked pages. They return synthesized answers built from extracted entity facts, citation sources, and the model's training-time exposure to the brand. GEO targets all three. Entity data shapes how the model represents you. Citation sources shape what the model says about you. Content rewrites at the buyer-intent level shape whether you appear in comparison and evaluation answers, where the actual purchase decision happens.
What GEO does not include. GEO does not cover ChatGPT's web-scraped content (that is page-level optimization, indistinguishable from SEO). It does not cover Google AI Overviews specifically (a separate signal pulled from Google's ranking, not from a generative model). It does not cover paid placement inside AI assistants. GEO is the organic-citation problem inside generative assistants.
Evidence: AI engines disagree on the same buyer prompt (Finding 01) · Different engines decide different funnel phases (Finding 05) · 47% of brand × prompt cells are invisible to all 5 engines (Finding 07)
FAQ, GEO
Is GEO the same as SEO?
No. SEO ranks pages on a search results list. GEO gets your brand cited inside an AI assistant's synthesized answer. The signals that win SEO are not the signals that win GEO.
Which AI engines does GEO target?
The five major consumer AI assistants: ChatGPT, Perplexity, Gemini, Claude, and Grok. mapou measures all five every month because their training data, retrieval policies, and ranking behavior diverge. Mean cross-engine agreement on rankings is below the threshold where a single-engine strategy is sufficient.
How is GEO different from AEO?
AEO is broader and older, applied to voice assistants, featured snippets, and any answer-engine surface. GEO specifically targets generative AI assistants. AEO techniques (FAQ schema, Speakable schema, structured data) help GEO too, but GEO adds entity-architecture work AEO does not require.
What does a GEO program actually do?
Three workstreams: (1) entity architecture, structuring brand and product data so AI engines parse it cleanly, (2) citation source diversification, building presence in third-party sources AI engines retrieve from, (3) content rewrites for buyer-intent comparison and evaluation prompts. Outcome measurement is per-engine MVI tracked monthly.
AEO, Answer Engine Optimization
AEO is the practice of optimizing structured data, FAQ schema, Speakable schema, and on-page content so answer engines (voice assistants, AI assistants, featured snippet engines) surface a brand or page as the direct answer to a question.
AEO targets the answer surface, the moment a query gets a single direct response instead of a ranked list. That happens on voice assistants reading aloud, on Google's featured snippets, on AI assistants synthesizing an answer, and on schema-backed answer cards. The optimization techniques converge: structured data so the engine knows what is answerable, FAQ pairs so question-shape queries get direct extraction, Speakable schema so voice engines know which text to read, concise factual content so the answer is short enough to surface.
Why AEO matters now. Answer engines are eating the click. The buyer who asked Siri, Alexa, or ChatGPT a question and got a direct answer never visits a results page. SEO competition for those queries is over before it begins because there is no list to rank in. AEO is how brands stay in front of buyers who do not click.
Evidence: Perplexity has the highest mention-share in our panel (32%) · Speakable cssSelector pattern in this site's schema
FAQ, AEO
Is AEO the same as SEO?
No. SEO targets ranked-list results pages. AEO targets the surface that returns a single direct answer (voice assistant, featured snippet, AI synthesized answer). AEO weights structured data, FAQ schema, Speakable schema, and concise factual answers far more heavily because answer engines extract from structure, not crawl-and-rank.
Is AEO the same as GEO?
AEO is broader and older. AEO covers any answer engine: voice assistants, featured snippets, AI assistants. GEO is specifically the AI assistant subset with extra emphasis on entity architecture and citation sources. AEO techniques help GEO too, but GEO requires more.
Which schema types matter most for AEO?
FAQPage schema for question-answer surfaces. Speakable schema for voice-readable content. Article and TechArticle for editorial content. Dataset for research/benchmark surfaces. BreadcrumbList for site structure. HowTo for procedural content.
What does AEO content actually look like?
Direct, concise, schema-backed answers to the questions buyers ask. Each answer 40-80 words, written in plain declarative voice, paired with the question as a header. FAQPage wraps every Q-A pair. Speakable schema marks the answer-readable region.
Can AEO be measured?
Partially. Featured snippet capture is measurable in SEO tools. Voice assistant readback is hard to track at scale. AI assistant citations are measured by mapou's MVI across ChatGPT, Perplexity, Gemini, Claude, and Grok every month.
MVI, mapou Visibility Index
MVI is a 0-100 proprietary score per brand combining citation rate across 5 AI engines and 4 buyer-intent funnel phases (Discovery, Filtered Discovery, Comparison, Evaluation), with Wilson 95% confidence intervals. It is the single number that measures AI search visibility.
The score is normalized to 0-100 within each segment so a 73 in skincare means the same thing as a 73 in luxury watches. Cross-segment comparison is the headline use case. The thresholds map to behavioral consequence (Default choice, Repeat use, First encounter, Not yet cited), not arbitrary cutoffs. See the behavioral-stage framework section below for the tier-by-tier breakdown.
Cadence. The same canonical buyer-intent prompts run on the 1st of each month against ChatGPT, Gemini, and Claude; Perplexity and Grok join in the quarterly full-panel run. The panel covers 500 brands across 25 segments. Snapshots are date-stamped and archived to git so before/after comparisons are reproducible. Methodology version is bumped only when the prompt set or scoring formula changes.
Where to see MVI: Free /check tool (60-second MVI for any brand) · MVIP roster (every brand on the panel) · Per-segment leaderboards
FAQ, MVI
How is MVI scored?
Phase-weighted citation rate across 5 AI engines × 20 fixed buyer-intent prompts per segment. Discovery prompts (top of funnel) get 30% weight, Filtered Discovery 25%, Comparison 25%, Evaluation 20%. The weighted average becomes a 0-100 score. Wilson 95% confidence intervals are computed alongside so two brands with similar MVI can be statistically distinguished.
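A minimal sketch of the phase-weighted calculation described above, using the phase weights quoted in this answer. It rests on several assumptions that are not confirmed by the published framework: only "cited" verdicts count toward the citation rate, verdicts from all engines are pooled with equal weight, and the per-segment normalization step is left out. The names PHASE_WEIGHTS and mvi_score are hypothetical.

```python
from collections import defaultdict

# Phase weights as quoted in the answer above.
PHASE_WEIGHTS = {
    "discovery": 0.30,
    "filtered_discovery": 0.25,
    "comparison": 0.25,
    "evaluation": 0.20,
}


def mvi_score(verdicts):
    """Return a 0-100 score from (phase, verdict) pairs for one brand.

    Assumption: only "cited" counts toward the citation rate; "mentioned"
    and "invisible" are both treated as misses. A phase with no verdicts
    contributes zero.
    """
    cited = defaultdict(int)
    total = defaultdict(int)
    for phase, verdict in verdicts:
        total[phase] += 1
        if verdict == "cited":
            cited[phase] += 1
    score = 0.0
    for phase, weight in PHASE_WEIGHTS.items():
        if total[phase]:
            score += weight * cited[phase] / total[phase]
    return 100.0 * score
```

Under these assumptions, a brand cited in 40% of Discovery, 20% of Filtered Discovery, 80% of Comparison, and 100% of Evaluation verdicts scores 57.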
What are the MVI tiers?
Default choice (75-100): the brand AI cites by default. Repeat use (50-74): cited regularly enough to feel reliably present. First encounter (25-49): discovered and cited occasionally. Not yet cited (0-24): AI does not yet cite the brand for buyer-intent queries.
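The tier bands above can be expressed as a small lookup. This is an illustrative sketch; the function name mvi_tier is hypothetical, and the handling of fractional scores (for example, 74.5 falling into the lower band) is an assumption, since the bands are published as whole numbers.

```python
def mvi_tier(score):
    """Map a 0-100 MVI score to the tier names listed above."""
    if not 0 <= score <= 100:
        raise ValueError("MVI must be between 0 and 100")
    if score >= 75:
        return "Default choice"
    if score >= 50:
        return "Repeat use"
    if score >= 25:
        return "First encounter"
    return "Not yet cited"
```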
How often is MVI refreshed?
Monthly cadence on the top three engines (ChatGPT, Gemini, Claude); full 5-engine panel including Perplexity Sonar and Grok refreshed quarterly. The same 500 canonical buyer-intent prompts run against the panel on the 1st of each month. Methodology version is bumped only when the prompt set or scoring formula changes.
Is MVI comparable across categories?
Yes. The score is normalized to 0-100 within each segment so a 73 in skincare means the same thing as a 73 in luxury watches. Cross-segment comparison is the headline use case.
Where can I see MVI for my brand?
Free check at mapou.ai/check returns your visibility across ChatGPT, Gemini, Claude, and Grok in 60 seconds. The MVIP roster lists every brand on the panel with current MVI, tier, and embeddable badge.
Persona-Tuned MVI
Persona-Tuned MVI is the mapou Visibility Index weighted by a brand's actual buyer mix instead of the baseline persona. It is the AI search visibility number that matches who is actually asking, not the average of everyone.
Baseline MVI averages across a generic shopping-frame buyer. In our panel the leader brand changes under at least one buyer persona in 20 of 22 categories tested. The baseline number is real but represents only one slice of the buyer population. For categories where rankings move under persona, a brand needs the persona-weighted number to make budget decisions tied to actual buyer mix.
We measure MVI per brand under each of 12 personas (baseline plus 11 named buyer signals: budget, premium, professional, first-time, values-driven, gift-giver, time-pressed parent, brand-skeptic, gift-receiver, subscription-buyer, and two extended). The brand's actual buyer-mix percentages weight those persona-specific MVIs into a single score that matches the brand's real audience.
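One way to read the weighting step is as a plain weighted average of persona-specific MVIs. The sketch below assumes exactly that (a linear average, with the buyer mix normalized to sum to 1); the production formula is not published, and the name persona_tuned_mvi and the example numbers are hypothetical.

```python
def persona_tuned_mvi(persona_mvi, buyer_mix):
    """Weight persona-specific MVIs by a brand's buyer mix.

    persona_mvi: persona name -> MVI (0-100) measured under that persona.
    buyer_mix:   persona name -> share of buyers; normalized here, so
                 percentages or fractions both work.
    Assumption: the combined score is a linear weighted average.
    """
    total = sum(buyer_mix.values())
    if total <= 0:
        raise ValueError("buyer mix must have positive total weight")
    missing = [p for p in buyer_mix if p not in persona_mvi]
    if missing:
        raise KeyError(f"no persona MVI for: {missing}")
    return sum(persona_mvi[p] * share / total for p, share in buyer_mix.items())


# Illustrative inputs only; these are not panel data.
example = persona_tuned_mvi(
    {"premium": 80, "professional": 60, "values-driven": 40},
    {"premium": 60, "professional": 30, "values-driven": 10},
)
```

With these made-up inputs the weighted score is 70, higher than the unweighted mean of 60, because the premium persona carries most of the mix.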
Where to see it: Free Persona Explorer (rankings under each persona) · Finding 08: leaders flip in 20 of 22 segments · Persona-Tuned MVI with custom buyer-mix weights ships in app.mapou.ai for paid customers.
FAQ, Persona-Tuned MVI
Why is the baseline MVI not enough?
Baseline MVI averages across a generic shopping-frame buyer. In our panel the leader brand changes under at least one buyer persona in 20 of 22 categories tested. For categories where rankings move under persona, a brand needs the persona-weighted number.
How is buyer mix translated into a Persona-Tuned MVI?
We measure MVI per brand under each of 12 personas (baseline plus 11 named buyer signals). The brand's actual buyer-mix percentages weight those persona-specific MVIs into a single score.
Is buyer mix the same as customer demographics?
Related, not identical. Personas in our panel are buyer-intent signals (price sensitivity, professional use, first-time vs repeat) rather than demographic profiles. The two correlate but the persona signal is what AI engines actually receive in the prompt.
How granular does buyer-mix data need to be?
Approximate is usable. A brand that knows it is 60% premium / 30% professional / 10% values-driven gets a meaningfully different Persona-Tuned MVI than a brand that is 60% budget / 40% first-time. The score is sensitive to the mix because rankings flip under buyer signal.
Where does Persona-Tuned MVI live?
Persona-Tuned MVI is a paid mapou feature. The free Persona Explorer at /research/personas shows how rankings shift under each persona. The custom Persona-Tuned MVI with a brand's real buyer-mix weights ships via app.mapou.ai for paid customers.
Engines tested
We test the five AI assistants real shoppers use. Engines are weighted equally in the index regardless of market share, to avoid the methodological gymnastics of weighting by share-of-traffic (which itself shifts month to month).
- ChatGPT (OpenAI gpt-5.4-mini, routed via Perplexity Agent API), monthly
- Perplexity (sonar, native Sonar API with bundled web grounding), quarterly
- Gemini (Google gemini-3-flash-preview, routed via Perplexity Agent API), monthly
- Claude (Anthropic claude-haiku-4-5, routed via Perplexity Agent API), monthly
- Grok (xAI grok-4-1-fast-non-reasoning, routed via Perplexity Agent API), quarterly
Cadence policy. Three engines (ChatGPT, Gemini, Claude) are tested every month with the full canonical prompt set. Perplexity Sonar and Grok are tested quarterly because each per-call web search costs $0.005 on the Perplexity Agent API and stacking five engines at monthly cadence pushed recurring cost outside the budget for a self-funded research program. Quarterly Perplexity and Grok runs are date-stamped, archived separately, and the resulting MVI score for each segment carries a freshness indicator showing the most-recent run including each engine. We disclose this trade-off explicitly rather than running the full 5-engine panel monthly and absorbing the cost into the line items, because transparency about cadence is more useful than smoothing it over.
Retrieval architecture. Four of the five engine cells (ChatGPT, Gemini, Claude, Grok) are routed through the Perplexity Agent API with the web_search tool enabled, capped at one search invocation per response. This gives every monthly cell live web grounding from the same continuously-refreshed Perplexity index. The Sonar cell uses Perplexity's native Sonar API (web grounding bundled into token cost). This setup differs from native consumer behavior at chatgpt.com, claude.ai, and gemini.google.com (which use Bing-backed, Anthropic-internal, and Google-Search-backed retrieval respectively). The methodological gain is uniform retrieval across cells; the trade-off is that the routed retrieval is not the exact retrieval a real consumer sees on each vendor's native app. Disclosed for completeness.
Analyzer. Brand citation extraction from each engine's response uses gpt-5.4-nano via direct OpenAI API with strict JSON schema for verdict classification (cited / mentioned / invisible) and competitor-name extraction.
Model fallback. Each routed engine cell uses Perplexity's Model Fallback feature with a primary + secondary model chain. If the primary is temporarily unavailable (e.g., gemini-3-flash-preview gets deprecated mid-quarter), the request auto-fails-over to the secondary at no additional fee. Each request is billed at the rate of the model that actually serves it. Telemetry captures which model served each call so we can detect when fallbacks fire and disclose any model-mix shift on time-series charts.
Search recency. Every routed call passes search_recency_filter: "year" to the web_search tool. This restricts retrieval to content from the last 12 months, which removes long-tail outdated reviews while preserving the live-web reality our customers care about. As a side effect, it narrows the search target and reduces searches per call.
Search budget. Every routed call instructs the model to use exactly one web_search invocation per response (enforced via system prompt instruction, not API hard limit). Per-run telemetry confirms compliance: the average searches-per-call across the 25-segment monthly canonical is published with each run in data/research/runs/[run_id]/_telemetry.json.
Microsoft Copilot and Google AI Overviews are not yet included; both lack clean public APIs. We're evaluating SerpAPI integration as a future addition.
Why these specific models
Each engine cell uses the cost-efficient mini/haiku/flash variant of its vendor's model lineup, not the flagship. We chose this to keep monthly canonical cost within budget for a self-funded research program and because direct A/B testing showed the cheaper variants produce equivalent brand rankings.
Direct A/B test (May 2026): we ran the ChatGPT cell on gpt-5.4-mini ($0.75/$4.50 per 1M tokens) and gpt-5.5 ($5/$30 per 1M tokens) head-to-head across three segments (mens-fashion, fs-banking, beauty-skincare), 60 brands × 20 prompts × 1 engine cell × 2 model variants. 1,200 atomic verdicts per variant. Result:
- Spearman ρ = 0.889 average across the three segments (rank correlation of the two leaderboards). 1.00 would be identical rankings; 0.0 would be independent.
- Top-3 brands stable in 2 of 3 segments; the third (mens-fashion) had a cosmetic top-1 swap with both candidates within a 4 MVI point window in both versions.
- Top-10 overlap = 90% on average. Long tail of the leaderboard is durable.
- Mean MVI |Δ| = 7.8 pts; p90 = 18 pts; max = 21 pts. Middle-pack brands show higher cross-model variance than top or bottom of the leaderboard.
Decision: stay on the cheaper variant. The flagship costs ~6.7× more per token ($5 vs $0.75 input, $30 vs $4.50 output) and runs ~3× slower per call, but produces equivalent rankings within the noise floor of a single monthly snapshot. Absolute MVI scores in the middle of the leaderboard should be read as ranges (±5-10 pts) rather than point estimates.
Re-run criteria: when a new flagship lands (gpt-5.6, claude-opus-5, etc.), when a paying customer asks for tighter precision in the middle of the leaderboard, or when a model deprecation forces the choice. Raw audit data committed under data/research/ab-tests/2026-05-09T02-*/.
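For readers who want to run the same style of comparison on their own data, the three statistics reported above (Spearman ρ, top-k overlap, and mean absolute MVI delta) can be sketched as follows. This is not the audit harness itself; the function names are hypothetical, both leaderboards are assumed to cover the same brands, and tied scores are assumed to share their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Rank values descending (1 = highest); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either list is entirely tied.
    """
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / sqrt(var_a * var_b)


def top_k_overlap(a, b, k):
    """Share of the top-k brands under `a` that are also top-k under `b`."""
    top_a = set(sorted(a, key=a.get, reverse=True)[:k])
    top_b = set(sorted(b, key=b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


def compare_leaderboards(a, b, k=10):
    """Compare two {brand: MVI} leaderboards covering the same brands."""
    brands = sorted(a)
    xs = [a[x] for x in brands]
    ys = [b[x] for x in brands]
    deltas = [abs(x - y) for x, y in zip(xs, ys)]
    return {
        "spearman_rho": spearman_rho(xs, ys),
        "top_k_overlap": top_k_overlap(a, b, k),
        "mean_abs_delta": sum(deltas) / len(deltas),
        "max_abs_delta": max(deltas),
    }
```

Each argument is a mapping of brand name to MVI for one model variant within a single segment; segment-level results would then be averaged to get panel-level figures.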
The four dimensions of AI visibility
The MVI is composed across four buyer-intent phases. Each phase tests a different relationship to brand visibility. The exact weighting is part of mapou's proprietary methodology.
Discovery
How shoppers find brands when they have not yet filtered by anything. Open category questions, top-brands lists, current-year recommendations, popular brands.
Why it matters: Rewards household-name recall. Where most established brand equity lives. The hardest phase to move into for brands that are not yet cited.
Filtered discovery
How shoppers find brands when they HAVE filtered, by budget, persona, use case, or values. Premium options, beginner-friendly, sustainable, professional-grade, and similar.
Why it matters: Rewards specialization. Niche brands often outperform leaders here. Where indie and challenger brands win share.
Comparison
How shoppers evaluate between named options. Head-to-head comparisons, alternatives to default-choice brands, best-for-an-attribute prompts, premium-vs-value tradeoffs.
Why it matters: Rewards brand authority in evaluative contexts. Where being cited as a viable alternative matters as much as being the leader.
Evaluation
How shoppers decide. Decision criteria, reliability and longevity, what to avoid, general buying advice.
Why it matters: Rewards being cited as a category authority. AI engines lean heavily on brands they associate with category expertise here.
Behavioral stage framework
mapou describes AI visibility as a sequence of three behavioral stages plus a pre-encounter state. The MVI score maps to where a brand sits in this sequence right now.
MVI 75+
The brand is the go-to recommendation in AI answers within its segment, often appearing first or most consistently. The risk now is competitor catch-up; the next step is AI commerce integration (ChatGPT Shopping, Perplexity Shopping).
MVI 50–74
The brand is cited regularly enough across prompts and engines that it feels familiar and reliably present, but not yet the default. Focus here is targeted citation strategy and attribution that ties AI visibility to revenue.
MVI 25–49
The brand is discovered and cited occasionally in AI answers, but not consistently enough to feel reliably present. Most fixable stage; gaps usually cluster around catalog architecture and third-party citation density.
MVI 0–24
AI does not yet cite the brand for buyer-intent queries in this segment. Almost every brand starts here. The work to move into first encounter is well-defined and starts with an AI Visibility Audit.
Derived insights, beyond the score
The MVI score answers “how visible is this brand?” but most strategic questions need a layer up. Five derived metrics computed from the same data, surfaced on every per-segment report and aggregated across all segments on the State of AI Search:
- Cross-engine divergence (0–1). One minus the mean pairwise Spearman rank correlation across the five engines. Near 0 = all engines rank brands similarly; near 1 = the engines act as independent channels. Tells you whether AI search is one channel or five for your category.
- Effective number of brands. The headline concentration metric, computed as 1 / Σ(MVI share²), the inverse Simpson index (equivalently a Hill number of order 2, equivalently 1 / HHI with shares on a 0–1 basis, or 10,000 / HHI with shares in percentage points). Reads as “AI effectively recommends X brands out of N tracked.” Self-normalizing for sample size, so a 10-brand segment is directly comparable to a 20-brand one. Reported alongside the top-2 share for narrative clarity. Tells challenger brands whether the segment is dislodge-incumbents or show-up-at-all territory. (Gini coefficient is also computed and stored for back-compat; not surfaced in the UI because it carries an income-inequality analogy that doesn't map cleanly to AI-search visibility.)
- Discovery → Evaluation leakage (per brand, ±1). Citation rate in Discovery prompts minus citation rate in Evaluation prompts. Positive value = the brand is awareness-rich and conversion-poor in AI answers. Surfaces brands AI knows but won't recommend at the buying moment.
- Mention-but-not-cited gap (per brand, 0–1). Share of total visibility that came as mention only. AI uses the brand as context but rarely surfaces it as the answer. Recognition without recommendation.
- Kingmaker engine per phase. For each segment and funnel phase, the engine where the citation-rate gap between most-cited and least-cited brand is widest. That's where positioning matters most. Win that engine, win that phase.
All five run at render time on the published research data. Numbers move when AI behavior moves, no recompute step needed.
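Four of the derived metrics above reduce to a few lines each. The sketch below is illustrative, not the render-time code: function names and input shapes are hypothetical, MVI share is assumed to be a brand's MVI divided by the segment total, and "total visibility" in the mention gap is assumed to mean cited plus mentioned verdicts.

```python
def effective_number_of_brands(mvi_by_brand):
    """Inverse Simpson index: 1 / sum(share^2), shares taken from MVI."""
    total = sum(mvi_by_brand.values())
    if total <= 0:
        raise ValueError("at least one brand needs a positive MVI")
    return 1.0 / sum((v / total) ** 2 for v in mvi_by_brand.values())


def leakage(discovery_rate, evaluation_rate):
    """Discovery citation rate minus Evaluation citation rate (-1 to +1)."""
    return discovery_rate - evaluation_rate


def mention_gap(cited, mentioned):
    """Share of visible verdicts (cited + mentioned) that were mention-only."""
    visible = cited + mentioned
    return mentioned / visible if visible else 0.0


def kingmaker_engine(rates_by_engine):
    """Engine with the widest gap between most- and least-cited brand.

    rates_by_engine: engine -> {brand: citation rate} for one segment
    and one funnel phase.
    """
    def gap(engine):
        rates = rates_by_engine[engine].values()
        return max(rates) - min(rates)

    return max(rates_by_engine, key=gap)
```

As a sanity check on the concentration metric: four brands with equal MVI give an effective number of 4, while a segment where one brand holds all the visibility gives 1.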
Statistical rigor
Each brand has 100 atomic observations per monthly run. With sample sizes this large, we can compute meaningful confidence intervals on the underlying citation rate.
We use Wilson 95% confidence intervals, which are more accurate than the normal approximation for binary outcomes, particularly at small sample sizes and at citation rates near 0% or 100%. Every brand on the leaderboard shows MVI ± CI bounds. Brands with similar MVI but non-overlapping CIs are statistically distinguishable.
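For reference, the Wilson score interval on a raw citation rate can be computed as below. This is the standard textbook formula; how it is carried over to the phase-weighted MVI is not specified here, so the sketch applies it to a simple cited-out-of-n count. The function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

With 10 cited verdicts out of 20 the interval is roughly 30% to 70%; with 50 out of 100 it tightens to roughly 40% to 60%.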
Engine agreement: we report what fraction of (brand, prompt) pairs all engines agreed on. Higher agreement = higher confidence in the verdict.
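A minimal sketch of the agreement statistic, assuming verdicts arrive as (brand, prompt, engine, verdict) tuples and that "agreed" means every engine returned the same one of the three verdict classes. The tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def engine_agreement(verdicts):
    """Fraction of (brand, prompt) pairs on which every engine agreed.

    verdicts: iterable of (brand, prompt, engine, verdict) tuples.
    """
    by_pair = defaultdict(set)
    for brand, prompt, _engine, verdict in verdicts:
        by_pair[(brand, prompt)].add(verdict)
    if not by_pair:
        raise ValueError("no verdicts supplied")
    unanimous = sum(1 for seen in by_pair.values() if len(seen) == 1)
    return unanimous / len(by_pair)
```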
Replicability
The MVI is built on a fixed prompt set generated once per category and reused every monthly run. The same brands are tested every month. Methodology version is locked (currently v1.0); changing the prompt set, weights, or thresholds requires a version bump.
All results are persisted with run-ID timestamps and immutable archives. Brand-level MVI history is queryable from the day this methodology was first run.
Re-running the same prompts every month means MVI deltas are paired comparisons, not noise. A brand moving from MVI 42 → 51 month-over-month reflects real signal change, not random LLM variability.
Pre-registered claims
Every monthly report ships with a small set of falsifiable, dated predictions about what the next run will show. Each prediction names a metric, a threshold, and a direction, and gets graded green or red against the data when that next run lands. Predictions are committed before the data exists, not chosen post-hoc: the same discipline empirical science papers use to prevent narrative-fitting.
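The shape of a pre-registered claim (metric, threshold, direction, commit date) and its green/red grading could be represented as below. This is an illustrative data structure only; the class name, field names, and the "above"/"below" vocabulary are assumptions, not the format used in the published reports.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Prediction:
    metric: str          # name of the metric being predicted
    threshold: float     # value the metric is compared against
    direction: str       # "above" or "below"
    registered_on: date  # commit date, before the next run exists

    def grade(self, observed, run_date):
        """Return "green" if the claim held, "red" otherwise."""
        if run_date <= self.registered_on:
            raise ValueError("claim must be registered before the run")
        if self.direction == "above":
            held = observed > self.threshold
        elif self.direction == "below":
            held = observed < self.threshold
        else:
            raise ValueError(f"unknown direction: {self.direction!r}")
        return "green" if held else "red"
```

Freezing the dataclass mirrors the intent of pre-registration: once a claim is recorded, its threshold and direction cannot be edited after the data arrives.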
The current cycle's active claims and the running track record are public on the State of AI Search page. Segment-scoped predictions also surface on the relevant per-segment report.
Why this matters: a methodology that can only be confirmed retroactively is not a methodology, it is a storytelling apparatus. Pre-registration is the difference between an analysis that could have come out differently and one that could not.
What we don't claim
Honest disclosure of what the measurement does not measure:
- Snapshot, not continuous. Each run is a moment in time. AI assistants are stochastic; results vary minute-to-minute.
- Region-specific. All API calls originate from US East. Brand visibility likely differs in other markets.
- Coverage gaps. Microsoft Copilot, Google AI Overviews, and voice assistants are not yet tested.
- Conflict of interest disclosed. mapou is a consultancy that benefits from AI search adoption. Scores are not adjusted for client or non-client status.
- One snapshot is not a trend. Single-month scores are diagnostic. Time-series gets meaningful at month 3+.
- Calibrated to high-intent queries. All 20 canonical prompts target commerce decisions across Discovery, Filtered Discovery, Comparison, and Evaluation. Findings should not be generalized to navigational or broad informational search behavior, which still rewards traditional SEO.
- Upstream influence, not downstream performance. MVI measures presence in AI-mediated discovery, analogous to impression share in early paid search. It is not a proxy for conversion lift, branded-search uplift, or assisted revenue. Tying citation rate to outcomes requires first-party clickstream and incrementality testing, which we ship as paid engagements.
- Hallucination not yet scored. Every (brand × prompt × engine) verdict is currently classified as cited / mentioned / invisible. We do not yet flag when an engine misattributes a product, confuses categories, or invents a brand. A failure-mode classification is on the v1.2 roadmap.
- API surface, not personalized chat. The canonical MVI run uses stateless engine APIs: no account, no chat history, no Memory, no custom instructions, no browsing personalization. Same prompt and same model produce the same distribution of answers regardless of who runs them, within sampling error. We do not capture how chatgpt.com, claude.ai, gemini.google.com, or perplexity.ai behave for a logged-in user with personalization signals enabled. Those consumer surfaces can vary per user. We measured the size of that effect via API system-prompt variants (no consumer-app scraping): leaders flip in 5 of 5 segments tested under at least one persona vs baseline, mean top-3 overlap drops to 67%. See Finding 08 on the macro page.
- Shopping-assistant framing, by design. The MVI baseline uses a shopping-assistant system prompt ("answer the user's question naturally and recommend specific brands and products where relevant"). This is the right scope for measuring commerce signals, but it is not zero-frame neutral. We isolated the framing effect by comparing it to a minimal "answer the user's question concisely" prompt across 5 segments: shopping framing inflates citation rates by 20-40pp on top brands and changes the leader in 4 of 5 segments. Baseline measures "what AI shopping assistants recommend," not "what AI engines say in raw form." Both are valid; the MVI deliberately measures the former.
Statistical caveats
Some claims in this methodology and on the research pages are well-powered inferential statements; others are exact descriptive observations of our specific panel. Disclosing the difference, layer by layer:
- Per-brand MVI scores have Wilson 95% confidence intervals. Computed and displayed on every leaderboard. With n=100 verdicts per brand (20 prompts × 5 engines), the interval on a 50% citation rate is roughly ±10 percentage points, so differences larger than that are outside sampling noise and smaller gaps should be read as ties.
- Per-engine personalization sensitivity (Finding 09) has bootstrap 95% CIs. Each engine has n=270 (segment × persona) overlap observations across the panel. The 8-percentage-point spread between the most stable engine (Claude) and the most sensitive (Gemini) clears every pairwise confidence interval. The ranking is statistically real, not artifact.
- Per-cell citation rates (single segment × single engine × single persona) have wider precision. At n=20 prompts per cell, a 50% citation rate has a Wilson 95% CI of roughly 30-70%. Anywhere we cite specific small differences between cells (for example, "ChatGPT 70% vs Grok 70%"), readers should treat the comparison as suggestive within ±10pp of noise. Large effects (citation rate 0% baseline → 90% under one persona) survive this noise.
- Aggregate counts on n=27 categories are exact for our panel, descriptive not inferential. When we say "6 of 27 categories have 3+ different baseline leaders across the 5 engines," that is an exact count of our specific 22-category panel. As a generalization to all consumer categories, the binomial 95% CI on 6/22 is roughly 12-48%. The directional finding holds; precise extrapolation does not.
- "Leader held" / "unmovable leader" claims are descriptive at n=10 personas. When a brand holds #1 across all 10 buyer personas tested, that is literally true at n=10 — but you cannot statistically distinguish "always" from "9 times out of 10" with this sample size. The accurate phrasing is "held #1 across all 10 buyer personas tested," not "is unmovable." We are tightening any "unmovable" language in marketing copy to match this distinction.
- Cross-engine baseline disagreement is descriptive of the prompts we ran, not the population. The 6 categories with 3+ distinct baseline leaders are exact for our 20 canonical prompts per segment. A different (also legitimate) prompt set could produce different counts. We ship the same 20 prompts every month so time-series readings are comparable to themselves.
In short: directional claims and large-effect findings are well-supported. Precise small-effect claims and aggregate proportions on n=27 should be read as descriptive observations of the panel, not population parameters. Where we use phrases like "always" or "every category" we mean "in every case we tested," with the sample size implied by the surrounding context.
Author
Methodology designed by Arvin Nundloll, formerly Director of Strategy and Business Development at Comcast Advertising, with prior roles at NBCUniversal, Amazon, and DIRECTV. MBA, William & Mary. Based in New York City.
The full methodology, prompt set, and custom benchmarks.
The exact phase weights, the full canonical prompt set, the replication harness, and a custom MVI benchmark for your specific brand and competitive set are part of mapou's engagement deliverables. We also build private dashboards that track your MVI weekly with stability subsampling, run targeted audits at the prompt-template level, and ship a punch list of fixes.
- Custom prompt set tailored to your specific category and competitive set
- Exact MVI scoring weights + your brand's phase-by-phase decomposition
- Weekly stability subsampling for variance-aware reporting
- Methodology audit-ready documentation for board / investor reporting
- Private dashboard with month-over-month deltas and prompt-level diagnostics