A component system is one of those engineering miracles that only becomes "visible" when it fails. When it's healthy, product teams ship faster, UI stays consistent, and the library feels boring in the best possible way. When it's not, everyone gets hit at once: spacing drift, broken focus rings, "why did this dropdown stop working with the keyboard?", and layout regressions that show up in production screenshots before they show up in tests.

The root cause usually isn't incompetence. It's that we test component libraries like features (unit tests and a few snapshots) instead of like infrastructure (contracts, compatibility guarantees, release gates, and aggressive automation). A component library isn't a single app; it's a platform. Platforms need guardrails that catch the classes of failures that unit tests cannot see: visuals, interaction behavior, accessibility semantics, and performance regressions.

This article lays out a repeatable strategy: contract tests to prevent behavioral drift, visual regression to catch rendering changes, accessibility gates to stop usability backsliding, and performance budgets to keep the system from slowly turning into a dependency iceberg. The goal is not more tests. The goal is fewer surprises.
## TL;DR

- **Unit tests are necessary but insufficient** for component libraries; they miss visual drift, focus bugs, keyboard regressions, and semantic a11y issues.
- **Contract tests** define what must always remain true across versions: states, invariants, keyboard behavior, and focus management.
- **Visual regression** should be scoped to high-risk components and stabilized to reduce flake (freeze time, disable animations, deterministic data).
- **Accessibility gates** should fail CI for *new* serious/critical violations, plus require a lightweight manual keyboard checklist for interactive changes.
- **Performance budgets** (bundle size + render timing) keep the library from getting heavier and slower over time.
- **CI should be layered:** fast checks on every PR; deeper suites on main; matrix and long-running checks nightly.

## 1. Treat Your Component System Like Infrastructure (Because It Is)

A component system is an API surface, not just UI.
Even if it's "just buttons and modals," it functions as shared infrastructure for many teams and many code paths. That changes what "correctness" means. Infrastructure has properties that product code often doesn't:

- **Many consumers:** the "same" component is embedded in different layouts, different themes, different routing stacks, and different performance envelopes.
- **Long life:** the library will outlive multiple apps, redesigns, and framework upgrades.
- **Compatibility expectations:** consumers expect upgrades to be safe, predictable, and reversible.
- **High leverage:** a small regression multiplies across the ecosystem.

So instead of asking, "Does it work in isolation?", ask infrastructure questions:

- What behaviors are consumers depending on (even implicitly)?
- What guarantees must not change without a major version bump?
- What failures are catastrophic (e.g., broken keyboard interaction in common flows)?
- What signals should block a release?

Testing "like infrastructure" means your test suite is designed to prevent drift, detect regressions early, and make failures diagnosable. It should be opinionated about what matters.
## 2. Why Unit Tests Alone Don't Protect Component Libraries

Unit tests are great at verifying logic you own: formatting, reducers, pure utilities, state machines. But component libraries fail in places that unit tests don't naturally cover.

**Visual drift**

- A token change, CSS refactor, typography update, or layout tweak can subtly break spacing and alignment.
- Unit tests rarely detect "this is 2px off" or "text wraps one line earlier."

**Interaction regressions**

- "Click opens the menu" is not the same as "Tab order is correct" or "Escape closes and focus returns to the trigger."
- Focus traps, roving `tabindex`, and `aria-activedescendant` patterns can break without obvious runtime errors.

**Accessibility semantics**

- The accessibility tree is not your TypeScript type system.
- Roles, labels, name computation, and state announcements are runtime truths.

**Integration realities**

- Consumers embed components in messy layouts: nested scroll containers, portals, stacked modals, dynamic content, RTL, and localization expansion.
- "Rendered output" in a unit test is not equivalent to "works in the real constraints consumers apply."
Unit tests are still essential. They're just the innermost layer. For component systems, you need additional layers that test the public guarantees: contracts, visuals, and accessibility.

Here's a testing pyramid that fits design systems better than the classic "mostly unit, a few end-to-end" framing:

```
            /\
           /  \
          /E2E \          (few)   Cross-component user journeys
         /------\
        / Visual \        (some)  Rendering + key states per component
       /----------\
      /  Contract  \      (many)  Invariants: keyboard, focus, semantics
     /--------------\
    /  Unit / Logic  \    (many)  Pure functions, state machines, helpers
   /__________________\
```

The goal is to push most confidence into the contract + visual + a11y layers, where component regressions actually live, while keeping E2E limited to a small set of representative flows.

## 3. Define Contracts: States, Invariants, Keyboard, and Focus

A contract test is a test for behavior that consumers rely on, independent of implementation. Think of it as "what cannot change without breaking someone." The trick is to explicitly separate:

- **States:** the component's supported variants and modes
- **Invariants:** what must always remain true across those states

### States: what to cover without exploding permutations

For each component, define a small set of contract cases that represent real usage.
Good state coverage usually includes:

- **Interactive states:** default, hover, active, focus-visible
- **Disabled and read-only:** including "disabled but focusable" patterns where relevant
- **Loading and async:** loading indicator, skeleton, pending state
- **Validation:** error and helper text, invalid state announcements
- **Content extremes:** long labels, long values, truncation, wrapping
- **Direction and locale:** at least one RTL case; at least one "long language" case
- **Theme variants:** light/dark/high-contrast, only if the system supports them

You are not trying to snapshot every combination. You're building a set of cases that are likely to catch drift and regression.

### Invariants: what to assert (the high-signal checklist)

For interactive components, the invariants that matter most are:

**Keyboard invariants**

- Tab reaches the component in the expected order.
- Enter/Space activates where appropriate.
- Escape cancels/closes where appropriate.
- Arrow keys behave as documented (e.g., list navigation).
- No keyboard dead-ends (focus never disappears into the void).
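One way to make these keyboard invariants concrete is to model them as a pure state machine that tests can assert against without a DOM. The sketch below is illustrative, not from any real library; the names (`MenuState`, `onKey`) are assumptions, and it models a simple menu: Enter opens, Escape closes and restores focus to the trigger, and arrow keys clamp at the ends so focus never falls off the edge.

```typescript
// Hypothetical reducer modeling the keyboard invariants above.
type MenuState = {
  open: boolean;
  activeIndex: number;     // which item is active while the menu is open
  itemCount: number;
  focusOnTrigger: boolean; // focus must never disappear into the void
};

function onKey(state: MenuState, key: string): MenuState {
  if (!state.open) {
    // Enter/Space opens and moves focus to the first item.
    if (key === "Enter" || key === " ") {
      return { ...state, open: true, activeIndex: 0, focusOnTrigger: false };
    }
    return state;
  }
  switch (key) {
    case "Escape":
      // Closing must return focus to the trigger, never drop it.
      return { ...state, open: false, focusOnTrigger: true };
    case "ArrowDown":
      // Clamp at the last item: no keyboard dead-end.
      return { ...state, activeIndex: Math.min(state.activeIndex + 1, state.itemCount - 1) };
    case "ArrowUp":
      return { ...state, activeIndex: Math.max(state.activeIndex - 1, 0) };
    default:
      return state;
  }
}
```

A contract test then drives real keyboard events at the component and asserts that the DOM reflects the same transitions as the model.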
**Focus invariants**

- Focus is visible (focus ring or equivalent).
- On open: focus moves to the correct element (often the first focusable element or the active item).
- On close: focus returns to a sensible place (usually the trigger).
- Traps are intentional and scoped (dialogs trap; menus usually don't).

**Semantic invariants**

- Correct role (`button`, `dialog`, `listbox`, etc.).
- An accessible name exists (labeling is not optional).
- States are represented (`aria-expanded`, `aria-checked`, `aria-selected`, `aria-invalid`).
- Relationships exist (label ↔ input, trigger ↔ popover via `aria-controls` or similar patterns).

**Structural invariants**

- No duplicate IDs in the rendered subtree.
- No leaking forbidden props to the DOM.
- Stable test hooks exist (`data-testid` or equivalent), used consistently.

A useful mental rule: if a consumer would file a bug titled "This broke our flow," it belongs in the contract.

## 4. Build a Contract Test Harness That Scales

The reason contract testing often fails in real life is repetition. Teams write one-off tests per component until the suite becomes inconsistent, slow, and unmaintainable.
The fix is a harness: a shared way to register component cases and apply shared invariants. The harness should make it easy to:

- enumerate "cases" (component examples)
- apply shared assertions (keyboard, focus, accessible name)
- allow component-specific assertions without copy/paste chaos

Below is a minimal example. It's deliberately generic in spirit: you can adapt it to your UI stack, whether you render to DOM, a webview layer, or a testable host environment.

```tsx
// contract-harness.test.tsx
import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";

type ContractCase = {
  name: string;
  render: () => JSX.Element;
  setup?: (user: ReturnType<typeof userEvent.setup>) => Promise<void> | void;
  assert?: (user: ReturnType<typeof userEvent.setup>) => Promise<void> | void;
  // Allow opt-outs when a component doesn't support certain invariants
  supportsKeyboardActivate?: boolean;
};

async function assertHasAccessibleName() {
  const el = screen.getByTestId("contract-target");
  const name =
    el.getAttribute("aria-label") ||
    el.getAttribute("aria-labelledby") ||
    el.textContent?.trim();
  expect(name && name.length > 0).toBe(true);
}

async function assertFocusIsVisible(user: ReturnType<typeof userEvent.setup>) {
  // Move focus via keyboard to reflect real usage
  await user.tab();
  const focused = document.activeElement as HTMLElement | null;
  expect(focused).toBeTruthy();
  // Contract: focus must be discoverable.
  // In a real system, this might be a class, a data attribute, or a computed style rule.
  expect(focused!).toHaveAttribute("data-focus-visible", "true");
}

async function assertKeyboardActivates(user: ReturnType<typeof userEvent.setup>) {
  const el = screen.getByTestId("contract-target");
  el.focus();
  await user.keyboard("{Enter}");
  // Contract: activation produces an observable change.
  // Your cases should expose a stable observable (ARIA state, dataset flag, etc.).
  expect(el).toHaveAttribute("data-activated", "true");
}

function runContractSuite(componentName: string, cases: ContractCase[]) {
  describe(`${componentName} contracts`, () => {
    for (const c of cases) {
      test(c.name, async () => {
        const user = userEvent.setup();
        render(c.render());

        // Baseline invariants
        await assertHasAccessibleName();
        await assertFocusIsVisible(user);

        if (c.setup) await c.setup(user);

        // Keyboard activation is common but not universal
        if (c.supportsKeyboardActivate !== false) {
          await assertKeyboardActivates(user);
        }

        if (c.assert) await c.assert(user);
      });
    }
  });
}

// Example: keep cases small and observable
const ButtonCases: ContractCase[] = [
  {
    name: "activates via keyboard",
    render: () => (
      <button
        data-testid="contract-target"
        data-focus-visible="true"
        data-activated="false"
        onClick={(e) => (e.currentTarget.dataset.activated = "true")}
      >
        Continue
      </button>
    ),
  },
];

runContractSuite("Button", ButtonCases);
```

### What makes this harness effective

- **Shared invariants are centralized:** you don't re-invent focus tests per component.
- **Cases are minimal:** each case exposes a stable observable, which makes failures debuggable.
- **Opt-outs are explicit:** if a component doesn't support a behavior, it's documented in code.

If you want to go further, add common invariant packs like:

- "overlay behavior" (Escape closes, outside click closes, focus return)
- "list navigation" (arrow keys, selection semantics)
- "form field semantics" (labeling, `aria-invalid`, helper text relationships)

That's how you grow coverage without growing chaos.

## 5. Visual Regression: Scope It, Stabilize It, and Treat Baselines as Artifacts

Visual regression testing catches the class of bugs that humans notice instantly and unit tests ignore entirely: misalignment, spacing drift, truncated labels, missing hover/focus states, and theming regressions.

The challenge is reliability. If your visual tests are flaky, teams stop trusting them. If teams stop trusting them, they stop looking at diffs. If they stop looking at diffs, the tests become theater.

### Scope: snapshot what's high-risk, not everything

A good visual regression scope prioritizes:

- **Highly reused components:** buttons, inputs, selects, menus, dialogs
- **Complex interaction surfaces:** date pickers, comboboxes, nested menus
- **Token-heavy surfaces:** anything where design tokens drive spacing/typography
- **Known drift magnets:** layout primitives and typography components (only if widely used)

Avoid the trap of snapshotting every permutation.
Instead:

- choose a small set of cases per component
- include the most failure-prone states (hover, focus-visible, disabled, error)
- include one "content extreme" case (long labels / long values)

### Flake reduction: make the renderer deterministic

Most visual flake comes from nondeterminism. Kill it systematically:

- **Disable animations and transitions** in test mode
- **Freeze time** and mock locale-dependent formatting
- **Deterministic data:** no random IDs, stable content ordering
- **Font stability:** avoid network font loading; ensure consistent fonts in CI
- **Fixed viewport and DPR:** keep screenshot geometry consistent
- **Wait for "settled" UI:** fonts loaded, layout stable, no pending microtasks
- **Mask dynamic regions** (timestamps, counters) if they can't be stabilized

The intent is not "pixel perfection across all machines." The intent is "pixel stability in CI," which gives you high-signal diffs.
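The "deterministic data" rule above is mostly about replacing two sources of randomness: generated IDs and generated fixture content. A minimal sketch, with illustrative names (`createIdFactory`, `fixtureLabels`) that are not from any specific library:

```typescript
// Counter-based IDs instead of Math.random(), so snapshots are stable across runs.
function createIdFactory(prefix: string) {
  let n = 0;
  return () => `${prefix}-${++n}`;
}

// Tiny deterministic PRNG (mulberry32) for generating fixture content.
function seededRandom(seed: number) {
  return () => {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same seed => same labels => same pixels, every CI run.
function fixtureLabels(count: number, seed = 42): string[] {
  const rand = seededRandom(seed);
  const words = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"];
  return Array.from({ length: count }, () => words[Math.floor(rand() * words.length)]);
}
```

Wire these into test mode only; production code keeps its real ID generation.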
### Baselines: handle them like infrastructure releases

Baselines are not noise; they're the reference artifacts that define "expected UI." Practical baseline rules:

- Baseline updates happen **only in PRs** (never by pushing directly).
- Every baseline update requires a short explanation: "token update," "bug fix," "layout improvement."
- Visual diffs must be reviewed by someone accountable for system quality.
- Keep baselines near the cases that generated them so maintenance stays local.

Visual regression works best when paired with contract testing:

- **Contracts** tell you "behavior broke."
- **Visual diffs** tell you "appearance changed."

Together they tell you "this change is intended or not," quickly.

## 6. Accessibility Gates: Automated Checks + Manual Keyboard Checklist + CI Policy

Accessibility is not a polish layer in a component system. It's a core contract. If the library gets accessibility wrong, every consumer inherits it, and fixing it later can be costly because it becomes a breaking change.
The winning approach is a combination:

- **Automated a11y checks** for fast coverage
- **Manual keyboard checklist** for interaction truth
- **CI gating** that prevents regressions without freezing progress

### Automated checks: what they catch well

Automated tools are good at:

- missing labels / empty names
- invalid ARIA roles/attributes
- common semantic violations (e.g., button-like divs without roles)
- basic heading/landmark issues when you test within a scaffold

They are not sufficient for:

- correct focus order across complex overlays
- intent-dependent semantics (what should be a button vs. a menu item)
- "feels usable" outcomes

### Gate on new serious/critical violations (a practical policy)

A common failure mode: the first time you run a11y checks, they find legacy issues. Teams panic, turn off the checks, and move on. Don't do that. Instead, gate on **new** high-impact issues. That creates forward progress without blocking everything.

Here's a pattern for that: treat a11y findings like a baseline, and fail CI only when a PR introduces new serious/critical violations for the components it touches.
```ts
// a11y-regression.test.ts
import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";
import fs from "node:fs";

type AxeViolation = {
  id: string;
  impact?: "minor" | "moderate" | "serious" | "critical";
  nodes: Array<{ target: string[] }>;
};

function key(v: AxeViolation) {
  const targets = v.nodes.flatMap((n) => n.target).join("|");
  return `${v.id}:${v.impact ?? "unknown"}:${targets}`;
}

test("Select: no new serious/critical a11y violations", async ({ page }) => {
  await page.goto("http://localhost:6006/?path=/story/select--default");

  // Interact if needed to reveal popover/listbox states
  await page.keyboard.press("Tab");
  await page.keyboard.press("Enter");

  const results = await new AxeBuilder({ page })
    .disableRules([]) // keep empty unless you intentionally disable something
    .analyze();

  const violations: AxeViolation[] = results.violations as any;

  // Load baseline of known violations (committed JSON).
  const baselinePath = "a11y-baselines/select-default.json";
  const baseline: string[] = fs.existsSync(baselinePath)
    ? JSON.parse(fs.readFileSync(baselinePath, "utf-8"))
    : [];

  const current = violations.map(key);

  // Gate: block *new* serious/critical violations
  const newHighImpact = violations
    .filter((v) => v.impact === "serious" || v.impact === "critical")
    .map(key)
    .filter((k) => !baseline.includes(k));

  expect(
    newHighImpact,
    `New high-impact a11y violations:\n${newHighImpact.join("\n")}`
  ).toEqual([]);

  // Optional: enforce that the baseline doesn't grow silently on main
  // (i.e., if `current` has more keys than the baseline, require an explicit baseline update).
});
```

### Manual keyboard checklist: small, required, high-signal

For interactive components, require a simple checklist whenever behavior changes:

- Tab reaches the component reliably
- Focus is visible and not hidden behind overlays
- Enter/Space activates the primary action
- Escape closes/cancels where appropriate
- Focus moves on open and returns on close
- Arrow keys behave as documented (list navigation, selection)
- No accidental focus traps (unless intentionally a modal)
- The primary task is completable without a pointer device

This is quick to do and catches what automated rules can't.

### When CI should fail

A pragmatic gating policy that works:

- **On every PR:** fail on new serious/critical a11y violations in changed component cases.
- **On main:** run a broader sweep; fail if the baseline grows unexpectedly.
- **Nightly:** run the full matrix (themes, browsers, device profiles) to catch environment-sensitive issues.
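This staged policy can be expressed as one pure decision function, so the PR, main, and nightly rules live in a single testable place rather than scattered across CI scripts. A minimal sketch, with illustrative names (`Stage`, `Finding`, `shouldFail`) that are assumptions, not a real API:

```typescript
// Hypothetical gate: decide whether a CI stage should fail given a11y findings
// and the committed baseline of known issues.
type Stage = "pr" | "main" | "nightly";
type Finding = { key: string; impact: "minor" | "moderate" | "serious" | "critical" };

function shouldFail(stage: Stage, findings: Finding[], baseline: Set<string>): boolean {
  const isNew = (f: Finding) => !baseline.has(f.key);
  if (stage === "pr") {
    // PRs: block only *new* high-impact findings in touched components.
    return findings.some(
      (f) => (f.impact === "serious" || f.impact === "critical") && isNew(f)
    );
  }
  // main / nightly: broader sweep; fail if the known-issue set grew at all.
  return findings.some(isNew);
}
```

The same shape works for visual and perf baselines: only the `Finding` type and the baseline source change.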
The key is consistency: accessibility has to be treated like a release requirement, not a best-effort suggestion.

## 7. Performance Budgets: Bundle Size + Render Timing, Enforced Like Contracts

Component systems tend to gain weight silently: extra dependencies, duplicated utilities, "temporary" polyfills, and unbounded icon packs. Performance budgets are how you prevent the slow boil. You generally need two kinds of budgets.

### Bundle size budgets (static)

Bundle budgets stop dependency creep. Good budget rules:

- track **per package** (core primitives vs. complex components)
- track **per entrypoint** (so a single import doesn't drag the world)
- report **diffs on PRs** (visibility changes behavior)
- escalate from "warning" to "hard fail" once stable

Practical enforcement:

- run a bundle analyzer in CI
- post a PR comment with the size delta
- fail main merges when thresholds are exceeded without an explicit exception

This isn't about shaving bytes for sport. It's about maintaining predictable costs for consumers.

### Render timing budgets (runtime)

Runtime budgets should focus on regressions, not absolute numbers.
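Both kinds of budget reduce to the same check: compare a measurement against a committed baseline and gate on the delta. A minimal sketch, with illustrative names (`checkBudget`, `reportBudgets`) that are assumptions, not a specific tool's API:

```typescript
// Hypothetical budget gate: works for bytes (bundle size) or ms (render timing).
type BudgetResult = { name: string; ok: boolean; deltaPct: number };

function checkBudget(
  name: string,
  baseline: number,        // committed reference value; must be > 0
  measured: number,        // value from this build
  maxRegressionPct: number // e.g. 5 means "no more than +5% vs baseline"
): BudgetResult {
  const deltaPct = ((measured - baseline) / baseline) * 100;
  return { name, ok: deltaPct <= maxRegressionPct, deltaPct };
}

function reportBudgets(results: BudgetResult[]): string[] {
  // Lines suitable for a PR comment; failing lines are what gate the merge.
  return results.map(
    (r) => `${r.ok ? "PASS" : "FAIL"} ${r.name}: ${r.deltaPct >= 0 ? "+" : ""}${r.deltaPct.toFixed(1)}%`
  );
}
```

Relative gating like this is what keeps the check useful: absolute thresholds rot, but "+25% on the select render" is always worth a look.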
The question is: "Did this component get slower than it was?" A reasonable approach:

- pick a small set of representative cases (a heavy overlay, a list component, a form field)
- measure time-to-interactive or a stable render milestone in a controlled environment
- gate on relative changes (e.g., "no more than +X% regression vs. baseline")

Avoid making the suite too broad. A handful of well-chosen perf checks will catch most accidental regressions without creating noise.

## 8. CI Pipeline: Where Each Test Runs (So It's Fast and Real)

Your CI should be shaped by two forces:

1. Developer feedback speed (PRs must be fast)
2. Risk coverage (main/nightly must be deep)

Here's a clean, scalable pipeline:

```
PR opened ----> +-------------------+
                | Lint + Typecheck  |  (fast)
                +-------------------+
                          |
                          v
                +-------------------+
                | Unit + Contracts  |  (medium)
                | (changed comps)   |
                +-------------------+
                          |
                          v
                +-------------------+
                | A11y (gated)      |  (medium)
                | serious/critical  |
                +-------------------+
                          |
                          v
                +-------------------+
                | Visual Regression |  (slower)
                | scoped snapshots  |
                +-------------------+
                          |
                          v
                +-------------------+
                | Bundle Budget     |  (fast/medium)
                +-------------------+

Merge to main --> +-------------------+
                  | Full Visual Suite |
                  +-------------------+
                            |
                            v
                  +-------------------+
                  | Full A11y Sweep   |
                  +-------------------+
                            |
                            v
                  +-------------------+
                  | Perf Smoke (few)  |
                  +-------------------+

Nightly ------> +-------------------+
                | Cross-browser /   |
                | platform matrix   |
                +-------------------+
```
Key choices that keep this sane:

• Scope by change detection: if only Button changed, don't rerun every snapshot in the galaxy.
• Run contracts early: contract failures are high-signal and usually easy to debug.
• Put visual tests after contract/a11y: don't burn minutes on screenshots if the basics are broken.
• Nightly matrix: where you pay the cost for cross-browser runs, multiple themes, and larger suites.

9. Pitfalls & Fixes

Pitfall: "Our contract tests are just snapshots with extra steps."
Fix: Contracts must assert behavioral guarantees (keyboard, focus, semantics). Snapshots can support them, not replace them.

Pitfall: Visual tests are flaky, so nobody trusts them.
Fix: Stabilize the environment (disable animations, freeze time, use deterministic data), and reduce scope to high-risk states.

Pitfall: A11y checks fail constantly, so they get turned off.
Fix: Gate only on new serious/critical issues at first. Baseline existing debt, then ratchet quality upward.
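That baseline-and-ratchet approach can be sketched as a filter over scanner output. The violation shape below loosely mirrors axe-core results (rule id plus impact level); the baseline keying is a hypothetical convention, not part of any tool:

```typescript
// Sketch: fail CI only on NEW serious/critical a11y violations, tolerating
// a recorded baseline of existing debt. The violation shape loosely mirrors
// axe-core results; the baseline handling is illustrative.

interface A11yViolation {
  id: string;                                           // e.g. "color-contrast"
  impact: "minor" | "moderate" | "serious" | "critical";
  target: string;                                       // selector of the offending node
}

const GATED_IMPACTS = new Set(["serious", "critical"]);

function newGatedViolations(
  found: A11yViolation[],
  baseline: Set<string>, // keys of known, already-triaged violations
): A11yViolation[] {
  return found.filter(
    (v) => GATED_IMPACTS.has(v.impact) && !baseline.has(`${v.id}:${v.target}`),
  );
}

// Example: one known contrast issue is baselined; a new missing-name
// violation on a button fails the gate, while a moderate issue passes.
const baseline = new Set(["color-contrast:.legacy-banner"]);
const gated = newGatedViolations(
  [
    { id: "color-contrast", impact: "serious", target: ".legacy-banner" },
    { id: "button-name", impact: "critical", target: ".icon-button" },
    { id: "region", impact: "moderate", target: "main" },
  ],
  baseline,
);
// gated contains only the "button-name" violation
```

Ratcheting then means that whenever a baselined violation is fixed, its key is removed from the baseline so the issue can never silently come back.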
Pitfall: Baseline updates become political.
Fix: Require a short "why this changed" note in the PR, plus a reviewer who owns design/system quality.

Pitfall: Tests are too slow, so teams bypass them.
Fix: Use change-based scoping on PRs. Move the heavy matrix to nightly. Keep PR feedback under a predictable ceiling.

Pitfall: Performance budgets create noise.
Fix: Gate on meaningful deltas and a small set of representative entrypoints. Report budgets in PR comments to create visibility.

Pitfall: Consumers still get surprised by breaking changes.
Fix: Tie contract suites to semantic versioning rules. If the contract changes, it's a breaking change, period.

10. Adoption Checklist

Use this as a rollout plan that won't melt your calendar.

• Define the 5–10 highest-risk components (dialogs, menus, selects, inputs) as the first wave.
• For each, write contract cases that cover key states plus edge cases (long text, disabled, error).
• Implement a shared contract harness (keyboard, focus, semantics) and require new components to plug into it.
• Add visual regression for those components only; stabilize the environment (no animations, frozen time, deterministic fixtures).
• Add a11y automation (axe or equivalent) and gate on new serious/critical violations in changed components.
• Create a manual keyboard checklist and require it for interactive component changes (in the PR template or review rubric).
• Add bundle size reporting in CI; start with soft warnings, then graduate to a hard budget for main.
• Add a small perf smoke suite (a few representative component cases) and gate on regression deltas.
• Move the cross-browser / platform matrix to nightly once PR time is under control.
• Document that "contract change = breaking change" and enforce it in code review.

Conclusion

Treat your component system like a critical service: define contracts, test behavior at the boundaries, and put automated gates in CI so regressions can't merge. The goal isn't "more tests"; it's higher confidence per change. Start by locking down a small set of high-leverage checks (a11y assertions, visual diffs, and contract tests for props and state) and make them fast and mandatory. Once those are stable, expand coverage through generated test matrices and reusable harnesses, not manual one-off snapshots.