Agentic Design Quality Control

Control generic AI UI tendencies with context, critique loops, and evaluation gates.

Key takeaways

Generic AI UI (hero sections, nested cards, weak density) is usually a context and evaluation problem, not just a model limitation.
Counter typical biases by stating the product surface type, defining card rules, providing semantic color roles, and requiring responsive checks.
Feed a context stack of DESIGN.md, component specs, token docs, example screens, and a review checklist before generation.
Run a critique loop with a concrete rubric covering surface fit, component reuse, token discipline, density, accessibility, and visual restraint.
Use a Playwright screenshot loop so evaluation inspects pixels at desktop and mobile widths, since code can type-check while the layout still overflows.

When Codex, Claude Code, or similar agents build UI, they often drift toward generic screens: oversized hero sections, decorative cards, gradients, weak hierarchy, and inconsistent spacing. The model is only part of the cause. More often the agent was given too little design context and its output was never evaluated against a concrete standard.

Typical AI UI Biases

Bias	Symptom	Control
Marketing default	Every app starts with a hero	State the product surface type.
Card overload	Cards inside cards, decorative panels	Define card usage rules.
One-note palette	Same hue everywhere	Provide semantic color roles.
Weak density	Tool UIs become landing pages	Provide dashboard and workflow examples.
Text overflow	Buttons and cards break on mobile	Require responsive screenshot checks.

Context Stack

DESIGN.md: product feel and visual philosophy.
Component specs: props, states, variants, slots.
Token docs: semantic values and forbidden raw values.
Example screens: approved density, hierarchy, and responsive behavior.
Review checklist: screenshot, accessibility, layout, and text-fit checks.
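One way to keep the stack from silently shrinking is to assemble it in code and refuse to generate when a layer is missing. The sketch below is an illustration under assumptions: the `ContextLayer` shape and the function names are hypothetical, and loading the file contents is left to the caller.

```typescript
// Hypothetical shape for one layer of the context stack.
interface ContextLayer {
  name: string;    // e.g. "DESIGN.md"
  purpose: string; // what the agent should take from this layer
  content: string; // file contents, loaded by the caller
}

// The five layers listed above, in the order they are fed to the agent.
const REQUIRED_LAYERS: string[] = [
  "DESIGN.md",
  "Component specs",
  "Token docs",
  "Example screens",
  "Review checklist",
];

// Returns the first layer with the given name that has non-empty content.
function findLayer(layers: ContextLayer[], name: string): ContextLayer | undefined {
  return layers.find((l) => l.name === name && l.content.trim() !== "");
}

// Names of required layers that are absent or empty.
function missingLayers(layers: ContextLayer[]): string[] {
  return REQUIRED_LAYERS.filter((name) => findLayer(layers, name) === undefined);
}

// Joins the layers into one preamble; throws if any required layer is missing.
function buildContextPreamble(layers: ContextLayer[]): string {
  const missing = missingLayers(layers);
  if (missing.length > 0) {
    throw new Error("Context stack incomplete: " + missing.join(", "));
  }
  const sections: string[] = [];
  for (const name of REQUIRED_LAYERS) {
    const layer = findLayer(layers, name);
    if (layer !== undefined) {
      sections.push("## " + layer.name + " (" + layer.purpose + ")\n" + layer.content.trim());
    }
  }
  return sections.join("\n\n");
}
```

Failing loudly on a missing layer is a design choice: an agent that never saw the token docs tends to produce raw values that look plausible in code review.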

Critique Loop

Generate -> screenshot -> critique against DESIGN.md -> revise -> run checks -> summarize evidence

The important part is the critique rubric. Do not ask only "make it better." Ask the agent to check surface type, hierarchy, density, token usage, mobile layout, and forbidden visual patterns.
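As an illustration of a concrete critique request, the sketch below turns the six checks named above into a prompt that demands a verdict and evidence per check. The function name, its parameters, and the exact wording of the prompt are assumptions, not a prescribed format.

```typescript
// The six checks named in the paragraph above.
const CRITIQUE_CHECKS: string[] = [
  "surface type",
  "hierarchy",
  "density",
  "token usage",
  "mobile layout",
  "forbidden visual patterns",
];

// Builds a critique prompt for one screen.
// `surfaceType` and `forbidden` are supplied by the caller (hypothetical inputs).
function buildCritiquePrompt(surfaceType: string, forbidden: string[]): string {
  const forbiddenText = forbidden.length > 0 ? forbidden.join(", ") : "none listed";
  const lines: string[] = [
    "Critique the attached screenshots against DESIGN.md.",
    "Intended surface type: " + surfaceType + ".",
    "Forbidden visual patterns: " + forbiddenText + ".",
    "For each check, answer PASS or FAIL and cite the evidence:",
  ];
  CRITIQUE_CHECKS.forEach((check, i) => {
    lines.push(i + 1 + ". " + check);
  });
  lines.push("List a concrete revision for every FAIL. Do not suggest general polish.");
  return lines.join("\n");
}
```

For example, `buildCritiquePrompt("dashboard", ["hero section", "nested cards"])` yields a ten-line prompt whose numbered items map one-to-one onto the checks, so the agent's answer can be compared across revisions.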

Review Rubric

Use a small, repeatable rubric so agent revisions do not become subjective polishing.

Dimension	Pass condition
Surface fit	The screen matches the actual product surface type.
Component reuse	Existing primitives and composites are used before new UI is created.
Token discipline	Colors, spacing, radius, typography, and motion come from approved tokens.
Density	Information density matches the user workflow and viewport.
Accessibility	Labels, focus, keyboard behavior, and contrast are handled in this change, not deferred to a later pass.
Responsiveness	Long labels, tables, buttons, and controls fit at target breakpoints.
Visual restraint	Decorative elements serve the task instead of filling space.
Evidence	The agent reports commands, screenshots, and remaining assumptions.
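The rubric can also be encoded as data so that a revision either passes or names the dimensions that block it. This is a minimal sketch; the type names and the rule that a verdict without evidence counts as unreviewed are assumptions layered on top of the table above.

```typescript
// The eight rubric dimensions from the table above.
const DIMENSIONS = [
  "Surface fit",
  "Component reuse",
  "Token discipline",
  "Density",
  "Accessibility",
  "Responsiveness",
  "Visual restraint",
  "Evidence",
] as const;

type Dimension = (typeof DIMENSIONS)[number];

// One reviewer (or agent) verdict for a single dimension.
interface RubricResult {
  dimension: Dimension;
  pass: boolean;
  evidence: string; // command output, screenshot path, or a short note
}

interface RubricVerdict {
  passed: boolean;
  failing: Dimension[];    // reviewed with evidence, but did not pass
  unreviewed: Dimension[]; // no result, or a result without evidence
}

// A screen passes only if every dimension has a passing result with evidence.
function evaluateRubric(results: RubricResult[]): RubricVerdict {
  const failing: Dimension[] = [];
  const unreviewed: Dimension[] = [];
  for (const dimension of DIMENSIONS) {
    const result = results.find((r) => r.dimension === dimension);
    if (result === undefined || result.evidence.trim() === "") {
      unreviewed.push(dimension);
    } else if (!result.pass) {
      failing.push(dimension);
    }
  }
  return {
    passed: failing.length === 0 && unreviewed.length === 0,
    failing,
    unreviewed,
  };
}
```

Treating an unevidenced "pass" as unreviewed keeps the gate from being satisfied by assertion alone.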

Playwright Screenshot Loop

For meaningful UI changes, the evaluation loop should inspect pixels, not only code.

1. Start the dev server.
2. Open the changed route at desktop width.
3. Capture a screenshot and any console errors.
4. Repeat at a narrow mobile width.
5. Check text fit, overlapping elements, focus states, and empty/loading/error states.
6. Revise until the screenshots satisfy the rubric.

The screenshot loop is especially important for AI output because the code can type-check while the interface still has overflowing labels, broken density, or a marketing layout where an operational tool was requested.
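Steps 2 to 4 can be sketched as a small function. Several things here are assumptions rather than part of the original workflow: the viewport sizes, the output file names, the `PageLike` interface (a narrowed view of the Playwright `Page` methods used, so the function can also be exercised with a stub), and the overflow heuristic, which only detects page-level horizontal scrolling and will not catch text clipped inside an element.

```typescript
// Narrowed view of the Playwright Page methods this sketch relies on.
// A real Playwright Page is assumed to satisfy this shape.
interface PageLike {
  setViewportSize(size: { width: number; height: number }): Promise<void>;
  goto(url: string): Promise<unknown>;
  screenshot(options: { path: string; fullPage: boolean }): Promise<unknown>;
  evaluate(expression: string): Promise<unknown>;
  on(event: "console", listener: (msg: { type(): string; text(): string }) => void): unknown;
}

interface Viewport { name: string; width: number; height: number }

// Assumed target widths; replace with the project's real breakpoints.
const VIEWPORTS: Viewport[] = [
  { name: "desktop", width: 1280, height: 800 },
  { name: "mobile", width: 375, height: 667 },
];

interface ViewportReport {
  viewport: string;
  screenshotPath: string;
  horizontalOverflow: boolean;
  consoleErrors: string[];
}

// Loads the route at each viewport, saves a screenshot, and records
// console errors plus a page-level horizontal overflow flag.
async function captureRoute(page: PageLike, url: string, outDir: string): Promise<ViewportReport[]> {
  const errors: string[] = [];
  page.on("console", (msg) => {
    if (msg.type() === "error") errors.push(msg.text());
  });

  const reports: ViewportReport[] = [];
  for (const vp of VIEWPORTS) {
    errors.length = 0; // collect errors per viewport
    await page.setViewportSize({ width: vp.width, height: vp.height });
    await page.goto(url);
    const screenshotPath = outDir + "/" + vp.name + ".png";
    await page.screenshot({ path: screenshotPath, fullPage: true });
    const overflow = await page.evaluate(
      "document.documentElement.scrollWidth > document.documentElement.clientWidth"
    );
    reports.push({
      viewport: vp.name,
      screenshotPath,
      horizontalOverflow: overflow === true,
      consoleErrors: errors.slice(),
    });
  }
  return reports;
}

// Wiring with Playwright itself (not executed here; the URL is hypothetical):
//   import { chromium } from "playwright";
//   const browser = await chromium.launch();
//   const page = await browser.newPage();
//   const reports = await captureRoute(page, "http://localhost:3000/settings", "screenshots");
//   await browser.close();
```

The returned reports cover only what can be measured mechanically. Focus states and the empty, loading, and error states from step 5 still need a look at the screenshots themselves.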

Agent Instruction Template

Before editing UI:
- Read DESIGN.md and the relevant component specs.
- Identify the product surface type and density target.
- Use existing components and semantic tokens.
- Do not invent variants, decorative wrappers, or raw style values.

After editing UI:
- Run the relevant typecheck or lint command.
- Capture desktop and mobile screenshots.
- Compare the result against the review rubric.
- Report remaining assumptions and missing design-system coverage.
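The "no raw style values" instruction is easier to enforce when a check backs it up. Below is a rough heuristic scanner, offered as a sketch: the function name and the two patterns (hex colors and `px` lengths) are assumptions, and a real project would tune them to its own token docs. It can produce false positives, for example on an ID selector such as `#add`.

```typescript
interface RawValueFinding {
  line: number; // 1-based line number in the scanned source
  value: string;
  kind: "hex-color" | "px-length";
}

// Scans style or component source for values that bypass semantic tokens.
// Heuristic only: it matches text patterns and does not parse CSS or JSX.
function findRawStyleValues(source: string): RawValueFinding[] {
  const hexColor = /#[0-9a-fA-F]{3,8}\b/g;
  const pxLength = /\b\d+(?:\.\d+)?px\b/g;
  const findings: RawValueFinding[] = [];

  source.split("\n").forEach((text, index) => {
    const line = index + 1;
    for (const value of text.match(hexColor) || []) {
      findings.push({ line, value, kind: "hex-color" });
    }
    for (const value of text.match(pxLength) || []) {
      findings.push({ line, value, kind: "px-length" });
    }
  });
  return findings;
}
```

Run against changed files after editing, the findings give the agent a specific list to fix or justify, and anything it cannot map to a token belongs in the report of missing design-system coverage.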

Acceptance Criteria

Uses existing components and tokens.
Matches the surface type: tool, dashboard, docs, marketing page, or game.
Has no text overflow at target breakpoints.
Has visible focus and accessible labels.
Has screenshot evidence for UI changes.

