Agentic Design Quality Control
Control generic AI UI tendencies with context, critique loops, and evaluation gates.
Key takeaways
- Generic AI UI (hero sections, nested cards, weak density) is usually a context and evaluation problem, not just a model limitation.
- Counter typical biases by stating the product surface type, defining card rules, providing semantic color roles, and requiring responsive checks.
- Feed a context stack of DESIGN.md, component specs, token docs, example screens, and a review checklist before generation.
- Run a critique loop with a concrete rubric covering surface fit, component reuse, token discipline, density, accessibility, and visual restraint.
- Use a Playwright screenshot loop so evaluation inspects pixels at desktop and mobile widths, since code can type-check while the layout still overflows.
When Codex, Claude Code, or similar agents build UI, they often drift toward generic screens: oversized hero sections, decorative cards, gradients, weak hierarchy, and inconsistent spacing. The cause is rarely the model alone; it is usually missing context and missing evaluation.
Typical AI UI Biases
| Bias | Symptom | Control |
|---|---|---|
| Marketing default | Every app starts with a hero | State the product surface type. |
| Card overload | Cards inside cards, decorative panels | Define card usage rules. |
| One-note palette | Same hue everywhere | Provide semantic color roles. |
| Weak density | Tool UIs become landing pages | Provide dashboard and workflow examples. |
| Text overflow | Buttons and cards break on mobile | Require responsive screenshot checks. |
Context Stack
- DESIGN.md: product feel and visual philosophy.
- Component specs: props, states, variants, slots.
- Token docs: semantic values and forbidden raw values.
- Example screens: approved density, hierarchy, and responsive behavior.
- Review checklist: screenshot, accessibility, layout, and text-fit checks.
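One way to make the stack enforceable is to refuse generation when a layer is missing. The sketch below is a minimal illustration, not a prescribed format: the `ContextLayer` type, the function names, and the `##` section headers in the assembled preamble are all assumptions, and the caller is expected to load file contents itself.

```typescript
// Hypothetical shape for one layer of the context stack. Field names are
// illustrative and not tied to any agent's API.
type ContextLayer = {
  label: string;   // one of REQUIRED_LAYERS, e.g. "DESIGN.md"
  purpose: string; // short reminder of why the layer is included
  body: string;    // document contents, loaded by the caller
};

// The five layers listed above, in the order they are fed to the agent.
const REQUIRED_LAYERS = [
  "DESIGN.md",
  "Component specs",
  "Token docs",
  "Example screens",
  "Review checklist",
];

// A layer only counts if it actually carries content.
function hasBody(layer: ContextLayer): boolean {
  return layer.body.trim() !== "";
}

// Returns the required labels that have no non-empty layer.
function missingLayers(layers: ContextLayer[]): string[] {
  return REQUIRED_LAYERS.filter(
    (label) => !layers.some((l) => l.label === label && hasBody(l)),
  );
}

// Assembles the preamble in stack order, regardless of input order.
// Throws instead of generating with an incomplete stack.
function buildContextPreamble(layers: ContextLayer[]): string {
  const missing = missingLayers(layers);
  if (missing.length > 0) {
    throw new Error("Context stack incomplete: " + missing.join(", "));
  }
  const sections: string[] = [];
  REQUIRED_LAYERS.forEach((label) => {
    layers
      .filter((l) => l.label === label && hasBody(l))
      .forEach((l) => {
        sections.push(`## ${l.label} (${l.purpose})\n${l.body.trim()}`);
      });
  });
  return sections.join("\n\n");
}
```

Failing early here is a deliberate choice: an agent that starts without token docs or example screens tends to fall back on the generic defaults described above.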
Critique Loop
Generate -> screenshot -> critique against DESIGN.md -> revise -> run checks -> summarize evidence
The important part is the critique rubric. Do not ask only "make it better." Ask the agent to check surface type, hierarchy, density, token usage, mobile layout, and forbidden visual patterns.
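The loop above can be sketched as a small driver with each step injected as a function. Everything named here (`LoopSteps`, `Finding`, `critiqueLoop`, the `maxRounds` cap of 3) is hypothetical; the sketch assumes no particular agent SDK or screenshot tool, and treats a failed check as one more finding to revise against.

```typescript
// Hypothetical types: a finding is one rubric violation reported by critique.
type Finding = { dimension: string; note: string };

type LoopSteps = {
  generate: (feedback: Finding[]) => Promise<void>;  // write or revise the UI
  screenshot: () => Promise<string[]>;               // returns image paths
  critique: (shots: string[]) => Promise<Finding[]>; // check against DESIGN.md
  runChecks: () => Promise<boolean>;                 // typecheck, lint, tests
};

type LoopReport = {
  rounds: number;
  passed: boolean;
  screenshots: string[];
  openFindings: Finding[];
};

async function critiqueLoop(steps: LoopSteps, maxRounds = 3): Promise<LoopReport> {
  let feedback: Finding[] = [];
  let shots: string[] = [];
  let passed = false;
  let rounds = 0;

  while (rounds < maxRounds && !passed) {
    rounds += 1;
    await steps.generate(feedback);         // first pass, or revise with feedback
    shots = await steps.screenshot();       // evaluate pixels, not only code
    feedback = await steps.critique(shots); // concrete rubric, not "make it better"
    if (feedback.length > 0) continue;      // revise in the next round

    if (await steps.runChecks()) {
      passed = true;
    } else {
      feedback = [{ dimension: "Checks", note: "typecheck or lint failed" }];
    }
  }

  // Summarize evidence: what was captured and what is still open.
  return { rounds, passed, screenshots: shots, openFindings: feedback };
}
```

Capping the rounds keeps the loop from turning into open-ended polishing; whatever is still open after the last round is reported rather than hidden.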
Review Rubric
Use a small, repeatable rubric so agent revisions do not become subjective polishing.
| Dimension | Pass condition |
|---|---|
| Surface fit | The screen matches the actual product surface type. |
| Component reuse | Existing primitives and composites are used before new UI is created. |
| Token discipline | Colors, spacing, radius, typography, and motion come from approved tokens. |
| Density | Information density matches the user workflow and viewport. |
| Accessibility | Labels, focus, keyboard behavior, and contrast are not deferred. |
| Responsiveness | Long labels, tables, buttons, and controls fit at target breakpoints. |
| Visual restraint | Decorative elements serve the task instead of filling space. |
| Evidence | The agent reports commands, screenshots, and remaining assumptions. |
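To keep the rubric repeatable, it can be encoded as data and used as a gate. The following is a rough sketch under stated assumptions: the dimension names come from the table above, while the `Verdict` shape, the `gate` function, and the rule that a pass without evidence counts as a failure are illustrative choices.

```typescript
// Dimension names mirror the rubric table; everything else is hypothetical.
const RUBRIC = [
  "Surface fit",
  "Component reuse",
  "Token discipline",
  "Density",
  "Accessibility",
  "Responsiveness",
  "Visual restraint",
  "Evidence",
] as const;

type Dimension = (typeof RUBRIC)[number];
type Verdict = { pass: boolean; evidence: string };
type Review = Partial<Record<Dimension, Verdict>>;

// Returns ok only when every dimension was reviewed, passed, and cites evidence.
function gate(review: Review): { ok: boolean; failures: string[] } {
  const failures: string[] = [];
  RUBRIC.forEach((dimension) => {
    const verdict = review[dimension];
    if (!verdict) {
      failures.push(`${dimension}: not reviewed`);
    } else if (!verdict.pass) {
      failures.push(`${dimension}: ${verdict.evidence}`);
    } else if (verdict.evidence.trim() === "") {
      failures.push(`${dimension}: pass claimed without evidence`);
    }
  });
  return { ok: failures.length === 0, failures };
}
```

Treating an unreviewed dimension as a failure prevents the agent from passing the gate by silently skipping the harder checks.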
Playwright Screenshot Loop
For meaningful UI changes, the evaluation loop should inspect pixels, not only code.
1. Start the dev server.
2. Open the changed route at desktop width.
3. Capture a screenshot and any console errors.
4. Repeat at a narrow mobile width.
5. Check text fit, overlapping elements, focus states, and empty/loading/error states.
6. Revise until the screenshots satisfy the rubric.
The screenshot loop is especially important for AI output because the code can type-check while the interface still has overflowing labels, broken density, or a marketing layout where an operational tool was requested.
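Steps 2 through 4 can be sketched as a single capture function. To keep the example free of dependencies, it is written against `PageLike`, a hypothetical interface that mirrors a small subset of Playwright's `Page` (`setViewportSize`, `goto`, `screenshot`, and the `console` event); the intent is that a real Playwright page can be passed in its place. The viewport sizes, the file naming scheme, and the function names are assumptions, and starting the dev server (step 1) is left to the caller.

```typescript
// Hypothetical minimal interfaces, modeled on a subset of Playwright's Page
// and ConsoleMessage so the sketch does not import the library.
interface ConsoleMessageLike {
  type(): string;
  text(): string;
}
interface PageLike {
  setViewportSize(size: { width: number; height: number }): Promise<void>;
  goto(url: string): Promise<unknown>;
  screenshot(options: { path: string; fullPage: boolean }): Promise<unknown>;
  on(event: "console", listener: (msg: ConsoleMessageLike) => void): unknown;
}

// Assumed target widths; substitute the project's real breakpoints.
const VIEWPORTS = [
  { label: "desktop", width: 1440, height: 900 },
  { label: "mobile", width: 375, height: 812 },
];

// Turns a route such as "/settings/billing" into a stable file name.
function shotPath(outDir: string, route: string, label: string): string {
  const slug =
    route.replace(/^\/+|\/+$/g, "").replace(/[^a-zA-Z0-9]+/g, "-") || "index";
  return `${outDir}/${slug}.${label}.png`;
}

// Opens the changed route at each viewport and collects the evidence.
async function captureRoute(
  page: PageLike,
  baseUrl: string,
  route: string,
  outDir: string,
): Promise<{ screenshots: string[]; consoleErrors: string[] }> {
  const consoleErrors: string[] = [];
  page.on("console", (msg) => {
    if (msg.type() === "error") consoleErrors.push(msg.text());
  });

  const screenshots: string[] = [];
  for (const viewport of VIEWPORTS) {
    await page.setViewportSize({ width: viewport.width, height: viewport.height });
    await page.goto(baseUrl + route);
    const file = shotPath(outDir, route, viewport.label);
    await page.screenshot({ path: file, fullPage: true });
    screenshots.push(file);
  }
  return { screenshots, consoleErrors };
}
```

The returned paths and console errors are the evidence for the critique step. Text fit, overlap, focus states, and empty/loading/error states (step 5) still need to be judged from the images or driven by additional interactions that this sketch does not cover.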
Agent Instruction Template
Before editing UI:
- Read DESIGN.md and the relevant component specs.
- Identify the product surface type and density target.
- Use existing components and semantic tokens.
- Do not invent variants, decorative wrappers, or raw style values.
After editing UI:
- Run the relevant typecheck or lint command.
- Capture desktop and mobile screenshots.
- Compare the result against the review rubric.
- Report remaining assumptions and missing design-system coverage.
Acceptance Criteria
- Uses existing components and tokens.
- Matches the surface type: tool, dashboard, docs, marketing page, or game.
- Has no text overflow at target breakpoints.
- Has visible focus and accessible labels.
- Has screenshot evidence for UI changes.
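The first criterion can be partly automated with a scan for raw style values in changed files. This is a heuristic sketch only: the three patterns are assumptions about what a project counts as a raw value, they will produce false positives (for example, a CSS id selector such as `#add` looks like a hex color), and `findRawValues` is a hypothetical name.

```typescript
// Assumed patterns for values that should come from tokens instead.
// None use the global flag, so RegExp.test carries no state between calls.
const RAW_VALUE_PATTERNS = [
  { kind: "hex color", pattern: /#[0-9a-fA-F]{3,8}\b/ },
  { kind: "rgb/hsl color", pattern: /\b(?:rgb|hsl)a?\(/ },
  { kind: "pixel literal", pattern: /\b\d+(?:\.\d+)?px\b/ },
];

type RawValueHit = { line: number; kind: string; text: string };

// Reports every line of a style or component source that matches a pattern.
function findRawValues(source: string): RawValueHit[] {
  const hits: RawValueHit[] = [];
  source.split("\n").forEach((text, index) => {
    RAW_VALUE_PATTERNS.forEach(({ kind, pattern }) => {
      if (pattern.test(text)) {
        hits.push({ line: index + 1, kind, text: text.trim() });
      }
    });
  });
  return hits;
}
```

Hits are better treated as prompts for review than as hard failures, since some raw values (a one-pixel border, for instance) may be allowed by the token docs.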