Production Operations Patterns

Model routing, caching, observability (traces/spans), cost tracking, SLO definition, security/governance, guardrails, scaling

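Key Takeaways

The bar for production is not whether the system works, but whether it works within predictable cost and latency, auditability, and safety boundaries. The system is divided into Request Gate, Orchestration, Model, Tool, and Observability layers.
For observability, propagate request_id from parent to child agents and follow the OpenTelemetry GenAI Semantic Conventions. For traces, hybrid sampling is recommended: 10-20% head-based for ordinary requests plus 100% tail-based for errors, high-cost, and high-risk requests.
Control cost by pre-allocating a per-agent token budget (with a 15-20% reserve for retries), using prompt caching (up to 85-89% savings), and setting multi-level cost alerts (info/warning/critical, escalating to throttle/halt).
For security, isolate read/write/admin tiers under the principle of least privilege, record every action in an immutable audit log structured as who/what/when/context, and check compliance per domain: finance, healthcare, and personal data.
Define SLOs along three axes (latency, success rate, cost). On a breach, automatically trigger a circuit breaker, a model downgrade, or the disabling of optional agents, and freeze deployments when the error budget is exhausted.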

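The question to ask in production is not "does it work" but whether it works within predictable cost and latency, auditability, and safety boundaries. Once orchestration goes into operation, traces, policy, and rollout strategy matter as much as model quality.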

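Layers Required in the Operational Architecture

Layer	Responsibility	Representative controls
Request Gate	Authentication, rate limiting, baseline policy	tenant quota, abuse prevention
Orchestration	Routing, state transitions, approvals	workflow, checkpoint
Model Layer	Model selection and fallback	cost/latency routing
Tool Layer	Access to external systems	permission separation, audit
Observability	trace, span, cost, incidents	correlation id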

Figure: production architecture layers

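Model Routing Strategy

Situation	Recommended strategy
Simple classification/filtering	Low-cost, low-latency model
Planning and complex reasoning	High-quality model
evaluator	Highly consistent model or a rule hybrid
Large-scale batch	Cache first, then prefer a lower-cost model

The point is not "always use the best model" but to match the quality that each step actually needs.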
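As a minimal sketch of per-step routing, the table above can be expressed as a lookup from step kind to model tier. The names `StepKind`, `ModelTier`, and `routeStep` are hypothetical and exist only for this illustration; a real router would also weigh latency budgets and fallbacks.

```typescript
// Hypothetical step kinds and model tiers, named only for this sketch.
type StepKind = 'classify' | 'plan' | 'evaluate' | 'batch'
type ModelTier = 'low-cost' | 'high-quality' | 'consistent'

interface RouteDecision {
  tier: ModelTier
  useCache: boolean
}

// Maps each step kind to the tier suggested in the routing table.
function routeStep(kind: StepKind): RouteDecision {
  switch (kind) {
    case 'classify':
      return { tier: 'low-cost', useCache: false }
    case 'plan':
      return { tier: 'high-quality', useCache: false }
    case 'evaluate':
      return { tier: 'consistent', useCache: false }
    case 'batch':
      return { tier: 'low-cost', useCache: true }
  }
}
```

Keeping the mapping in one function makes the routing policy reviewable and testable separately from the agents that use it.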

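Caching Points

Cache target	Suitable when	Caveats
Retrieval results	Queried often and rarely changing	Needs a freshness policy
Classification results	Input can be normalized	Re-evaluate on drift
Expensive eval	The same cases are evaluated repeatedly	Invalidate on model change
Shared summaries	Summaries of team-wide documents	Store source and version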
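The caveats in the table (freshness, invalidation on model change, stored version) can be sketched as a small in-memory cache. This is an illustration under assumptions: `VersionedTtlCache` and its method names are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```typescript
interface CacheEntry<T> {
  value: T
  storedAtMs: number
  sourceVersion: string
}

class VersionedTtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>()
  private ttlMs: number

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs
  }

  // The key includes the model name, so switching models misses the cache.
  static key(kind: string, modelName: string, normalizedInput: string): string {
    return `${kind}:${modelName}:${normalizedInput}`
  }

  set(key: string, value: T, sourceVersion: string, nowMs: number): void {
    this.entries.set(key, { value, storedAtMs: nowMs, sourceVersion })
  }

  // Returns undefined when the entry is missing, older than the TTL,
  // or was stored for a different source version.
  get(key: string, currentVersion: string, nowMs: number): T | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    const expired = nowMs - entry.storedAtMs > this.ttlMs
    if (expired || entry.sourceVersion !== currentVersion) {
      this.entries.delete(key)
      return undefined
    }
    return entry.value
  }
}
```

A production cache would live in a shared store rather than process memory; the point here is only that TTL, model name, and source version all take part in the hit decision.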

Observability

Without traces, a multi-agent system in production is practically impossible to debug. At a minimum, the following fields should be linked end to end.

request_id
run_id
agent_id
tool_name
model_name
latency_ms
input_tokens
output_tokens
cost_estimate
risk_level

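Trace Instrumentation in Depth

Observability for a multi-agent system goes beyond plain logging; structured trace propagation is the core of it. If the context is lost when a parent agent calls a child agent, the root cause of a failure cannot be traced.

request_id Propagation Pattern

When a parent invokes a child agent, the trace context must always be passed along.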

// Structure for propagating trace context
interface TraceContext {
  traceId: string       // ID that spans the entire request
  spanId: string        // ID of the current unit of work
  parentSpanId?: string // parent span ID (absent for the root span)
  baggage: Record<string, string> // extra propagated data (tenant, env, etc.)
}

function propagateContext(
  parentCtx: TraceContext,
  operationName: string
): TraceContext {
  return {
    traceId: parentCtx.traceId,
    spanId: crypto.randomUUID(),
    parentSpanId: parentCtx.spanId,
    baggage: { ...parentCtx.baggage },
  }
}

// When a parent delegates to a child agent
async function delegateToChild(
  childAgent: Agent,
  task: string,
  parentCtx: TraceContext
) {
  const childCtx = propagateContext(parentCtx, `agent.${childAgent.name}`)
  const span = tracer.startSpan(childCtx, {
    'agent.name': childAgent.name,
    'agent.role': childAgent.role,
  })

  try {
    const result = await childAgent.execute(task, childCtx)
    span.setStatus('ok')
    return result
  } catch (error) {
    span.setStatus('error')
    span.recordException(error)
    throw error
  } finally {
    span.end()
  }
}

Creating Spans Based on the OpenTelemetry GenAI Semantic Conventions

Following OpenTelemetry's GenAI Semantic Conventions keeps spans compatible with a range of observability tools.

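import {
  context,
  trace,
  SpanKind,
  SpanStatusCode,
  TraceFlags,
} from '@opentelemetry/api'

const tracer = trace.getTracer('agent-orchestrator', '1.0.0')

async function tracedLLMCall(
  model: string,
  messages: Message[],
  parentCtx: TraceContext
) {
  // startSpan expects an OpenTelemetry Context as its third argument, so the
  // custom TraceContext is wrapped first. Assumes traceId/spanId are W3C-style
  // hex ids (32 and 16 hex characters); other formats are not usable as a parent.
  const otelParent = trace.setSpanContext(context.active(), {
    traceId: parentCtx.traceId,
    spanId: parentCtx.spanId,
    traceFlags: TraceFlags.SAMPLED,
    isRemote: true,
  })

  const span = tracer.startSpan(
    'gen_ai.chat',
    {
      kind: SpanKind.CLIENT,
      attributes: {
        // GenAI Semantic Conventions
        'gen_ai.system': 'anthropic',
        'gen_ai.request.model': model,
        'gen_ai.request.max_tokens': 4096,
        'gen_ai.request.temperature': 0.7,

        // Custom attributes for agent orchestration
        'agent.id': parentCtx.baggage['agent_id'],
        'agent.step': parentCtx.baggage['step_index'],
        'orchestration.run_id': parentCtx.baggage['run_id'],
      },
    },
    otelParent
  )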

  try {
    const response = await llm.chat({ model, messages })

    // Record response metrics
    span.setAttributes({
      'gen_ai.response.model': response.model,
      // The convention defines finish_reasons as an array of strings
      'gen_ai.response.finish_reasons': [response.stopReason],
      'gen_ai.usage.input_tokens': response.usage.inputTokens,
      'gen_ai.usage.output_tokens': response.usage.outputTokens,
      // Custom attribute (not part of the GenAI conventions)
      'gen_ai.cost.estimate_usd': calculateCost(
        model,
        response.usage.inputTokens,
        response.usage.outputTokens
      ),
    })

    span.setStatus({ code: SpanStatusCode.OK })
    return response
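  } catch (error) {
    // error is `unknown` under strict TypeScript, so narrow it before reading .message
    const err = error instanceof Error ? error : new Error(String(error))
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message })
    span.recordException(err)
    throw error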
  } finally {
    span.end()
  }
}

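Trace Sampling Strategy

Storing every trace makes costs balloon. Choose a sampling strategy that fits a multi-agent system.

Strategy	How it works	Advantages	Disadvantages
Head-based	Sampling decision is made when the request enters	Simple to implement, minimal overhead	Can miss important error traces
Tail-based	Sampling decision is made after the whole trace completes	Reliably captures errors and latency outliers	Memory pressure, more complex collector
Hybrid	Head-based by default + always collect errors/high-cost traces	Balances cost and visibility	More configuration to manage

Recommended Strategy

For multi-agent systems, the hybrid approach is recommended. Manage cost with 10-20% head-based sampling for ordinary requests, and collect 100% via tail-based sampling when an error occurs, an SLO is breached, or a cost threshold is exceeded.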

// Example hybrid sampling configuration
const samplerConfig = {
  // Default: 10% head-based sampling
  defaultRate: 0.1,

  // Always collect when any of the following applies
  alwaysSample: {
    onError: true,                    // an error occurred
    onHighCost: { thresholdUsd: 1.0 }, // cost threshold exceeded
    onSlowResponse: { thresholdMs: 30_000 }, // took longer than 30 seconds
    onHumanEscalation: true,          // human approval was requested
    onHighRisk: true,                 // risk_level === 'high'
  },

  // Per-agent-role sampling overrides
  perAgent: {
    'write-executor': 1.0,  // always collect for the write agent
    'classifier': 0.05,     // reduce the classifier to 5%
  },
}
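One way to read the configuration above is as a keep/drop decision made once a trace has finished. The sketch below is an assumption about how such rules could be evaluated, not the API of any particular collector; `TraceSummary`, `SamplingRules`, and `shouldKeepTrace` are hypothetical names, and the random draw is passed in so the result is reproducible.

```typescript
interface TraceSummary {
  agentRole: string
  hadError: boolean
  costUsd: number
  durationMs: number
  riskLevel: 'low' | 'medium' | 'high'
}

interface SamplingRules {
  defaultRate: number
  costThresholdUsd: number
  slowThresholdMs: number
  perAgent: Record<string, number>
}

// random01 is a number in [0, 1), injected so tests are deterministic.
function shouldKeepTrace(
  t: TraceSummary,
  rules: SamplingRules,
  random01: number
): boolean {
  // Tail-style conditions: always keep errors, expensive, slow, or high-risk traces.
  const mustKeep =
    t.hadError ||
    t.costUsd > rules.costThresholdUsd ||
    t.durationMs > rules.slowThresholdMs ||
    t.riskLevel === 'high'
  if (mustKeep) return true

  // Otherwise fall back to the per-role rate, then the default rate.
  const rate = rules.perAgent[t.agentRole] ?? rules.defaultRate
  return random01 < rate
}
```

In a real deployment the forced conditions are usually evaluated in the collector, because the cost and error status are only known after the run completes.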

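Cost Tracking: Actual Calculations

Cost in a multi-agent system grows far faster than with a single LLM call. When 3-5 agents are chained and each calls the LLM 2-3 times, a single user request swells to 10 or more model calls. Unless the budget is allocated up front and tracked in real time, cost gets out of control.

Token Budget Allocation Strategy

For each request, allocate a per-agent token budget in advance.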

interface AgentBudget {
  agentId: string
  maxInputTokens: number
  maxOutputTokens: number
  maxCostUsd: number
  model: string
  priority: 'critical' | 'normal' | 'optional'
}

interface RunBudgetPlan {
  totalMaxCostUsd: number
  agents: AgentBudget[]
  reservePercent: number  // reserve for retries, as a fraction (0.15-0.20 recommended)
}

// Example: budget allocation for a multi-step analysis pipeline
const analysisPlan: RunBudgetPlan = {
  totalMaxCostUsd: 0.50,
  reservePercent: 0.15,
  agents: [
    {
      agentId: 'classifier',
      maxInputTokens: 2_000,
      maxOutputTokens: 500,
      maxCostUsd: 0.01,
      model: 'claude-haiku',
      priority: 'critical',
    },
    {
      agentId: 'researcher',
      maxInputTokens: 50_000,
      maxOutputTokens: 4_000,
      maxCostUsd: 0.20,
      model: 'claude-sonnet',
      priority: 'critical',
    },
    {
      agentId: 'synthesizer',
      maxInputTokens: 30_000,
      maxOutputTokens: 8_000,
      maxCostUsd: 0.15,
      model: 'claude-sonnet',
      priority: 'critical',
    },
    {
      agentId: 'quality-checker',
      maxInputTokens: 10_000,
      maxOutputTokens: 2_000,
      maxCostUsd: 0.05,
      model: 'claude-haiku',
      priority: 'normal',
    },
  ],
}
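A plan like the one above is only useful if the per-agent caps actually fit under the total once the retry reserve is set aside. The following sketch checks that arithmetic; `checkBudgetPlan` and the trimmed-down types are hypothetical, and `reservePercent` is assumed to be a fraction such as 0.15.

```typescript
interface AgentCap {
  agentId: string
  maxCostUsd: number
}

interface BudgetPlan {
  totalMaxCostUsd: number
  reservePercent: number // fraction of the total held back for retries
  agents: AgentCap[]
}

function checkBudgetPlan(plan: BudgetPlan) {
  const allocatedUsd = plan.agents.reduce((sum, a) => sum + a.maxCostUsd, 0)
  const spendableUsd = plan.totalMaxCostUsd * (1 - plan.reservePercent)
  // Small tolerance so floating-point rounding does not flip the result.
  const fits = allocatedUsd <= spendableUsd + 1e-9
  return { allocatedUsd, spendableUsd, fits }
}

// Same caps as the analysis pipeline example: 0.01 + 0.20 + 0.15 + 0.05 = 0.41
const planCheck = checkBudgetPlan({
  totalMaxCostUsd: 0.5,
  reservePercent: 0.15,
  agents: [
    { agentId: 'classifier', maxCostUsd: 0.01 },
    { agentId: 'researcher', maxCostUsd: 0.2 },
    { agentId: 'synthesizer', maxCostUsd: 0.15 },
    { agentId: 'quality-checker', maxCostUsd: 0.05 },
  ],
})
```

With a 15% reserve the spendable amount is $0.425, so the $0.41 allocation fits; raising the reserve to 20% leaves $0.40 and the same plan no longer fits.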

Prompt Caching ROI Calculation

Using Anthropic's prompt caching substantially reduces the cost of repeated system prompts and long contexts.

Item	Without cache	With cache	Savings
System prompt (2,000 tokens) x 100 calls	$0.60	$0.066 (90% discount on cache hits)	~89%
RAG context (10,000 tokens) x 50 calls	$1.50	$0.195	~87%
Total daily cost (1,000 requests)	$21.00	$3.15	~85%

// Prompt caching ROI calculator
function calculateCachingROI(params: {
  systemPromptTokens: number
  avgContextTokens: number
  dailyRequests: number
  cacheHitRate: number      // expected cache hit rate (0-1)
  inputPricePerMToken: number   // $ per 1M tokens
  cachePricePerMToken: number   // price on a cache hit
  cacheWritePricePerMToken: number // price on a cache write
}) {
  const totalTokens = params.systemPromptTokens + params.avgContextTokens
  const dailyTokens = totalTokens * params.dailyRequests

  // Cost without caching
  const noCacheCost =
    (dailyTokens / 1_000_000) * params.inputPricePerMToken

  // Cost with caching
  const hitTokens = dailyTokens * params.cacheHitRate
  const missTokens = dailyTokens * (1 - params.cacheHitRate)
  const cacheCost =
    (hitTokens / 1_000_000) * params.cachePricePerMToken +
    (missTokens / 1_000_000) * params.cacheWritePricePerMToken

  return {
    dailySavings: noCacheCost - cacheCost,
    savingsPercent: ((noCacheCost - cacheCost) / noCacheCost) * 100,
    monthlyROI: (noCacheCost - cacheCost) * 30,
  }
}

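Parallel vs Sequential Cost Trade-offs

Running agents in parallel reduces latency, but cost is wasted whenever some agents' results turn out to be unnecessary. Sequential execution saves cost, but latency accumulates.

Structure	Expected latency	Expected cost	Suitable when
Fully sequential	Sum of all agent latencies	Minimum (no unnecessary calls)	Budget is limited; each step depends on the previous result
Fully parallel	Latency of the slowest agent	Maximum (everything runs)	The SLA is tight; agents are independent
Hybrid	In between	In between	Some agents are independent, some are dependent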

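// Cost/latency simulation per execution strategy
// avgLatencyMs is not part of AgentBudget, so it is added here as an extra field.
type SimulatedAgent = AgentBudget & { avgLatencyMs: number }

function simulateExecution(
  agents: SimulatedAgent[],
  strategy: 'sequential' | 'parallel' | 'hybrid',
  dependencyGraph: Map<string, string[]>
) {
  const totalCost = agents.reduce((sum, a) => sum + a.maxCostUsd, 0)

  if (strategy === 'sequential') {
    return {
      estimatedLatencyMs: agents.reduce((sum, a) => sum + a.avgLatencyMs, 0),
      estimatedCost: totalCost,
      wastedCost: 0, // no unnecessary calls
    }
  }

  if (strategy === 'parallel') {
    const independentAgents = agents.filter(
      (a) => !dependencyGraph.has(a.agentId)
    )
    return {
      estimatedLatencyMs: Math.max(...agents.map((a) => a.avgLatencyMs)),
      estimatedCost: totalCost,
      wastedCost: independentAgents.length * 0.02, // rough estimate of cost spent on unused results
    }
  }

  // The hybrid case is not modeled here; fail loudly instead of returning undefined.
  throw new Error(`strategy "${strategy}" is not modeled in this simulation`)
}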
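For the hybrid row of the table, a reasonable latency estimate is the longest dependency chain: agents without dependencies start immediately, and each dependent agent starts when its slowest prerequisite finishes. The sketch below assumes an acyclic graph in which each key maps an agent to the agents it waits for; `criticalPathMs` and `TimedAgent` are hypothetical names.

```typescript
interface TimedAgent {
  agentId: string
  avgLatencyMs: number
}

// deps maps an agent id to the ids it must wait for. Assumes the graph has no cycles.
function criticalPathMs(
  agents: TimedAgent[],
  deps: Map<string, string[]>
): number {
  const latency = new Map<string, number>()
  for (const a of agents) latency.set(a.agentId, a.avgLatencyMs)

  // Memoized finish time of each agent, measured from the start of the run.
  const finish = new Map<string, number>()
  const finishTime = (id: string): number => {
    const cached = finish.get(id)
    if (cached !== undefined) return cached
    const upstream = deps.get(id) ?? []
    const start = upstream.reduce((max, d) => Math.max(max, finishTime(d)), 0)
    const done = start + (latency.get(id) ?? 0)
    finish.set(id, done)
    return done
  }

  return agents.reduce((max, a) => Math.max(max, finishTime(a.agentId)), 0)
}

// Illustrative latencies only: classifier -> researcher -> synthesizer,
// with quality-checker running alongside the researcher.
const hybridLatencyMs = criticalPathMs(
  [
    { agentId: 'classifier', avgLatencyMs: 500 },
    { agentId: 'researcher', avgLatencyMs: 4000 },
    { agentId: 'synthesizer', avgLatencyMs: 3000 },
    { agentId: 'quality-checker', avgLatencyMs: 1000 },
  ],
  new Map([
    ['researcher', ['classifier']],
    ['synthesizer', ['researcher']],
    ['quality-checker', ['classifier']],
  ])
)
```

Here the estimate is 7,500 ms, which sits between the fully parallel bound (4,000 ms, the slowest agent) and the fully sequential sum (8,500 ms).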

Setting Cost Alert Thresholds

Catching cost runaways early requires a multi-level alerting scheme.

interface CostAlert {
  level: 'info' | 'warning' | 'critical'
  thresholdUsd: number
  action: 'log' | 'notify' | 'throttle' | 'halt'
  cooldownMinutes: number
}

const costAlertPolicy: CostAlert[] = [
  {
    level: 'info',
    thresholdUsd: 0.50,        // more than $0.50 per request
    action: 'log',
    cooldownMinutes: 0,
  },
  {
    level: 'warning',
    thresholdUsd: 2.00,        // more than $2.00 per request
    action: 'notify',          // Slack/PagerDuty notification
    cooldownMinutes: 5,
  },
  {
    level: 'critical',
    thresholdUsd: 5.00,        // more than $5.00 per request
    action: 'throttle',        // switch automatically to a low-cost model
    cooldownMinutes: 1,
  },
  {
    level: 'critical',
    thresholdUsd: 10.00,       // more than $10.00 per request
    action: 'halt',            // stop execution immediately
    cooldownMinutes: 0,
  },
]

// Aggregate alerts per day/week/month are configured separately
const aggregateCostPolicy = {
  daily: { warningUsd: 100, criticalUsd: 300 },
  weekly: { warningUsd: 500, criticalUsd: 1500 },
  monthly: { warningUsd: 2000, criticalUsd: 5000 },
}
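Given a tiered policy like this, the runtime needs to pick one action for the current request cost. A minimal sketch, assuming the most severe matching tier wins; `actionForCost` and `Threshold` are hypothetical names and cooldown handling is left out.

```typescript
type AlertAction = 'log' | 'notify' | 'throttle' | 'halt'

interface Threshold {
  thresholdUsd: number
  action: AlertAction
}

// Returns the action of the highest threshold the cost exceeds, or null if none.
function actionForCost(costUsd: number, policy: Threshold[]): AlertAction | null {
  let matched: Threshold | null = null
  for (const rule of policy) {
    const exceeded = costUsd > rule.thresholdUsd
    if (exceeded && (!matched || rule.thresholdUsd > matched.thresholdUsd)) {
      matched = rule
    }
  }
  return matched ? matched.action : null
}

const perRequestPolicy: Threshold[] = [
  { thresholdUsd: 0.5, action: 'log' },
  { thresholdUsd: 2.0, action: 'notify' },
  { thresholdUsd: 5.0, action: 'throttle' },
  { thresholdUsd: 10.0, action: 'halt' },
]
```

For example, a request that has accumulated $6.00 maps to `throttle`, and one at $12.00 maps to `halt`; the policy does not need to be sorted for this to hold.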

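Guardrail Design

Guardrail type	Example
Input guardrail	Masking sensitive data, checking allowed domains
Output guardrail	Blocking PII exposure, checking for prohibited phrases
Action guardrail	Certain write tools are forbidden without approval
Cost guardrail	Max steps and max token budget per request

Guardrails should be seen not as a filter bolted on after the model but as a policy layer that spans the whole execution: before, during, and after.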

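Enterprise Operations Foundation

This production operations foundation reflects Enterprise Readiness in the 2026 MCP roadmap and the authentication scheme of A2A v1.0.0.

Area	Implementation pattern	Notes
Audit trail	Event logging at the MCP server/A2A task level, immutable storage	Required for regulatory compliance
SSO integration	OAuth2 authentication via MCP Server Cards + A2A Agent Card auth	Unifies IdP integration
API gateway	Routing, rate limiting, and policy enforcement based on MCP Server Cards + A2A Agent Cards	Tool and agent discovery handled in a single gateway
Multi-tenant isolation	Per-tenant MCP server instances or namespace separation	Leverages Streamable HTTP horizontal scaling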

Agent Security / Governance

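Once a multi-agent system goes into production, security and governance matter as much as technical performance. Agents can call tools, read data, and write to external systems. Without tracking what an agent is allowed to do and what it actually did, neither root-cause analysis nor regulatory response is possible when an incident occurs.

Permission Isolation Pattern (Principle of Least Privilege)

Each agent should hold only the minimum permissions its role requires. If a single agent can reach every tool, one prompt injection or one logic error puts the entire system at risk.

// Per-agent permission definitions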
interface AgentPermissions {
  agentId: string
  allowedTools: ToolPermission[]
  dataAccess: DataAccessPolicy
  networkAccess: NetworkPolicy
  maxTokenBudget: number
  requiresApproval: boolean
}

interface ToolPermission {
  toolName: string
  operations: ('read' | 'write' | 'delete')[]
  resourcePattern: string  // glob pattern (e.g. "docs/*", "users/self")
  rateLimit?: { maxPerMinute: number }
}

interface DataAccessPolicy {
  allowedTables: string[]
  allowedColumns: string[]     // PII columns can be excluded
  rowFilter?: string           // tenant isolation, etc.
  maxRowsPerQuery: number
}

interface NetworkPolicy {
  allowedDomains: string[]     // external domains the agent may reach
  blockedDomains: string[]
  maxRequestsPerMinute: number
}

// Example: Researcher agent (read-only, external API access allowed)
const researcherPermissions: AgentPermissions = {
  agentId: 'researcher',
  allowedTools: [
    {
      toolName: 'web_search',
      operations: ['read'],
      resourcePattern: '*',
      rateLimit: { maxPerMinute: 30 },
    },
    {
      toolName: 'database_query',
      operations: ['read'],
      resourcePattern: 'public.*',
    },
  ],
  dataAccess: {
    allowedTables: ['documents', 'knowledge_base'],
    allowedColumns: ['*'],
    maxRowsPerQuery: 100,
  },
  networkAccess: {
    allowedDomains: ['api.search.com', 'docs.internal.com'],
    blockedDomains: ['*.payment.*'],
    maxRequestsPerMinute: 60,
  },
  maxTokenBudget: 50_000,
  requiresApproval: false,
}

// Example: Writer agent (can write, approval always required)
const writerPermissions: AgentPermissions = {
  agentId: 'writer',
  allowedTools: [
    {
      toolName: 'database_query',
      operations: ['read', 'write'],
      resourcePattern: 'drafts/*',
    },
    {
      toolName: 'publish',
      operations: ['write'],
      resourcePattern: 'content/*',
    },
  ],
  dataAccess: {
    allowedTables: ['drafts', 'content'],
    allowedColumns: ['*'],
    rowFilter: 'tenant_id = :current_tenant',
    maxRowsPerQuery: 10,
  },
  networkAccess: {
    allowedDomains: [],
    blockedDomains: ['*'],
    maxRequestsPerMinute: 0,
  },
  maxTokenBudget: 30_000,
  requiresApproval: true,
}
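Declaring permissions is only half of least privilege; every tool call also has to be checked against them. The sketch below is a deny-by-default check under simplifying assumptions: only `*` is treated as a wildcard in `resourcePattern`, and `isAllowed`, `ToolRule`, and `globToRegExp` are hypothetical helpers, not part of any library.

```typescript
type Operation = 'read' | 'write' | 'delete'

interface ToolRule {
  toolName: string
  operations: Operation[]
  resourcePattern: string
}

// Minimal glob: only '*' is special and matches any run of characters.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*')
  return new RegExp(`^${escaped}$`)
}

// Deny by default: a call is allowed only if some rule matches all three fields.
function isAllowed(
  rules: ToolRule[],
  toolName: string,
  op: Operation,
  resource: string
): boolean {
  return rules.some(
    (r) =>
      r.toolName === toolName &&
      r.operations.includes(op) &&
      globToRegExp(r.resourcePattern).test(resource)
  )
}

// Rules mirroring the writer agent's tool permissions.
const writerRules: ToolRule[] = [
  { toolName: 'database_query', operations: ['read', 'write'], resourcePattern: 'drafts/*' },
  { toolName: 'publish', operations: ['write'], resourcePattern: 'content/*' },
]
```

With these rules, writing to `drafts/42` through `database_query` is allowed, while deleting it, publishing to `users/1`, or calling an unlisted tool is rejected.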

Audit Data Structure (Audit Trail)

Every agent action must be recorded with who performed it, what was done, when, and in what context.

interface AuditRecord {
  // WHO: the acting party
  who: {
    agentId: string
    agentRole: string
    parentAgentId?: string  // parent agent that delegated the work
    userId: string          // user who initiated the request
    tenantId: string
  }

  // WHAT: the action performed
  what: {
    action: string          // 'tool_call', 'llm_call', 'approval_request', etc.
    toolName?: string
    operation?: 'read' | 'write' | 'delete'
    resourceId?: string     // target resource
    inputSummary: string    // input summary (PII masked)
    outputSummary: string   // output summary (PII masked)
    tokensUsed?: { input: number; output: number }
    costUsd?: number
  }

  // WHEN: timing
  when: {
    timestamp: string       // ISO 8601
    durationMs: number
    traceId: string
    spanId: string
  }

  // CONTEXT: surrounding circumstances
  context: {
    requestId: string
    runId: string
    stepIndex: number
    riskLevel: 'low' | 'medium' | 'high'
    policyVersion: string   // version of the policy that was applied
    modelName: string
    approvalStatus?: 'pending' | 'approved' | 'rejected'
    approvedBy?: string
  }
}

// Audit log store interface
interface AuditStore {
  // Immutable storage (append-only)
  append(record: AuditRecord): Promise<void>

  // Queries (regulatory response, incident investigation)
  query(filter: {
    userId?: string
    agentId?: string
    action?: string
    timeRange?: { from: string; to: string }
    riskLevel?: string
  }): Promise<AuditRecord[]>

  // Retention policy
  retentionPolicy: {
    minRetentionDays: number  // determined by regulatory requirements
    immutable: boolean        // whether records are tamper-proof
  }
}
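To make the append-only idea concrete, here is a small in-memory sketch. It is an illustration rather than a production store: `InMemoryAuditStore` and `MiniAuditRecord` are hypothetical, the record is cut down to three fields, and real immutability would come from the storage backend (for example write-once storage), not from process memory.

```typescript
interface MiniAuditRecord {
  agentId: string
  action: string
  timestamp: string // ISO 8601
}

class InMemoryAuditStore {
  private records: ReadonlyArray<Readonly<MiniAuditRecord>> = []

  // Only append is exposed; there is no update or delete.
  append(record: MiniAuditRecord): void {
    // Store a frozen copy so later changes to the caller's object cannot alter history.
    this.records = [...this.records, Object.freeze({ ...record })]
  }

  // Returns copies, so callers cannot modify stored records through the result.
  query(filter: { agentId?: string; action?: string }): MiniAuditRecord[] {
    return this.records
      .filter(
        (r) =>
          (filter.agentId === undefined || r.agentId === filter.agentId) &&
          (filter.action === undefined || r.action === filter.action)
      )
      .map((r) => ({ ...r }))
  }
}
```

Copying on the way in and on the way out is what keeps stored history independent of whatever the caller does with its own objects.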

Compliance Checklist

The items an agent system must satisfy vary with the regulatory requirements of each domain.

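Finance

Item	Requirement	Implementation
Audit trail	Retain records of every decision process (at least 5-10 years)	append-only audit store + long-term archive
Explainability	Must be able to explain the basis of a decision to customers	Store the LLM reasoning chain, log the decision rationale
Access control	Role-based access control (RBAC)	AgentPermissions + approval gate
Data isolation	Isolation between customers' data	Per-tenant namespace, row-level security
Anomaly detection	Detect abnormal transactions/patterns	Cost alerts + behavioral anomaly detection
Model validation	Internal validation procedure when AI models are used	Periodic evaluation pipeline, approval for model changes

Healthcare

Item	Requirement	Implementation
PHI protection	Encrypt patient health information, restrict access	PII/PHI masking, encrypted storage, access logs
Minimum necessary	Access only the minimum information needed for the task	Restrict allowedColumns in AgentPermissions
Audit log	Record who accessed which patient information	AuditRecord + retention of 6 years or more
Consent management	No data processing without patient consent	Consent-status check middleware
AI involvement disclosure	Inform the patient that AI was involved in the judgment	Mandatory AI-involvement marker in the output
Emergency access	Break-glass procedure for emergencies	Mandatory after-the-fact audit following emergency access

Personal data

Item	Requirement	Implementation
Data minimization	Collect only the minimum personal data needed for the purpose	Filter unnecessary PII at the input guardrail
Purpose limitation	No processing beyond the purpose of collection	AgentPermissions + purpose-based access control
Secure disposal	Destroy data once the retention period has passed	TTL-based automatic deletion, evidence of disposal
Consent-based processing	Verify the data subject's consent	Consent-status check middleware
Third-party transfer limits	Protect personal data when sending it to an external LLM	Send only after PII masking; design so masking cannot be reversed
Access/deletion requests	Respond to data subjects' access and deletion requests	Extract processing history from audit logs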

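Priorities by Regulation

In real deployments, consulting a legal expert in the relevant domain is essential. The checklists above indicate a direction for technical implementation; they are not legally binding guidance.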

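Scaling Strategy

Bottleneck	Response
Routing step overload	Add a rule-based pre-filter
External API rate limit	Queuing, backpressure, batching
Rising evaluator cost	Separate sampled evaluation from offline eval
Backlog of human approvals	Split approval tiers by risk

Rollout Strategy

Run in shadow mode to compare side by side with the existing manual process.
Open only low-risk use cases as a canary.
Widen the scope based on the human override rate and the incident rate.
Keep approval on high-risk writes until the very end.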

Minimal Operations Skeleton

type RunBudget = {
  maxSteps: number
  maxCostUsd: number
}

async function runProductionFlow(input: string, budget: RunBudget) {
  const trace = startTrace()
  const route = await routeModel(input)

  const result = await executeWithBudget(route, budget, trace.id)

  if (result.riskLevel === 'high') {
    return requestHumanApproval(result, trace.id)
  }

  return finalizeRun(result, trace.id)
}

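With even this much of a control surface in place, it becomes much easier to treat the code as "an operable system" rather than "model-calling code."

SLO Examples

Metric	Example target
P95 latency	15 seconds or less
task success rate	90% or higher
human override rate	Track the trend per use case
write action incident	Keep close to 0
cost per task	Keep within budget

SLO Definition Template

Unlike a single API, a multi-agent system's SLO is determined by the combined behavior of several agents. Use the template below as a baseline and adjust it to the characteristics of your system.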

interface AgentSystemSLO {
  // Latency SLO
  latency: {
    p50TargetMs: number  // typical experience quality
    p95TargetMs: number  // worst acceptable range
    p99TargetMs: number  // threshold for watching extremes
    timeoutMs: number    // force-terminate beyond this time
  }

  // Success-rate SLO
  successRate: {
    targetPercent: number       // target over the counting window (e.g. 95%)
    errorBudgetPercent: number  // allowed failure ratio (= 100 - target)
    countingWindow: 'rolling_7d' | 'calendar_month'
    excludeFromCount: string[]  // exclusion conditions such as user input errors
  }

  // Cost SLO
  cost: {
    maxPerRequestUsd: number    // cap per request
    maxDailyUsd: number         // daily total cap
    p95PerRequestUsd: number    // cost target at P95
  }

  // Quality SLO (optional)
  quality?: {
    minAccuracyPercent: number  // accuracy (requires an evaluation pipeline)
    maxHallucinationRate: number // upper bound on the hallucination rate
    humanOverrideRateTarget: number // target rate of human intervention
  }
}

// Example: SLO for a customer support agent system
const supportAgentSLO: AgentSystemSLO = {
  latency: {
    p50TargetMs: 5_000,
    p95TargetMs: 15_000,
    p99TargetMs: 30_000,
    timeoutMs: 60_000,
  },
  successRate: {
    targetPercent: 95,
    errorBudgetPercent: 5,
    countingWindow: 'rolling_7d',
    excludeFromCount: ['invalid_input', 'rate_limited'],
  },
  cost: {
    maxPerRequestUsd: 0.50,
    maxDailyUsd: 500,
    p95PerRequestUsd: 0.30,
  },
  quality: {
    minAccuracyPercent: 90,
    maxHallucinationRate: 0.02,
    humanOverrideRateTarget: 0.10,
  },
}

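Automated Response to SLO Breaches

If SLOs are defined but the response relies on people, incidents grow overnight and on weekends. A response mechanism that activates automatically when a breach is detected is needed.

// Automated SLO response policy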
interface SLOBreachPolicy {
  sloType: 'latency' | 'cost' | 'successRate'
  severity: 'warning' | 'critical'
  actions: SLOAction[]
}

type SLOAction =
  | { type: 'circuit_breaker'; agentId: string; cooldownSec: number }
  | { type: 'model_downgrade'; from: string; to: string }
  | { type: 'disable_optional_agents'; agentIds: string[] }
  | { type: 'rate_limit'; maxRequestsPerMin: number }
  | { type: 'escalate'; channel: 'slack' | 'pagerduty' }

const breachPolicies: SLOBreachPolicy[] = [
  // Latency SLO warning: switch to a lighter model
  {
    sloType: 'latency',
    severity: 'warning',
    actions: [
      { type: 'model_downgrade', from: 'claude-sonnet', to: 'claude-haiku' },
    ],
  },
  // Latency SLO critical: circuit breaker + escalation
  {
    sloType: 'latency',
    severity: 'critical',
    actions: [
      {
        type: 'circuit_breaker',
        agentId: 'slow-agent',
        cooldownSec: 300,
      },
      { type: 'escalate', channel: 'pagerduty' },
    ],
  },
  // Cost SLO warning: disable optional agents
  {
    sloType: 'cost',
    severity: 'warning',
    actions: [
      {
        type: 'disable_optional_agents',
        agentIds: ['quality-checker', 'summarizer'],
      },
    ],
  },
  // Cost SLO critical: rate limit + model downgrade
  {
    sloType: 'cost',
    severity: 'critical',
    actions: [
      { type: 'rate_limit', maxRequestsPerMin: 10 },
      { type: 'model_downgrade', from: 'claude-sonnet', to: 'claude-haiku' },
      { type: 'escalate', channel: 'slack' },
    ],
  },
  // Success-rate SLO critical: escalate to a human
  {
    sloType: 'successRate',
    severity: 'critical',
    actions: [
      { type: 'escalate', channel: 'pagerduty' },
    ],
  },
]
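The `circuit_breaker` action with its cooldown can be sketched as a small state holder. This is one common shape of the pattern under stated assumptions, not a specific library's API: `CircuitBreaker` here is hypothetical, it opens after a fixed number of consecutive failures, and it lets calls through again once the cooldown has elapsed.

```typescript
class CircuitBreaker {
  private failures = 0
  private openedAtMs: number | null = null
  private readonly failureThreshold: number
  private readonly cooldownMs: number

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold
    this.cooldownMs = cooldownMs
  }

  // True when calls may proceed: the breaker is closed,
  // or it is open but the cooldown has elapsed (a probe call is allowed).
  canCall(nowMs: number): boolean {
    if (this.openedAtMs === null) return true
    return nowMs - this.openedAtMs >= this.cooldownMs
  }

  // A success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0
    this.openedAtMs = null
  }

  // Consecutive failures at or above the threshold open (or re-open) the breaker.
  recordFailure(nowMs: number): void {
    this.failures += 1
    if (this.failures >= this.failureThreshold) this.openedAtMs = nowMs
  }
}
```

An orchestrator would call `canCall` before delegating to an agent and skip or substitute the agent while the breaker is open; a 300-second cooldown corresponds to `cooldownMs` of 300,000.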

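Freeze When the Error Budget Is Exhausted

Once the error budget is exhausted, freeze new feature deployments and focus on stability improvements. As in Google's SRE principles, running an error budget policy for an agent system makes it possible to manage the balance between "adding features vs. stability" objectively.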
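The freeze decision comes down to simple arithmetic on the counting window. A minimal sketch, assuming the budget is counted in failed requests and that `targetPercent` is the success-rate target; `errorBudgetStatus` is a hypothetical helper.

```typescript
// total: requests in the counting window; failed: failed requests in that window.
function errorBudgetStatus(total: number, failed: number, targetPercent: number) {
  // Multiply before dividing to avoid rounding noise from 1 - target / 100.
  const allowedFailures = (total * (100 - targetPercent)) / 100
  const remaining = allowedFailures - failed
  return {
    allowedFailures,
    remaining,
    freezeDeploys: remaining <= 0,
  }
}
```

With a 95% target over 10,000 requests the budget is 500 failures: at 200 failures there are 300 left and deployments continue, while at 500 or more the budget is spent and deployments freeze.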

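Anti-patterns

Anti-pattern	Problem	Improvement
Keeping only logs, no traces	Hard to reconstruct a full run	Link request/run spans
Fully automating even high-risk actions	Accountability and recovery are hard after an incident	Keep the approval gate
Checking cost only at month end	Runaway loops are discovered late	Apply per-request budgets
Full cutover without a rollout	Regression risk	Operate with shadow/canary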

ADR-Style Conclusion

Decision

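Keep the defaults for production operation conservative. Adopt per-step model routing, end-to-end traces, action guardrails, per-request cost limits, and shadow/canary-based rollout as the operating standard, and place high-risk writes behind human approval.

Practical Checklist

Does the correlation id carry through from the request to tool calls?
Is trace context propagated between parent and child agents?
Are the OpenTelemetry GenAI Semantic Conventions followed?
Is a trace sampling strategy defined (100% collection for errors and high cost)?
Are there criteria for choosing a model per step?
Are per-agent token budgets and cost caps allocated?
Is prompt caching applied to optimize cost?
Are cost alert thresholds set at multiple levels?
Are a cost limit and max steps set?
Are SLOs defined along three axes: latency, success rate, and cost?
Are automated responses to SLO breaches (circuit breaker, model downgrade) configured?
Is the principle of least privilege applied per agent?
Are audit logs stored immutably in a who/what/when/context structure?
Have domain-specific compliance requirements been checked?
Is there an approval gate on high-risk writes?
Are canary and rollback paths defined?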
