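Routing and Dispatch

Classifier-based routing, semantic routing, fallback chains, and dynamic agent selection

The first point of failure in orchestration is usually bad routing. If the system decides wrongly what to send to whom, even a carefully designed agent downstream adds little value.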

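Comparing routing approaches

Approach	Best suited for	Strengths	Caveats
Deterministic rules	Explicit, well-defined conditions	Highly explainable	Rules can proliferate
Classifier routing	Varied input types	Flexible and fast	Classification drift
Semantic routing	Finding the right task by semantic similarity	Robust to new phrasings	Decisions can be hard to explain
Policy engine	Permissions, cost, and SLA constraints all matter	Strong operational control	Higher design and maintenance cost

In practice, the usual recommendation is the order rules -> classifier -> semantic fallback. Delegating everything to an LLM from the start hurts both cost and explainability.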

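Recommended dispatch pipeline

The point of this pipeline (rules first, then the classifier, then semantic fallback) is not "one magical classification" but placing low-cost filters at the front.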
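The ordering can be sketched as a chain of stages that runs cheapest first and stops at the first stage that produces a target. This is a minimal sketch under stated assumptions: the `Stage` type, `runStages`, and the stub stages are hypothetical names introduced for illustration, not part of any library.

```typescript
// A stage either resolves the request to a target or passes by returning null.
// Stages are synchronous here to keep the sketch short; real classifier or
// embedding stages would typically be async.
type Stage = {
  name: string
  route: (input: string) => string | null
}

type DispatchDecision = { target: string; decidedBy: string }

// Runs stages in order and stops at the first one that returns a target.
function runStages(
  input: string,
  stages: Stage[],
  fallbackTarget = 'human-triage',
): DispatchDecision {
  for (const stage of stages) {
    const target = stage.route(input)
    if (target !== null) return { target, decidedBy: stage.name }
  }
  return { target: fallbackTarget, decidedBy: 'fallback' }
}

// Hypothetical stub stages, ordered cheapest first.
const exampleStages: Stage[] = [
  { name: 'rules', route: (input) => (/delete my account/i.test(input) ? 'policy-gate' : null) },
  { name: 'classifier', route: (input) => (input.toLowerCase().includes('refund') ? 'refund-agent' : null) },
  { name: 'semantic', route: () => null }, // stands in for "no route above threshold"
]
```

Recording `decidedBy` alongside the target keeps the decision explainable: a log line shows which stage made the call, and how often the expensive stages are reached at all.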

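Dynamic agent selection

When selecting an agent dynamically, a capability registry works better than free text.

Field	Description
agent_id	Unique identifier
capabilities	Tasks the agent can perform
allowed_tools	Set of tools the agent may access
cost_profile	Approximate cost and latency characteristics
locale / domain	Supported languages and domains
status	Available, under maintenance, or deprecated

The router consults this registry to narrow the candidates, then selects the final agent.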
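One way to express "narrow the candidates, then select" is to filter on hard constraints first and only then rank by cost. The sketch below assumes a concrete shape for the registry fields listed above; the `cost_profile` structure, the `AgentRequirement` type, and the example agents are illustrative assumptions.

```typescript
type AgentStatus = 'available' | 'maintenance' | 'deprecated'

interface AgentEntry {
  agent_id: string
  capabilities: string[]
  allowed_tools: string[]
  cost_profile: { relativeCost: number; p50LatencyMs: number } // assumed shape
  locale: string[]
  domain: string[]
  status: AgentStatus
}

interface AgentRequirement {
  capability: string
  locale: string
  requiredTools?: string[]
}

// Filter on hard constraints first, then pick the cheapest remaining agent.
function selectAgent(registry: AgentEntry[], req: AgentRequirement): AgentEntry | null {
  const candidates = registry.filter(
    (a) =>
      a.status === 'available' &&
      a.capabilities.includes(req.capability) &&
      a.locale.includes(req.locale) &&
      (req.requiredTools ?? []).every((t) => a.allowed_tools.includes(t)),
  )
  if (candidates.length === 0) return null
  return candidates.reduce((best, a) =>
    a.cost_profile.relativeCost < best.cost_profile.relativeCost ? a : best,
  )
}

// Hypothetical registry entries, for illustration only.
const exampleRegistry: AgentEntry[] = [
  { agent_id: 'refund-legacy', capabilities: ['refund'], allowed_tools: ['orders.read'],
    cost_profile: { relativeCost: 0.5, p50LatencyMs: 300 }, locale: ['ko'], domain: ['commerce'], status: 'deprecated' },
  { agent_id: 'refund-basic', capabilities: ['refund'], allowed_tools: ['orders.read'],
    cost_profile: { relativeCost: 1, p50LatencyMs: 400 }, locale: ['ko', 'en'], domain: ['commerce'], status: 'available' },
  { agent_id: 'refund-pro', capabilities: ['refund', 'billing'], allowed_tools: ['orders.read', 'payments.write'],
    cost_profile: { relativeCost: 5, p50LatencyMs: 900 }, locale: ['ko', 'en'], domain: ['commerce'], status: 'available' },
]
```

Returning `null` when nothing matches, rather than the closest agent, leaves the caller free to send the request to a fallback or triage path.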

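Confidence and fallback

Having the model report an arbitrary confidence number is not enough on its own. It should be used together with the following:

A minimum confidence threshold
A fallback path for low confidence
A recorded ambiguity category
Conditions for human triage

For example, if billing and refund are frequently confused, it is better to send both classes to the same worker and subdivide them inside it.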
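The four criteria and the shared-worker idea can be combined in one decision function. This is a sketch, not a prescribed design: the worker names, the `minMargin` parameter, and the ambiguity categories are assumptions made for illustration.

```typescript
// Classes the classifier often confuses share one worker (assumed names).
const WORKER_FOR_CLASS: Record<string, string> = {
  billing: 'payments-worker',
  refund: 'payments-worker',
  shipping: 'shipping-worker',
}

interface Classification {
  top: { label: string; confidence: number }
  runnerUp?: { label: string; confidence: number }
}

interface FallbackDecision {
  target: string
  ambiguity: 'none' | 'low-confidence' | 'close-call' | 'unknown-class'
}

function decideWithFallback(
  c: Classification,
  minConfidence = 0.6,
  minMargin = 0.1,
): FallbackDecision {
  const worker = WORKER_FOR_CLASS[c.top.label]
  if (worker === undefined) return { target: 'human-triage', ambiguity: 'unknown-class' }
  if (c.top.confidence < minConfidence) return { target: 'human-triage', ambiguity: 'low-confidence' }

  if (c.runnerUp && c.top.confidence - c.runnerUp.confidence < minMargin) {
    // A close call between classes that share a worker is harmless:
    // the worker subdivides internally.
    if (WORKER_FOR_CLASS[c.runnerUp.label] !== worker) {
      return { target: 'generalist-review', ambiguity: 'close-call' }
    }
  }
  return { target: worker, ambiguity: 'none' }
}
```

Logging the `ambiguity` value with every decision gives the later confusion analysis something concrete to aggregate.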

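Dispatch that accounts for cost and SLA

Condition	Dispatch strategy
High-volume, low-risk requests	Low-cost router + simple worker
High-risk write requests	Deterministic policy + human approval
Requests that need long context	Retrieval, then a specialist worker agent
Latency-sensitive requests	Rules before semantic search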
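The table can be read as a small decision function over a request profile. The sketch below is one possible encoding; the `RequestProfile` fields and the precedence between rows (risk first, then latency, then context) are assumptions, since the table itself does not rank them.

```typescript
interface RequestProfile {
  risk: 'low' | 'high'
  isWrite: boolean
  needsLongContext: boolean
  latencySensitive: boolean
}

type DispatchStrategy =
  | 'policy-with-human-approval'
  | 'rules-first'
  | 'retrieval-then-specialist'
  | 'cheap-router-simple-worker'

// Precedence is an assumption: safety constraints win over latency,
// and latency wins over context handling.
function chooseStrategy(p: RequestProfile): DispatchStrategy {
  if (p.risk === 'high' && p.isWrite) return 'policy-with-human-approval'
  if (p.latencySensitive) return 'rules-first'
  if (p.needsLongContext) return 'retrieval-then-specialist'
  return 'cheap-router-simple-worker'
}
```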

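Minimal implementation skeleton

type RouteResult = {
  target: string
  confidence: number
  reason: string
}

// Assumed helpers: the names come from the skeleton, but the signatures
// below are illustrative declarations, not a real library API.
declare function matchesPolicyRule(input: string): boolean
declare function classify(input: string): Promise<RouteResult>
declare function sendTo(target: string): Promise<void>

async function dispatchRequest(input: string) {
  // Deterministic policy rules run first, before any model call.
  if (matchesPolicyRule(input)) return sendTo('policy-gate')

  const route: RouteResult = await classify(input)

  // High confidence: dispatch directly. Medium: send to a generalist review.
  if (route.confidence >= 0.85) return sendTo(route.target)
  if (route.confidence >= 0.6) return sendTo('generalist-review')

  // Low confidence always takes the safest path.
  return sendTo('human-triage')
}

The point is not adding a classifier as such, but enforcing low confidence -> safer path as a code path.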

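Ways to reduce misrouting

Rewrite class definitions in business language.
Include negative examples in the eval set.
Do not force ambiguous inputs into a class; add a needs_triage class.
Classify by the action the system should take rather than by the end user's intent.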
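The last two points can be enforced in types: name classes after actions and make `needs_triage` the only possible result for anything unrecognized. The class names below are hypothetical examples, and `parseRouteClass` is an illustrative helper rather than an existing API.

```typescript
// Classes are named after the action the system takes, not the user's topic.
const ROUTE_CLASSES = [
  'issue_refund',
  'lookup_shipment',
  'update_billing_info',
  'needs_triage',
] as const

type RouteClass = (typeof ROUTE_CLASSES)[number]

// Normalizes raw classifier output. Anything that is not a known class
// collapses to needs_triage instead of being forced into a nearby class.
function parseRouteClass(raw: string): RouteClass {
  const normalized = raw.trim().toLowerCase()
  const match = ROUTE_CLASSES.find((c) => c === normalized)
  return match ?? 'needs_triage'
}
```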

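Anti-patterns

Anti-pattern	Problem	Fix
Choosing freely based on agent names alone	Capability conflicts	Registry-based selection
Always routing to the strongest model	Cost blow-up	Two-stage routing
Forcing a classification despite low confidence	Misclassifications accumulate	`needs_triage` and human fallback
Classes defined by output topic	Does not match the actual action	Action-oriented taxonomy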

Confidence Calibration

The confidence value a classifier returns cannot be trusted at face value. When a model reports "0.92", that does not mean its actual accuracy is 92%. Setting a sound threshold requires a calibration step.
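A common way to quantify the gap between reported confidence and actual accuracy is to bin predictions by confidence and compare the two per bin. The sketch below computes equal-width bins and the expected calibration error (ECE); the type and function names are introduced here for illustration, and it assumes a labeled sample of past routing decisions is available.

```typescript
interface ScoredPrediction {
  confidence: number // value reported by the classifier, in [0, 1]
  correct: boolean   // whether the routing decision turned out to be right
}

interface CalibrationBin {
  lower: number
  upper: number
  count: number
  avgConfidence: number
  accuracy: number
}

// Groups predictions into equal-width confidence bins.
function calibrationBins(preds: ScoredPrediction[], numBins = 10): CalibrationBin[] {
  const sums = Array.from({ length: numBins }, () => ({ count: 0, conf: 0, correct: 0 }))
  for (const p of preds) {
    const idx = Math.min(numBins - 1, Math.max(0, Math.floor(p.confidence * numBins)))
    const bin = sums[idx]
    if (!bin) continue
    bin.count += 1
    bin.conf += p.confidence
    bin.correct += p.correct ? 1 : 0
  }
  return sums.map((s, i) => ({
    lower: i / numBins,
    upper: (i + 1) / numBins,
    count: s.count,
    avgConfidence: s.count ? s.conf / s.count : 0,
    accuracy: s.count ? s.correct / s.count : 0,
  }))
}

// ECE: the count-weighted average of |accuracy - avgConfidence| across bins.
function expectedCalibrationError(preds: ScoredPrediction[], numBins = 10): number {
  if (preds.length === 0) return 0
  return calibrationBins(preds, numBins).reduce(
    (acc, b) => acc + (b.count / preds.length) * Math.abs(b.accuracy - b.avgConfidence),
    0,
  )
}
```

A bin whose average confidence is 0.92 but whose accuracy is 0.5 is exactly the "0.92 does not mean 92%" case, and it contributes 0.42 to the error in proportion to its share of the sample.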

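The threshold dilemma

Threshold level	Symptom	Consequence
Too high (0.95+)	Most inputs fall through to fallback	Human triage overload, slower responses
Too low (0.5 or below)	Uncertain classifications pass through	Misclassifications accumulate, users complain
Reasonable (0.7~0.85)	Balance between precision and recall	Needs per-domain tuning

The key is to use different thresholds by domain and risk level rather than a single fixed value. For example, raise the threshold for high-risk operations such as payment cancellation, and lower it for general FAQs.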

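Choosing a threshold from the precision-recall curve

from sklearn.metrics import precision_recall_curve
import numpy as np

def find_optimal_threshold(
    y_true: list[int],
    y_scores: list[float],
    min_precision: float = 0.90,
) -> float:
    """
    Find the threshold that achieves the highest recall
    while guaranteeing a minimum precision.
    """
    precisions, recalls, thresholds = precision_recall_curve(
        y_true, y_scores
    )

    # precisions and recalls have one more element than thresholds,
    # so drop the last element to align them.
    # Among points with precision >= min_precision, pick the highest recall.
    valid = precisions[:-1] >= min_precision
    if not valid.any():
        return float(thresholds[-1])  # most conservative value

    best_idx = np.where(valid)[0][np.argmax(recalls[:-1][valid])]
    return float(thresholds[best_idx])


# Usage example
# y_true: ground-truth labels (1 = belongs to the class, 0 = does not)
# y_scores: confidence values returned by the classifier
# The values below are illustrative placeholders, not measured data.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.40, 0.20]
threshold = find_optimal_threshold(y_true, y_scores, min_precision=0.90)
print(f"Best threshold: {threshold:.3f}")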

Per-class thresholds

const THRESHOLDS: Record<string, { high: number; low: number }> = {
  'payment-cancel':  { high: 0.90, low: 0.75 },  // high risk: stricter threshold
  'general-inquiry': { high: 0.70, low: 0.50 },  // low risk: looser threshold
  'account-delete':  { high: 0.92, low: 0.80 },  // high risk
  'faq':             { high: 0.65, low: 0.40 },  // low risk
}

function routeWithCalibration(category: string, confidence: number) {
  const t = THRESHOLDS[category] ?? { high: 0.85, low: 0.60 }

  if (confidence >= t.high) return 'direct-dispatch'
  if (confidence >= t.low)  return 'generalist-review'
  return 'human-triage'
}

Caution

Calibration is not a one-off task. When the input distribution shifts, thresholds need to be re-tuned. Re-check the precision-recall curve against the eval set at least once a month.

Drift Detection

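Drift is when routing that worked well right after deployment degrades over time. As new kinds of requests arrive, users change how they phrase things, or product features are added, the existing taxonomy stops matching reality.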

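Monitoring metrics

Metric	How to measure	Alert criterion (example)
Routing accuracy	Sampled post-hoc evaluation	Weekly accuracy drops by 5 percentage points or more
Confidence distribution	Track mean, median, and p10	Mean confidence drops by 0.1 or more
Fallback rate	Share of `human-triage` or `needs_triage`	Exceeds 20% of all requests
New clusters	Clusters of inputs that fit no existing category	Unclassified cluster exceeds 5% of daily requests
Misrouting feedback	Reports from users or operators	Weekly misrouting reports trending upward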
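The confidence-distribution row asks for mean, median, and p10 over a window of requests. A minimal sketch of that computation follows; it uses the nearest-rank percentile definition, which is one of several common conventions, and the function names are illustrative.

```typescript
// Nearest-rank percentile on an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN
  const rank = Math.max(1, Math.ceil((p / 100) * sortedAsc.length))
  return sortedAsc[rank - 1] ?? NaN
}

// Summary statistics for the confidence values observed in one window.
function confidenceStats(values: number[]): { mean: number; median: number; p10: number } {
  const sorted = [...values].sort((a, b) => a - b)
  const mean = sorted.length ? sorted.reduce((sum, v) => sum + v, 0) / sorted.length : NaN
  return { mean, median: percentile(sorted, 50), p10: percentile(sorted, 10) }
}
```

Tracking p10 alongside the mean matters because drift often shows up first as a growing low-confidence tail while the average still looks healthy.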

Drift detection implementation

interface DriftMetrics {
  windowStart: Date
  windowEnd: Date
  accuracy: number
  avgConfidence: number
  fallbackRate: number
  totalRequests: number
}

function detectDrift(
  baseline: DriftMetrics,
  current: DriftMetrics,
): { drifted: boolean; reasons: string[] } {
  const reasons: string[] = []

  const accuracyDrop = baseline.accuracy - current.accuracy
  if (accuracyDrop > 0.05) {
    reasons.push(
      `accuracy dropped by ${accuracyDrop.toFixed(2)} ` +
      `(${baseline.accuracy.toFixed(2)} → ${current.accuracy.toFixed(2)})`
    )
  }

  const confidenceDrop = baseline.avgConfidence - current.avgConfidence
  if (confidenceDrop > 0.1) {
    reasons.push(
      `mean confidence dropped by ${confidenceDrop.toFixed(2)}`
    )
  }

  // Relative check against the baseline; an absolute ceiling on the
  // fallback rate can be checked separately.
  const fallbackIncrease = current.fallbackRate - baseline.fallbackRate
  if (fallbackIncrease > 0.1) {
    reasons.push(
      `fallback rate up by ${(fallbackIncrease * 100).toFixed(1)} percentage points`
    )
  }

  return { drifted: reasons.length > 0, reasons }
}

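Triggers for automated retraining

Detecting drift is not enough by itself. You also need clear criteria for when to retrain.

Trigger	Automation level	Notes
Accuracy drops by 5+ percentage points	Run the automated retraining pipeline	Deploy only after validation against the eval set
Fallback rate at 20% or more	Automatic alert + manual review	The taxonomy itself may need redesign
New cluster detected	Request labeling, then retrain	Decide whether to add a new category
Misrouting feedback accumulates	Decide in the weekly review	Analyze confusion patterns between specific class pairs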
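The trigger table maps directly onto a function from drift signals to follow-up actions. In this sketch the `DriftSignals` shape and the action names are assumptions; the thresholds mirror the example values in the table.

```typescript
interface DriftSignals {
  accuracyDrop: number // baseline minus current, as a fraction (0.05 = 5 percentage points)
  fallbackRate: number // share of requests that went to fallback, in [0, 1]
  newClusterDetected: boolean
  misrouteReportsTrendingUp: boolean
}

type RetrainAction =
  | 'auto-retrain'
  | 'alert-and-manual-review'
  | 'request-labeling'
  | 'weekly-review'

// Several triggers can fire at once, so the result is a list of actions.
function planRetraining(s: DriftSignals): RetrainAction[] {
  const actions: RetrainAction[] = []
  if (s.accuracyDrop >= 0.05) actions.push('auto-retrain')
  if (s.fallbackRate >= 0.2) actions.push('alert-and-manual-review')
  if (s.newClusterDetected) actions.push('request-labeling')
  if (s.misrouteReportsTrendingUp) actions.push('weekly-review')
  return actions
}
```

Only the accuracy trigger starts a pipeline automatically; the others deliberately stop at a human decision, as the table's notes column suggests.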

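Building a routing test set

Maintaining routing quality requires a golden test set. It acts as a safety net that catches regressions whenever routing changes.

Principles for composing the golden test set

10~20 per category: for each routing target (agent/worker), collect at least 10 and ideally 20 test cases.
Based on real user input: inputs extracted from real logs are far more effective than synthetic data.
Difficulty mix: 50% clear inputs, 30% boundary cases, 20% deliberately ambiguous inputs.
Regular refresh: once a month, add new real inputs and replace stale cases.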
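The count and difficulty-mix principles are mechanical enough to check automatically. The sketch below audits a set against them; the `GoldenCase` shape, the `tolerance` parameter, and the message format are assumptions for illustration.

```typescript
type Difficulty = 'clear' | 'boundary' | 'ambiguous'

interface GoldenCase {
  expectedTarget: string
  difficulty: Difficulty
}

// Target mix from the principles above: 50% clear, 30% boundary, 20% ambiguous.
const TARGET_MIX: Record<Difficulty, number> = { clear: 0.5, boundary: 0.3, ambiguous: 0.2 }

// Returns a list of human-readable problems; an empty list means the set passes.
function auditGoldenSet(cases: GoldenCase[], minPerTarget = 10, tolerance = 0.1): string[] {
  const problems: string[] = []

  const perTarget: Record<string, number> = {}
  for (const c of cases) {
    perTarget[c.expectedTarget] = (perTarget[c.expectedTarget] ?? 0) + 1
  }
  for (const [target, count] of Object.entries(perTarget)) {
    if (count < minPerTarget) {
      problems.push(`${target}: ${count} cases, expected at least ${minPerTarget}`)
    }
  }

  if (cases.length > 0) {
    for (const level of Object.keys(TARGET_MIX) as Difficulty[]) {
      const share = cases.filter((c) => c.difficulty === level).length / cases.length
      if (Math.abs(share - TARGET_MIX[level]) > tolerance) {
        problems.push(`${level}: share ${share.toFixed(2)}, target ${TARGET_MIX[level].toFixed(2)}`)
      }
    }
  }
  return problems
}
```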

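Strategy for including edge cases

Type	Example	Why it matters
Boundary case	"It's a refund, but a partial one" (refund vs. partial refund)	Detects confusion between adjacent categories
Multiple intents	"Track my delivery and also refund me"	Input that a single label cannot handle
Ambiguous input	"How do I do this?"	Checks fallback behavior when information is missing
Out-of-domain input	"What's the weather like today?"	Checks `needs_triage` classification
Adversarial input	"Refund me, no, not a refund, track my delivery"	Checks handling of contradictory intent
Long input	Compound request of 500+ characters	Detects token-limit and summarization errors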

Test set management code

interface RoutingTestCase {
  id: string
  input: string
  expectedTarget: string
  category: 'clear' | 'boundary' | 'ambiguous' | 'adversarial'
  addedAt: string        // ISO date
  source: 'production-log' | 'synthetic' | 'feedback'
  notes?: string
}

const goldenTestSet: RoutingTestCase[] = [
  {
    id: 'refund-001',
    input: 'Can I get a refund for a payment I made last week?',
    expectedTarget: 'refund-agent',
    category: 'clear',
    addedAt: '2026-03-01',
    source: 'production-log',
  },
  {
    id: 'boundary-001',
    input: 'Can I get a partial refund? Everything except the shipping fee',
    expectedTarget: 'refund-agent',
    category: 'boundary',
    addedAt: '2026-03-01',
    source: 'production-log',
    notes: 'boundary between partial-refund and refund',
  },
  {
    id: 'multi-001',
    input: 'Check where my delivery is, and if it does not arrive I will ask for a refund',
    expectedTarget: 'needs_triage',
    category: 'ambiguous',
    addedAt: '2026-03-05',
    source: 'feedback',
    notes: 'multiple intents: delivery tracking + conditional refund',
  },
]

async function runRoutingEval(
  router: (input: string) => Promise<{ target: string; confidence: number }>,
  testSet: RoutingTestCase[],
) {
  const results = await Promise.all(
    testSet.map(async (tc) => {
      const result = await router(tc.input)
      return {
        id: tc.id,
        category: tc.category,
        expected: tc.expectedTarget,
        actual: result.target,
        confidence: result.confidence,
        pass: result.target === tc.expectedTarget,
      }
    })
  )

  const total = results.length
  const passed = results.filter((r) => r.pass).length
  // Object.groupBy is an ES2024 addition and needs a recent runtime (e.g. Node.js 21+)
  const byCategory = Object.groupBy(results, (r) => r.category)

  console.log(`Total: ${passed}/${total} (${((passed / total) * 100).toFixed(1)}%)`)
  for (const [cat, items] of Object.entries(byCategory)) {
    const catPassed = items!.filter((r) => r.pass).length
    console.log(`  ${cat}: ${catPassed}/${items!.length}`)
  }

  // Details of failed cases
  results
    .filter((r) => !r.pass)
    .forEach((r) => {
      console.log(
        `  FAIL [${r.id}] expected=${r.expected} actual=${r.actual} ` +
        `confidence=${r.confidence.toFixed(3)}`
      )
    })

  return results
}

Tip

Keep the test set in a separate JSON file under version control. Running this eval automatically in CI on every PR that changes routing logic catches regressions early.
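A CI step built on this tip needs two small pieces: a validating parser for the stored JSON and a pass-rate gate. The sketch below shows both; the `StoredCase` shape, the function names, and the 0.95 default are assumptions, and file reading and process exit codes are left out so the snippet stays runtime-independent.

```typescript
interface StoredCase {
  id: string
  input: string
  expectedTarget: string
}

// Parses and validates the JSON test set, failing fast on malformed entries.
function parseTestSet(json: string): StoredCase[] {
  const data: unknown = JSON.parse(json)
  if (!Array.isArray(data)) throw new Error('test set must be a JSON array')
  return data.map((item, i) => {
    const c = item as Partial<StoredCase>
    if (
      typeof c.id !== 'string' ||
      typeof c.input !== 'string' ||
      typeof c.expectedTarget !== 'string'
    ) {
      throw new Error(`invalid test case at index ${i}`)
    }
    return { id: c.id, input: c.input, expectedTarget: c.expectedTarget }
  })
}

// Decides whether the eval results are good enough to let the PR through.
function checkGate(
  results: { pass: boolean }[],
  minPassRate = 0.95,
): { ok: boolean; passRate: number } {
  const passRate =
    results.length === 0 ? 0 : results.filter((r) => r.pass).length / results.length
  return { ok: passRate >= minPassRate, passRate }
}
```

In a CI script, the file contents would be passed to `parseTestSet`, the eval run on the result, and the job failed whenever `checkGate(...).ok` is false. Treating an empty result list as a failure avoids a silently skipped eval counting as a pass.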

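Choosing an embedding model for semantic routing

Semantic routing converts the input text into an embedding vector, compares it by cosine similarity against pre-registered route vectors, and picks the best-matching route. Which embedding model you use has a large effect on both routing accuracy and cost.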
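A pre-registered route vector is often built by averaging the embeddings of a route's example inputs. The sketch below shows one way to do that, normalizing each example first so that long vectors do not dominate the average; the function names are illustrative, and the example vectors are assumed to come from whichever embedding model is in use.

```typescript
// Scales a vector to unit length; a zero vector is returned unchanged.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0))
  return norm === 0 ? v.slice() : v.map((x) => x / norm)
}

// Builds one route vector as the normalized mean of its example embeddings.
function routeCentroid(exampleVectors: number[][]): number[] {
  const first = exampleVectors[0]
  if (!first) throw new Error('at least one example vector is required')

  const sum = new Array<number>(first.length).fill(0)
  for (const vec of exampleVectors) {
    if (vec.length !== first.length) throw new Error('dimension mismatch')
    const unit = l2Normalize(vec)
    for (let i = 0; i < unit.length; i++) {
      sum[i] = (sum[i] ?? 0) + (unit[i] ?? 0)
    }
  }
  // Normalizing the sum gives the same direction as the mean.
  return l2Normalize(sum)
}
```

Because cosine similarity only depends on direction, dividing the sum by the number of examples is unnecessary once the result is normalized.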

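Embedding model comparison

Model	Dimensions	Korean support	Cost (per 1M tokens)	Speed	Notes
OpenAI `text-embedding-3-large`	3072	Good	~$0.13	Fast	Supports dimension reduction (256~3072)
OpenAI `text-embedding-3-small`	1536	Good	~$0.02	Very fast	Strong performance for the cost
Cohere `embed-v4.0`	1024	Good	~$0.10	Fast	Retrieval-focused, Matryoshka support
Voyage `voyage-3-large`	1024	Fair	~$0.18	Moderate	Strong at code embeddings
`multilingual-e5-large`	1024	Excellent	Free (self-hosted)	Needs GPU	Among the top multilingual performers
`BGE-m3`	1024	Excellent	Free (self-hosted)	Needs GPU	Supports dense + sparse + ColBERT
`KR-SBERT-V40K-klueNLI-augSTS`	768	Best	Free (self-hosted)	Needs GPU	Korean-only, ranks high on STS benchmarks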

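Selection criteria

Services with a high share of Korean: consider multilingual-e5-large, BGE-m3, or a Korean-only model first
Fast prototyping: text-embedding-3-small (low cost, API available immediately)
Accuracy first: text-embedding-3-large + domain-specific fine-tuning, or BGE-m3
Self-hosting available: open-source model + GPU server (lower cost over the long run)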

Deciding the cosine similarity threshold

Once the embedding model is chosen, you must decide what level of similarity counts as a "match". This threshold differs by model and by domain, so always determine it experimentally.

import numpy as np
from sklearn.metrics import f1_score

def find_similarity_threshold(
    similarities: list[float],
    labels: list[int],  # 1 = correct match, 0 = wrong match
    candidates: np.ndarray | None = None,
) -> dict:
    """
    Compute the F1 score at a range of thresholds and
    return the best threshold.
    """
    if candidates is None:
        candidates = np.arange(0.50, 0.96, 0.01)

    sims = np.array(similarities)
    labs = np.array(labels)
    best = {'threshold': 0.0, 'f1': 0.0}

    for t in candidates:
        preds = (sims >= t).astype(int)
        f1 = f1_score(labs, preds)
        if f1 > best['f1']:
            best = {'threshold': float(t), 'f1': float(f1)}

    return best


# Usage example
# Collect the cosine similarity and correctness of each (input, route) pair from a validation set
result = find_similarity_threshold(
    similarities=[0.89, 0.72, 0.91, 0.55, 0.83, 0.61],
    labels=      [1,    0,    1,    0,    1,    0],
)
print(f"Best threshold: {result['threshold']:.2f}, F1: {result['f1']:.3f}")

A practical semantic router implementation

interface SemanticRoute {
  name: string
  description: string          // text describing the route
  examples: string[]           // example inputs that belong to this route
  embedding?: number[]         // pre-computed mean embedding
  threshold?: number           // per-route threshold
}

async function semanticRoute(
  input: string,
  routes: SemanticRoute[],
  embed: (text: string) => Promise<number[]>,
  defaultThreshold = 0.78,
): Promise<{ route: string; similarity: number } | null> {
  const inputVec = await embed(input)

  let best: { route: string; similarity: number } | null = null

  for (const r of routes) {
    if (!r.embedding) continue
    const sim = cosineSimilarity(inputVec, r.embedding)
    const threshold = r.threshold ?? defaultThreshold

    if (sim >= threshold && (!best || sim > best.similarity)) {
      best = { route: r.name, similarity: sim }
    }
  }

  return best
}

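function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector dimensions must match')
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as no similarity instead of returning NaN.
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}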

Caution

Do not set the cosine similarity threshold once and leave it fixed. Swapping the embedding model changes the similarity distribution completely. Always re-measure the threshold when the model changes.

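ADR-style conclusion

Decision

Routing is not settled with a single LLM call. It is designed as multi-stage dispatch with rule-based filters and a capability registry at the front. Ambiguous inputs are not forced into a class; fallback or human triage paths are kept explicit.

Practical checklist

Is the routing taxonomy defined in terms of system actions?
Are the responsibilities of deterministic rules and the classifier separated?
Is there a low-confidence fallback?
Does the agent registry include permission and cost information?
Is there an eval set that tracks misrouting?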
