Testing & Observability Guardrails
The final quality layer ensures continuous validation through comprehensive testing and real-time observability.
Purpose
Testing & observability guardrails ensure:
- Regression prevention through automated tests
- Performance tracking with detailed instrumentation
- Quick triage with diagnostic events
- Continuous quality through CI/CD gates
Performance & Diagnostics
Frontend Timing
Collected on the client side:
interface FrontendTiming {
// User-facing timing
requestStartTime: number;
firstEventTime: number;
ttfcMs: number; // Time to First Chunk
totalDurationMs: number;
// Detailed breakdown
contextExtractionMs: number; // From perf event
searchDurationMs: number; // From perf event
rerankTimeMs: number; // From perf event
diversityTimeMs: number; // From perf event
// Network timing
networkLatency: number; // Derived
}
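A minimal sketch of how the client can assemble this record once the backend perf event arrives; the mapping from backend field names (multiSearchMs, rerankMs, diversityMs) to the frontend names is an assumption based on the event shown in the next subsection:
function buildClientTiming(
requestStartTime: number,
firstEventTime: number,
perf: Record<string, number> // payload of the backend 'perf' event
): Partial<FrontendTiming> {
return {
requestStartTime,
firstEventTime,
ttfcMs: firstEventTime - requestStartTime,
contextExtractionMs: perf.contextExtractionMs,
searchDurationMs: perf.multiSearchMs, // assumed mapping
rerankTimeMs: perf.rerankMs, // assumed mapping
diversityTimeMs: perf.diversityMs, // assumed mapping
};
}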
Backend Perf Events
Emitted via SSE:
yield {
event: 'perf',
data: JSON.stringify({
contextExtractionMs: 900,
multiSearchMs: 200,
funnelMs: 50,
rerankMs: 150,
diversityMs: 10,
ttfcHintMs: 800
})
};
Correlation:
Frontend TTFC: 750ms
Backend hint: 800ms
→ Network latency: ~50ms (good)
Frontend TTFC: 1500ms
Backend hint: 800ms
→ Network latency: ~700ms (investigate!)
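The same arithmetic can gate an automatic check; the 200ms budget below is an assumption, not a documented threshold:
function correlateTtfc(frontendTtfcMs: number, backendHintMs: number) {
// The backend hint is approximate, so take the absolute difference.
const networkLatencyMs = Math.abs(frontendTtfcMs - backendHintMs);
return { networkLatencyMs, investigate: networkLatencyMs > 200 };
}
// correlateTtfc(750, 800) → { networkLatencyMs: 50, investigate: false }
// correlateTtfc(1500, 800) → { networkLatencyMs: 700, investigate: true }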
Log Tagging by Stage
Structured logging for quick triage:
// Context Orchestration
console.log('[CONTEXT]', {
stage: 'extraction',
duration: 900,
intent: 'product_search',
confidence: 0.95
});
// Search Pipeline
console.log('[SEARCH]', {
stage: 'multi-search',
queries: 3,
candidates: 1200,
duration: 200
});
// Response Generation
console.log('[GENERATION]', {
stage: 'streaming',
tokens: 150,
duration: 2000,
productsMentioned: 3
});
Benefit: Filter logs by [CONTEXT], [SEARCH], etc. to isolate issues
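A small helper keeps the tags consistent across call sites; the function name and stage union are illustrative:
type StageTag = 'CONTEXT' | 'SEARCH' | 'GENERATION' | 'VALIDATION';
function logStage(tag: StageTag, payload: Record<string, unknown>): void {
// One structured line per event, prefixed with a grep-able tag.
console.log(`[${tag}]`, payload);
}
logStage('SEARCH', { stage: 'multi-search', queries: 3, candidates: 1200, duration: 200 });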
Runbook-Grade Tests
Seven Phase Test Scripts
# Run all phase tests
./tests/run-all-phase-tests.sh
# Individual phases
./tests/phase-0-context.test.sh
./tests/phase-1-funnel.test.sh
./tests/phase-2-rerank.test.sh
./tests/phase-3-diversity.test.sh
./tests/phase-4-persistence.test.sh
./tests/phase-5-cultural-rules.test.sh
./tests/phase-6-embeddings.test.sh
Each script proves:
- Phase input/output contract
- Performance within bounds
- Error handling works
- Integration with the next phase (a sample assertion is sketched below)
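For illustration, one such contract check in a Jest-style suite; rerank is a hypothetical stand-in for the phase entry point, and the 300ms bound is assumed:
describe('phase-2 rerank', () => {
it('meets its performance bound and preserves the input/output contract', async () => {
// Hypothetical fixture: candidate IDs standing in for real search results.
const candidates = Array.from({ length: 1200 }, (_, i) => ({ id: i }));
const start = Date.now();
const result = await rerank(candidates); // hypothetical phase entry point
expect(Date.now() - start).toBeLessThan(300); // assumed performance bound
expect(result.length).toBeLessThanOrEqual(candidates.length); // contract: never invent candidates
});
});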
Test Coverage by Phase
Coverage follows the scripts above: each of the seven phases, from context extraction (phase 0) through embeddings (phase 6), has its own script, so a full run exercises every stage of the pipeline.
CI/QA Gates
Pre-commit
# Automatic checks before commit
- Lint (ESLint)
- Type check (TypeScript)
- Unit tests
- File size check (<800 lines; a sketch of this gate follows)
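The file-size gate fits in a few lines of Node; the script name and hook wiring are assumptions:
// check-file-size.ts — fail the commit if any staged file exceeds 800 lines
import { readFileSync } from 'node:fs';
const MAX_LINES = 800;
const files = process.argv.slice(2); // file paths passed in by the pre-commit hook
const offenders = files.filter(
(f) => readFileSync(f, 'utf8').split('\n').length > MAX_LINES
);
if (offenders.length > 0) {
console.error(`Files over ${MAX_LINES} lines:`, offenders.join(', '));
process.exit(1);
}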
Pre-merge
# CI pipeline checks
- All unit tests pass
- Integration tests pass
- Coverage threshold met (>80%)
- Performance benchmarks pass
- Build succeeds
Pre-deploy
# Deployment gates
- E2E tests pass
- Load tests pass
- No critical errors in logs
- Smoke tests on staging
- Performance regression check
Quality Dashboards
Metrics Tracked
Dashboards track TTFC percentiles (p50/p95/p99), hallucination rate, intent accuracy, error rate, timeout rate, and success rate; see the exportMetrics snippet under Monitoring Integration.
Alert Thresholds
const DASHBOARD_ALERTS = {
// Performance
TTFC_P95: {
warning: 1200, // >1.2s
critical: 2000 // >2s
},
// Quality
HALLUCINATION_RATE: {
warning: 0.02, // >2%
critical: 0.05 // >5%
},
// Reliability
ERROR_RATE: {
warning: 0.02, // >2%
critical: 0.05 // >5%
},
TIMEOUT_RATE: {
warning: 0.02,
critical: 0.05
}
};
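A tiny classifier turns these threshold pairs into severities; the function is illustrative:
type Severity = 'ok' | 'warning' | 'critical';
function classify(value: number, t: { warning: number; critical: number }): Severity {
if (value > t.critical) return 'critical';
if (value > t.warning) return 'warning';
return 'ok';
}
// classify(0.03, DASHBOARD_ALERTS.ERROR_RATE) → 'warning'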
Diagnostic Events
Event Structure
interface DiagnosticEvent {
timestamp: number,
stage: string,
duration: number,
success: boolean,
metadata: Record<string, any>
}
Collection Points
// Emit at every major step
diagnostics.emit({
stage: 'context-extraction',
duration: 900,
success: true,
metadata: {
intent: 'product_search',
confidence: 0.95,
productType: 'Raamat'
}
});
diagnostics.emit({
stage: 'search-execution',
duration: 1200,
success: true,
metadata: {
queries: 3,
candidates: 1200,
finalists: 20
}
});
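A minimal emitter satisfying the interface above; this sketch only mirrors events into the stage-tagged log stream, whereas a real implementation would also ship them to the metrics backend:
const diagnostics = {
emit(e: Omit<DiagnosticEvent, 'timestamp'>): void {
const event: DiagnosticEvent = { timestamp: Date.now(), ...e };
// Mirror into the structured log so grep-based triage keeps working.
console.log(`[${event.stage.toUpperCase()}]`, event);
},
};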
Triage Speed
With proper tagging:
# Find slow context extractions (2000ms and up)
grep '\[CONTEXT\]' logs | grep -E 'duration.*[2-9][0-9]{3}'
# Find search failures
grep '\[SEARCH\]' logs | grep 'success: false'
# Find hallucinations
grep '\[VALIDATION\]' logs | grep 'severity: high'
Result: Pinpoint failures in seconds, not hours
Testing Best Practices
Do's
- Test one thing per test
- Use descriptive test names
- Mock external dependencies
- Test edge cases
- Measure performance
- Clean up test data
Don'ts
- Don't test implementation details
- Don't skip error cases
- Don't ignore flaky tests
- Don't hardcode test data
- Don't skip cleanup
Known Gaps & Watch-outs
Frontend Exclude Staleness
Issue: Exclude IDs from the current streaming turn are not appended before the next request is sent.
Impact: Backend dedup relies on stored context, so duplicates can slip through within a live turn chain.
Mitigation: Backend merge ensures no duplicates across requests.
Validator Language Bias
Issue: Hallucination checks are Estonian-heavy; English patterns are less comprehensive.
Impact: English-only flows may be over-flagged.
Mitigation: Expand English pattern library progressively.
Fallback Transparency
Issue: Broadened-category fallbacks rely on LLM to surface meta flags.
Impact: Users may not know pool is depleted.
Mitigation: Ensure UI copy keeps users aware when fallbacks trigger.
How to Extend Safely
Adding New Phases
// 1. Add perf event
yield {
event: 'perf',
data: JSON.stringify({
yourNewPhaseMs: duration
})
};
// 2. Add diagnostic logging
console.log('[YOUR_PHASE]', {
stage: 'execution',
duration,
result
});
// 3. Add tests
describe('Your New Phase', () => {
// Comprehensive test suite
});
// 4. Add monitoring
trackMetric('yourPhase.duration', duration);
Result: Regressions are traceable
Expanding Hallucination Allow-list
// GOOD: Add domain-specific nouns
const ALLOWED_NOUNS = new Set([
'hobbit', // Character name
'gondor', // Place name
'minas tirith' // Location
]);
// BAD: Loosen thresholds
const THRESHOLD = 0.2; // Too permissive!
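For context, a hedged sketch of how the allow-list might be consulted during validation; the function name is hypothetical:
function isAllowedNoun(noun: string): boolean {
// Normalize before lookup so 'Gondor' and 'gondor' both pass.
return ALLOWED_NOUNS.has(noun.trim().toLowerCase());
}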
Tweaking Funnel Rules
# 1. Modify funnel logic
# 2. Re-run full phase suite
./tests/run-all-phase-tests.sh
# 3. Watch perf pills in UI
# 4. Verify TTFC and pool sizes
Monitoring Integration
Metrics Export
// Export to monitoring service
function exportMetrics() {
return {
timestamp: Date.now(),
// Performance
ttfc_p50: percentile(ttfcSamples, 50),
ttfc_p95: percentile(ttfcSamples, 95),
ttfc_p99: percentile(ttfcSamples, 99),
// Quality
hallucination_rate: hallucinationCount / totalRequests,
intent_accuracy: correctIntents / totalIntents,
// Reliability
error_rate: errorCount / totalRequests,
timeout_rate: timeoutCount / totalRequests,
success_rate: successCount / totalRequests
};
}
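The percentile helper is not shown elsewhere; a minimal nearest-rank sketch, assuming a non-empty sample array:
function percentile(samples: number[], p: number): number {
// Nearest-rank percentile over a sorted copy.
const sorted = [...samples].sort((a, b) => a - b);
const rank = Math.ceil((p / 100) * sorted.length) - 1;
return sorted[Math.max(0, rank)];
}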
Alert Integration
// Trigger alerts on threshold breach
if (metrics.ttfc_p95 > THRESHOLDS.TTFC_P95) {
alert({
type: 'performance',
severity: 'warning',
metric: 'ttfc_p95',
value: metrics.ttfc_p95,
threshold: THRESHOLDS.TTFC_P95,
action: 'Investigate backend latency'
});
}
Related Documentation
- Overview - Quality philosophy
- Frontend Guardrails - Phase 1
- Context Guardrails - Phase 2
- Search Guardrails - Phase 3
- Response Guardrails - Phase 4
- Validation Guardrails - Phase 5