Testing & Observability Guardrails

The final quality layer ensures continuous validation through comprehensive testing and real-time observability.

Purpose

Testing & observability guardrails ensure:

  • Regression prevention through automated tests
  • Performance tracking with detailed instrumentation
  • Quick triage with diagnostic events
  • Continuous quality through CI/CD gates

Performance & Diagnostics

Frontend Timing

Collected on the client side:

```typescript
{
  // User-facing timing
  requestStartTime: number,
  firstEventTime: number,
  ttfcMs: number,              // Time to First Chunk
  totalDurationMs: number,

  // Detailed breakdown
  contextExtractionMs: number, // From perf event
  searchDurationMs: number,    // From perf event
  rerankTimeMs: number,        // From perf event
  diversityTimeMs: number,     // From perf event

  // Network timing
  networkLatency: number       // Derived
}
```

Backend Perf Events

Emitted via SSE:

```typescript
yield {
  event: 'perf',
  data: JSON.stringify({
    contextExtractionMs: 900,
    multiSearchMs: 200,
    funnelMs: 50,
    rerankMs: 150,
    diversityMs: 10,
    ttfcHintMs: 800
  })
};
```

Correlation:

```text
Frontend TTFC: 750ms
Backend hint:  800ms
→ Network latency: ~50ms (good)

Frontend TTFC: 1500ms
Backend hint:  800ms
→ Network latency: ~700ms (investigate!)
```
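This correlation is easy to automate. A minimal sketch, assuming the frontend TTFC and backend hint from the perf event above; the 500 ms "investigate" cutoff is an illustrative assumption chosen to match the examples, not a documented threshold:

```typescript
// Derive approximate network latency by comparing the frontend-measured
// TTFC with the backend's ttfcHintMs from the perf event.
interface TtfcCorrelation {
  networkLatencyMs: number;
  verdict: 'good' | 'investigate';
}

function correlateTtfc(frontendTtfcMs: number, backendHintMs: number): TtfcCorrelation {
  // The absolute difference approximates the network cost between the two measurements.
  const networkLatencyMs = Math.abs(frontendTtfcMs - backendHintMs);
  return {
    networkLatencyMs,
    // 500 ms is an assumed cutoff, not a documented value
    verdict: networkLatencyMs <= 500 ? 'good' : 'investigate'
  };
}
```

For example, `correlateTtfc(750, 800)` yields a ~50 ms latency and a `'good'` verdict, while `correlateTtfc(1500, 800)` flags ~700 ms for investigation.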

Log Tagging by Stage

Structured logging for quick triage:

```typescript
// Context Orchestration
console.log('[CONTEXT]', {
  stage: 'extraction',
  duration: 900,
  intent: 'product_search',
  confidence: 0.95
});

// Search Pipeline
console.log('[SEARCH]', {
  stage: 'multi-search',
  queries: 3,
  candidates: 1200,
  duration: 200
});

// Response Generation
console.log('[GENERATION]', {
  stage: 'streaming',
  tokens: 150,
  duration: 2000,
  productsMentioned: 3
});
```

Benefit: Filter logs by [CONTEXT], [SEARCH], etc. to isolate issues

Runbook-Grade Tests

Seven Phase Test Scripts

```shell
# Run all phase tests
./tests/run-all-phase-tests.sh

# Individual phases
./tests/phase-0-context.test.sh
./tests/phase-1-funnel.test.sh
./tests/phase-2-rerank.test.sh
./tests/phase-3-diversity.test.sh
./tests/phase-4-persistence.test.sh
./tests/phase-5-cultural-rules.test.sh
./tests/phase-6-embeddings.test.sh
```

Each script proves:

  • Phase input/output contract
  • Performance within bounds
  • Error handling works
  • Integration with next phase
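A phase contract check of this kind can be expressed as plain assertions. A hypothetical sketch for a rerank-style phase; the function and field names here are illustrative, not the project's real API:

```typescript
// Hypothetical contract check for a rerank phase: output must come from the
// input pool, must not grow it, and must land within a performance budget.
interface PhaseResult<T> {
  items: T[];
  durationMs: number;
}

function assertRerankContract(
  input: string[],
  result: PhaseResult<string>,
  budgetMs: number
): void {
  // Input/output contract: no invented items
  for (const item of result.items) {
    if (!input.includes(item)) throw new Error(`unexpected item: ${item}`);
  }
  // Rerank must not grow the candidate pool
  if (result.items.length > input.length) throw new Error('pool grew during rerank');
  // Performance within bounds
  if (result.durationMs > budgetMs) throw new Error(`over budget: ${result.durationMs}ms`);
}
```

A passing run returns silently; any contract or budget violation throws, which is what lets the phase scripts fail fast.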

Test Coverage by Phase

CI/QA Gates

Pre-commit

```yaml
# Automatic checks before commit
- Lint (ESLint)
- Type check (TypeScript)
- Unit tests
- File size check (<800 lines)
```

Pre-merge

```yaml
# CI pipeline checks
- All unit tests pass
- Integration tests pass
- Coverage threshold met (>80%)
- Performance benchmarks pass
- Build succeeds
```

Pre-deploy

```yaml
# Deployment gates
- E2E tests pass
- Load tests pass
- No critical errors in logs
- Smoke tests on staging
- Performance regression check
```

Quality Dashboards

Metrics Tracked

Alert Thresholds

```typescript
const DASHBOARD_ALERTS = {
  // Performance
  TTFC_P95: {
    warning: 1200,  // >1.2s
    critical: 2000  // >2s
  },

  // Quality
  HALLUCINATION_RATE: {
    warning: 0.02,  // >2%
    critical: 0.05  // >5%
  },

  // Reliability
  ERROR_RATE: {
    warning: 0.02,  // >2%
    critical: 0.05  // >5%
  },

  TIMEOUT_RATE: {
    warning: 0.02,
    critical: 0.05
  }
};
```
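A metric sample can be graded against these `{ warning, critical }` pairs with a small helper. A minimal sketch matching the shape above (the `evaluate` name is an assumption):

```typescript
// Grade a metric sample against a { warning, critical } threshold pair,
// matching the DASHBOARD_ALERTS shape.
type Severity = 'ok' | 'warning' | 'critical';

function evaluate(value: number, t: { warning: number; critical: number }): Severity {
  if (value > t.critical) return 'critical';
  if (value > t.warning) return 'warning';
  return 'ok';
}
```

For example, a TTFC p95 of 1500 ms evaluates to `'warning'` against the `TTFC_P95` pair, and 2500 ms to `'critical'`.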

Diagnostic Events

Event Structure

```typescript
interface DiagnosticEvent {
  timestamp: number;
  stage: string;
  duration: number;
  success: boolean;
  metadata: Record<string, any>;
}
```
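A minimal in-memory emitter satisfying this interface might look like the following sketch; the real `diagnostics` object presumably ships events to a sink, and the class and method names here are illustrative:

```typescript
// Interface repeated so the sketch is self-contained.
interface DiagnosticEvent {
  timestamp: number;
  stage: string;
  duration: number;
  success: boolean;
  metadata: Record<string, unknown>;
}

// Minimal in-memory implementation; stamps each event on emit and
// supports filtering by stage for triage.
class Diagnostics {
  private events: DiagnosticEvent[] = [];

  emit(e: Omit<DiagnosticEvent, 'timestamp'>): void {
    this.events.push({ timestamp: Date.now(), ...e });
  }

  byStage(stage: string): DiagnosticEvent[] {
    return this.events.filter((ev) => ev.stage === stage);
  }
}
```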

Collection Points

```typescript
// Emit at every major step
diagnostics.emit({
  stage: 'context-extraction',
  duration: 900,
  success: true,
  metadata: {
    intent: 'product_search',
    confidence: 0.95,
    productType: 'Raamat'
  }
});

diagnostics.emit({
  stage: 'search-execution',
  duration: 1200,
  success: true,
  metadata: {
    queries: 3,
    candidates: 1200,
    finalists: 20
  }
});
```

Triage Speed

With proper tagging:

```shell
# Find slow context extractions (durations of 2000 ms or more)
grep '\[CONTEXT\]' logs | grep -E 'duration.*[2-9][0-9]{3}'

# Find search failures
grep '\[SEARCH\]' logs | grep 'success: false'

# Find hallucinations
grep '\[VALIDATION\]' logs | grep 'severity: high'
```

Result: pinpoint failures in seconds, not hours

Testing Best Practices

Do's

  • Test one thing per test
  • Use descriptive test names
  • Mock external dependencies
  • Test edge cases
  • Measure performance
  • Clean up test data

Don'ts

  • Don't test implementation details
  • Don't skip error cases
  • Don't ignore flaky tests
  • Don't hardcode test data
  • Don't skip cleanup

Known Gaps & Watch-outs

Frontend Exclude Staleness

Issue: Exclude IDs from the current streaming turn are not appended before the next request fires.

Impact: Backend dedup relies on stored context, so duplicates can slip through within a live turn chain.

Mitigation: Backend merge ensures no duplicates across requests.

Validator Language Bias

Issue: Hallucination checks are Estonian-heavy; English patterns less comprehensive.

Impact: English-only flows may be over-flagged.

Mitigation: Expand English pattern library progressively.

Fallback Transparency

Issue: Broadened-category fallbacks rely on LLM to surface meta flags.

Impact: Users may not know pool is depleted.

Mitigation: Ensure UI copy keeps users aware when fallbacks trigger.

How to Extend Safely

Adding New Phases

```typescript
// 1. Add perf event
yield {
  event: 'perf',
  data: JSON.stringify({
    yourNewPhaseMs: duration
  })
};

// 2. Add diagnostic logging
console.log('[YOUR_PHASE]', {
  stage: 'execution',
  duration,
  result
});

// 3. Add tests
describe('Your New Phase', () => {
  // Comprehensive test suite
});

// 4. Add monitoring
trackMetric('yourPhase.duration', duration);
```

Result: Regressions are traceable

Expanding Hallucination Allow-list

```typescript
// GOOD: Add domain-specific nouns
const ALLOWED_NOUNS = new Set([
  'hobbit',       // Character name
  'gondor',       // Place name
  'minas tirith'  // Location
]);

// BAD: Loosen thresholds
const THRESHOLD = 0.2; // Too permissive!
```
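One hypothetical shape for how the allow-list might gate flagging: a noun is only treated as a possible hallucination when it appears in neither the allow-list nor the retrieved product data. The function name and the product-text check are illustrative assumptions, not the validator's real logic:

```typescript
// Allow-listed domain nouns (character/place names, etc.)
const ALLOWED_NOUNS = new Set(['hobbit', 'gondor', 'minas tirith']);

// A noun is suspect only if it is absent from BOTH the allow-list
// and the retrieved product text (illustrative sketch).
function isSuspectNoun(noun: string, productText: string): boolean {
  const n = noun.toLowerCase();
  return !ALLOWED_NOUNS.has(n) && !productText.toLowerCase().includes(n);
}
```

This keeps the threshold strict while still letting known-good domain vocabulary through, which is the point of growing the allow-list instead of loosening thresholds.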

Tweaking Funnel Rules

```shell
# 1. Modify funnel logic
# 2. Re-run full phase suite
./tests/run-all-phase-tests.sh

# 3. Watch perf pills in UI
# 4. Verify TTFC and pool sizes
```

Monitoring Integration

Metrics Export

```typescript
// Export to monitoring service
function exportMetrics() {
  return {
    timestamp: Date.now(),

    // Performance
    ttfc_p50: percentile(ttfcSamples, 50),
    ttfc_p95: percentile(ttfcSamples, 95),
    ttfc_p99: percentile(ttfcSamples, 99),

    // Quality
    hallucination_rate: hallucinationCount / totalRequests,
    intent_accuracy: correctIntents / totalIntents,

    // Reliability
    error_rate: errorCount / totalRequests,
    timeout_rate: timeoutCount / totalRequests,
    success_rate: successCount / totalRequests
  };
}
```
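The `percentile` helper above is assumed rather than shown. A nearest-rank sketch, one of several common conventions for computing percentiles over raw samples:

```typescript
// Nearest-rank percentile over raw samples: sort ascending, then take the
// value at rank ceil(p/100 * n). Returns NaN for an empty sample set.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Interpolating conventions (as used by some monitoring backends) give slightly different values for small sample sets, so the exported p95/p99 should be compared against the same convention the dashboard uses.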

Alert Integration

```typescript
// Trigger alerts on threshold breach. Compare against the warning level of
// the { warning, critical } pair, not the pair object itself.
if (metrics.ttfc_p95 > DASHBOARD_ALERTS.TTFC_P95.warning) {
  alert({
    type: 'performance',
    severity: 'warning',
    metric: 'ttfc_p95',
    value: metrics.ttfc_p95,
    threshold: DASHBOARD_ALERTS.TTFC_P95.warning,
    action: 'Investigate backend latency'
  });
}
```