Testing & Observability Guardrails
The final quality layer ensures continuous validation through comprehensive testing and real-time observability.
Purpose
Testing & observability guardrails ensure:
- Regression prevention through automated tests
- Performance tracking with detailed instrumentation
- Quick triage with diagnostic events
- Continuous quality through CI/CD gates
Performance & Diagnostics
Frontend Timing
Collected on the client side:
interface FrontendTiming {
// User-facing timing
requestStartTime: number;
firstEventTime: number;
ttfcMs: number; // Time to First Chunk
totalDurationMs: number;
// Detailed breakdown
contextExtractionMs: number; // From perf event
searchDurationMs: number; // From perf event
rerankTimeMs: number; // From perf event
diversityTimeMs: number; // From perf event
// Network timing
networkLatency: number; // Derived
}
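A minimal sketch of how the client can assemble this record once the backend perf event arrives; the mapping from backend field names (multiSearchMs, rerankMs, diversityMs) to the frontend names is an assumption based on the event shown in the next subsection:
function buildClientTiming(
requestStartTime: number,
firstEventTime: number,
perf: Record<string, number> // payload of the backend 'perf' event
): Partial<FrontendTiming> {
return {
requestStartTime,
firstEventTime,
ttfcMs: firstEventTime - requestStartTime,
contextExtractionMs: perf.contextExtractionMs,
searchDurationMs: perf.multiSearchMs, // assumed mapping
rerankTimeMs: perf.rerankMs, // assumed mapping
diversityTimeMs: perf.diversityMs, // assumed mapping
};
}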
Backend Perf Events
Emitted via SSE:
yield {
event: 'perf',
data: JSON.stringify({
contextExtractionMs: 900,
multiSearchMs: 200,
funnelMs: 50,
rerankMs: 150,
diversityMs: 10,
ttfcHintMs: 800
})
};
Correlation:
Frontend TTFC: 750ms
Backend hint: 800ms
→ Network latency: ~50ms (good)
Frontend TTFC: 1500ms
Backend hint: 800ms
→ Network latency: ~700ms (investigate!)
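The same arithmetic can gate an automatic check; the 200ms budget below is an assumption, not a documented threshold:
function correlateTtfc(frontendTtfcMs: number, backendHintMs: number) {
// The backend hint is approximate, so take the absolute difference.
const networkLatencyMs = Math.abs(frontendTtfcMs - backendHintMs);
return { networkLatencyMs, investigate: networkLatencyMs > 200 };
}
// correlateTtfc(750, 800) → { networkLatencyMs: 50, investigate: false }
// correlateTtfc(1500, 800) → { networkLatencyMs: 700, investigate: true }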
Log Tagging by Stage
Structured logging for quick triage:
// Context Orchestration
console.log('[CONTEXT]', {
stage: 'extraction',
duration: 900,
intent: 'product_search',
confidence: 0.95
});
// Search Pipeline
console.log('[SEARCH]', {
stage: 'multi-search',
queries: 3,
candidates: 1200,
duration: 200
});
// Response Generation
console.log('[GENERATION]', {
stage: 'streaming',
tokens: 150,
duration: 2000,
productsMentioned: 3
});
Benefit: Filter logs by [CONTEXT], [SEARCH], etc. to isolate issues
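A small helper keeps the tags consistent across call sites; the function name and stage union are illustrative:
type StageTag = 'CONTEXT' | 'SEARCH' | 'GENERATION' | 'VALIDATION';
function logStage(tag: StageTag, payload: Record<string, unknown>): void {
// One structured line per event, prefixed with a grep-able tag.
console.log(`[${tag}]`, payload);
}
logStage('SEARCH', { stage: 'multi-search', queries: 3, candidates: 1200, duration: 200 });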
Runbook-Grade Tests
Seven Phase Test Scripts
# Run all phase tests
./tests/run-all-phase-tests.sh
# Individual phases
./tests/phase-0-context.test.sh
./tests/phase-1-funnel.test.sh
./tests/phase-2-rerank.test.sh
./tests/phase-3-diversity.test.sh
./tests/phase-4-persistence.test.sh
./tests/phase-5-cultural-rules.test.sh
./tests/phase-6-embeddings.test.sh
Each script proves:
- Phase input/output contract
- Performance within bounds
- Error handling works
- Integration with the next phase (a sample assertion is sketched below)
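For illustration, one such contract check in a Jest-style suite; rerank is a hypothetical stand-in for the phase entry point, and the 300ms bound is assumed:
describe('phase-2 rerank', () => {
it('meets its performance bound and preserves the input/output contract', async () => {
// Hypothetical fixture: candidate IDs standing in for real search results.
const candidates = Array.from({ length: 1200 }, (_, i) => ({ id: i }));
const start = Date.now();
const result = await rerank(candidates); // hypothetical phase entry point
expect(Date.now() - start).toBeLessThan(300); // assumed performance bound
expect(result.length).toBeLessThanOrEqual(candidates.length); // contract: never invent candidates
});
});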
Test Coverage by Phase
Coverage follows the scripts above: each of the seven phases, from context extraction (phase 0) through embeddings (phase 6), has its own script, so a full run exercises every stage of the pipeline.
CI/QA Gates
Pre-commit
# Automatic checks before commit
- Lint (ESLint)
- Type check (TypeScript)
- Unit tests
- File size check (<800 lines; a sketch of this gate follows)
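The file-size gate fits in a few lines of Node; the script name and hook wiring are assumptions:
// check-file-size.ts — fail the commit if any staged file exceeds 800 lines
import { readFileSync } from 'node:fs';
const MAX_LINES = 800;
const files = process.argv.slice(2); // file paths passed in by the pre-commit hook
const offenders = files.filter(
(f) => readFileSync(f, 'utf8').split('\n').length > MAX_LINES
);
if (offenders.length > 0) {
console.error(`Files over ${MAX_LINES} lines:`, offenders.join(', '));
process.exit(1);
}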
Pre-merge
# CI pipeline checks
- All unit tests pass
- Integration tests pass
- Coverage threshold met (>80%)
- Performance benchmarks pass
- Build succeeds
Pre-deploy
# Deployment gates
- E2E tests pass
- Load tests pass
- No critical errors in logs
- Smoke tests on staging
- Performance regression check
Quality Dashboards
Metrics Tracked
Dashboards track TTFC percentiles (p50/p95/p99), hallucination rate, intent accuracy, error rate, timeout rate, and success rate; see the exportMetrics snippet under Monitoring Integration.
Alert Thresholds
const DASHBOARD_ALERTS = {
// Performance
TTFC_P95: {
warning: 1200, // >1.2s
critical: 2000 // >2s
},
// Quality
HALLUCINATION_RATE: {
warning: 0.02, // >2%
critical: 0.05 // >5%
},
// Reliability
ERROR_RATE: {
warning: 0.02, // >2%
critical: 0.05 // >5%
},
TIMEOUT_RATE: {
warning: 0.02,
critical: 0.05
}
};
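A tiny classifier turns these threshold pairs into severities; the function is illustrative:
type Severity = 'ok' | 'warning' | 'critical';
function classify(value: number, t: { warning: number; critical: number }): Severity {
if (value > t.critical) return 'critical';
if (value > t.warning) return 'warning';
return 'ok';
}
// classify(0.03, DASHBOARD_ALERTS.ERROR_RATE) → 'warning'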
Diagnostic Events
Event Structure
interface DiagnosticEvent {
timestamp: number,
stage: string,
duration: number,
success: boolean,
metadata: Record<string, any>
}
Collection Points
// Emit at every major step
diagnostics.emit({
stage: 'context-extraction',
duration: 900,
success: true,
metadata: {
intent: 'product_search',
confidence: 0.95,
productType: 'Raamat'
}
});
diagnostics.emit({
stage: 'search-execution',
duration: 1200,
success: true,
metadata: {
queries: 3,
candidates: 1200,
finalists: 20
}
});
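A minimal emitter satisfying the interface above; this sketch only mirrors events into the stage-tagged log stream, whereas a real implementation would also ship them to the metrics backend:
const diagnostics = {
emit(e: Omit<DiagnosticEvent, 'timestamp'>): void {
const event: DiagnosticEvent = { timestamp: Date.now(), ...e };
// Mirror into the structured log so grep-based triage keeps working.
console.log(`[${event.stage.toUpperCase()}]`, event);
},
};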
Triage Speed
With proper tagging:
# Find slow context extractions (2000ms and up)
grep '\[CONTEXT\]' logs | grep -E 'duration.*[2-9][0-9]{3}'
# Find search failures
grep '\[SEARCH\]' logs | grep 'success: false'
# Find hallucinations
grep '\[VALIDATION\]' logs | grep 'severity: high'
Result: Pinpoint failures in seconds, not hours
Testing Best Practices
Do's
- Test one thing per test
- Use descriptive test names
- Mock external dependencies
- Test edge cases
- Measure performance
- Clean up test data
Don'ts
- Don't test implementation details
- Don't skip error cases
- Don't ignore flaky tests
- Don't hardcode test data
- Don't skip cleanup
Known Gaps & Watch-outs
Frontend Exclude Staleness
Issue: Exclude IDs from the current streaming turn are not appended before the next request is sent.
Impact: Backend dedup relies on stored context, so duplicates can slip through within a live turn chain.
Mitigation: Backend merge ensures no duplicates across requests.
Validator Language Bias
Issue: Hallucination checks are Estonian-heavy; English patterns are less comprehensive.
Impact: English-only flows may be over-flagged.
Mitigation: Expand English pattern library progressively.
Fallback Transparency
Issue: Broadened-category fallbacks rely on LLM to surface meta flags.
Impact: Users may not know pool is depleted.
Mitigation: Ensure UI copy keeps users aware when fallbacks trigger.
How to Extend Safely
Adding New Phases
// 1. Add perf event
yield {
event: 'perf',
data: JSON.stringify({
yourNewPhaseMs: duration
})
};
// 2. Add diagnostic logging
console.log('[YOUR_PHASE]', {
stage: 'execution',
duration,
result
});
// 3. Add tests
describe('Your New Phase', () => {
// Comprehensive test suite
});
// 4. Add monitoring
trackMetric('yourPhase.duration', duration);
Result: Regressions are traceable
Expanding Hallucination Allow-list
// GOOD: Add domain-specific nouns
const ALLOWED_NOUNS = new Set([
'hobbit', // Character name
'gondor', // Place name
'minas tirith' // Location
]);
// BAD: Loosen thresholds
const THRESHOLD = 0.2; // Too permissive!
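For context, a hedged sketch of how the allow-list might be consulted during validation; the function name is hypothetical:
function isAllowedNoun(noun: string): boolean {
// Normalize before lookup so 'Gondor' and 'gondor' both pass.
return ALLOWED_NOUNS.has(noun.trim().toLowerCase());
}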
Tweaking Funnel Rules
# 1. Modify funnel logic
# 2. Re-run full phase suite
./tests/run-all-phase-tests.sh
# 3. Watch perf pills in UI
# 4. Verify TTFC and pool sizes
Monitoring Integration
Metrics Export
// Export to monitoring service
function exportMetrics() {
return {
timestamp: Date.now(),
// Performance
ttfc_p50: percentile(ttfcSamples, 50),
ttfc_p95: percentile(ttfcSamples, 95),
ttfc_p99: percentile(ttfcSamples, 99),
// Quality
hallucination_rate: hallucinationCount / totalRequests,
intent_accuracy: correctIntents / totalIntents,
// Reliability
error_rate: errorCount / totalRequests,
timeout_rate: timeoutCount / totalRequests,
success_rate: successCount / totalRequests
};
}
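The percentile helper is not shown elsewhere; a minimal nearest-rank sketch, assuming a non-empty sample array:
function percentile(samples: number[], p: number): number {
// Nearest-rank percentile over a sorted copy.
const sorted = [...samples].sort((a, b) => a - b);
const rank = Math.ceil((p / 100) * sorted.length) - 1;
return sorted[Math.max(0, rank)];
}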
Alert Integration
// Trigger alerts on threshold breach
if (metrics.ttfc_p95 > THRESHOLDS.TTFC_P95) {
alert({
type: 'performance',
severity: 'warning',
metric: 'ttfc_p95',
value: metrics.ttfc_p95,
threshold: THRESHOLDS.TTFC_P95,
action: 'Investigate backend latency'
});
}
Related Documentation
- Overview - Quality philosophy
- Frontend Guardrails - Phase 1
- Context Guardrails - Phase 2
- Search Guardrails - Phase 3
- Response Guardrails - Phase 4
- Validation Guardrails - Phase 5