AI Evaluation Scenarios
The Kingisoovitaja system uses a comprehensive scenario-based evaluation framework to validate the entire AI pipeline from user input to product recommendations. This document explains the evaluation methodology and how it ensures production-quality AI behavior.
Overview
Why Scenario-Based Evaluation?
Traditional unit tests validate individual functions in isolation. However, AI systems require end-to-end validation because:
- Pipeline Interdependencies — Intent detection affects context extraction, which affects search, which affects AI response
- Language Complexity — Estonian morphology and compound words require real-world query testing
- Context Accumulation — Multi-dimensional queries (age + occasion + budget + interest) must work together
- Edge Case Coverage — Negations, constraints, and low-signal queries need explicit validation
Evaluation Architecture
Test Suite Location
qa-surface/test-gift-context-understanding.ts
Pipeline Coverage
Each test scenario:
- Sends a real HTTP POST request to /api/chat
- Parses SSE stream responses
- Extracts intent and product type from headers
- Counts returned products from metadata events
- Measures Time To First Character (TTFC)
- Validates against expected outcomes
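The parsing and counting steps above can be sketched as follows. This is a minimal illustration rather than the suite's actual code: the `metadata` event name and the `products` array in its JSON payload are assumptions, and `parseSseEvents` and `countProducts` are hypothetical helper names.

```typescript
// Hypothetical event shape; the real /api/chat payloads may differ.
interface SseEvent {
  event: string;
  data: string;
}

// Split a raw SSE buffer into events. Events are separated by a blank line
// and each field is written as "name: value".
function parseSseEvents(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of raw.split(/\r?\n\r?\n/)) {
    let event = 'message'; // default event type when no "event:" field is sent
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      const idx = line.indexOf(':');
      if (idx <= 0) continue; // skips blank lines, comments (":...") and bare field names
      const field = line.slice(0, idx);
      const value = line.slice(idx + 1).replace(/^ /, ''); // one leading space is stripped
      if (field === 'event') event = value;
      else if (field === 'data') dataLines.push(value);
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join('\n') });
    }
  }
  return events;
}

// Assumption: product lists arrive in "metadata" events as { products: [...] }.
function countProducts(events: SseEvent[]): number {
  let total = 0;
  for (const e of events) {
    if (e.event !== 'metadata') continue;
    try {
      const payload = JSON.parse(e.data);
      if (Array.isArray(payload.products)) total += payload.products.length;
    } catch {
      // Ignore payloads that are not valid JSON.
    }
  }
  return total;
}

// Illustrative input, not captured from the real endpoint.
const sample =
  'event: metadata\ndata: {"products":[{"id":1},{"id":2}]}\n\n' +
  'data: Tere\n\n';
const parsed = parseSseEvents(sample);
console.log(parsed.length, countProducts(parsed));
```

In a real run the buffer would be filled incrementally from the response stream, and TTFC would be taken as the time between sending the request and receiving the first text event.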
Scenario Categories
1. Demographics Testing (Age-Based)
Tests the system's ability to understand recipient age and map to appropriate products.
| Age Group | Example Query | Expected Behavior |
|---|---|---|
| Infant (0-1) | "Kingitus 6-kuusele beebile" | Mängud (toys) |
| Toddler (2-3) | "Kingitus 3-aastasele lapsele" | Mängud (toys) |
| Child (4-9) | "Kingitus 7-aastasele" | Mängud or Raamat |
| Pre-teen (10-12) | "Kingitus 12-aastasele poisile" | Tehnika or Mängud |
| Teenager (13-17) | "Kingitus 15-aastasele" | Tehnika or Mängud |
| Young Adult (18-25) | "Kingitus 25-aastasele" | Kingitused |
| Adult (26-59) | "Kingitus 35-aastasele" | Kingitused, Kodu ja aed |
| Senior (60+) | "Kingitus 70-aastasele vanaemale" | Kingitused, Kodu ja aed |
Pipeline Components Tested:
- Deterministic demographics extraction
- Age-to-product-type mapping
- LLM query routing based on age context
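As a rough sketch of the deterministic part, the age groups in the table can be expressed as a lookup. The function names and the regular expression are assumptions made for illustration; the real extractor is not shown in this document and presumably handles more forms.

```typescript
// Acceptable product types per age group, copied from the table above.
const AGE_GROUPS: Array<{ label: string; min: number; max: number; types: string[] }> = [
  { label: 'Infant', min: 0, max: 1, types: ['Mängud'] },
  { label: 'Toddler', min: 2, max: 3, types: ['Mängud'] },
  { label: 'Child', min: 4, max: 9, types: ['Mängud', 'Raamat'] },
  { label: 'Pre-teen', min: 10, max: 12, types: ['Tehnika', 'Mängud'] },
  { label: 'Teenager', min: 13, max: 17, types: ['Tehnika', 'Mängud'] },
  { label: 'Young Adult', min: 18, max: 25, types: ['Kingitused'] },
  { label: 'Adult', min: 26, max: 59, types: ['Kingitused', 'Kodu ja aed'] },
  { label: 'Senior', min: 60, max: Infinity, types: ['Kingitused', 'Kodu ja aed'] },
];

// Pull an age in whole years out of forms like "7-aastasele" or "6-kuusele".
// Months ("kuuse...") are rounded down to years.
function extractAgeYears(query: string): number | null {
  const m = query.toLowerCase().match(/(\d+)-(aastase|kuuse)/);
  if (!m) return null;
  const n = parseInt(m[1], 10);
  return m[2] === 'kuuse' ? Math.floor(n / 12) : n;
}

// Look up the acceptable product types for an age; empty if no group matches.
function acceptableTypesForAge(age: number): string[] {
  const group = AGE_GROUPS.find(g => age >= g.min && age <= g.max);
  return group ? group.types : [];
}

const age = extractAgeYears('Kingitus 12-aastasele poisile');
console.log(age, age === null ? [] : acceptableTypesForAge(age));
```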
2. Occasion Recognition
Tests detection of Estonian holidays and gift-giving occasions.
// Example scenario structure
{
scenarioNumber: 10,
description: "Valentine's Day romantic",
userMessage: "Valentinipäeva kingitus kallimale",
expectedContext: {
acceptableProductTypes: ['Kingitused', 'Ilu ja stiil', 'Joodav ja söödav'],
expectedOccasion: 'valentinipäev',
shouldExcludeBooks: true,
minProducts: 1
},
tags: ['occasion-valentines', 'relationship-partner']
}
Occasions Covered:
- Valentinipäev (Valentine's Day)
- Jõulud (Christmas)
- Emadepäev (Mother's Day)
- Isadepäev (Father's Day)
- Sünnipäev (Birthday)
- Sissekolimine (Housewarming)
- Lõpetamine (Graduation)
- Pulmad (Wedding)
- Aastapäev (Anniversary)
- Pension (Retirement)
3. Budget Constraint Handling
Tests price signal detection and budget filtering.
| Budget Signal | Example | Expected Behavior |
|---|---|---|
| Low (under 10€) | "Odav kingitus alla 10 euro" | Filter to budget range |
| Mid (10-30€) | "Eelarve 15 eurot" | Apply budget constraint |
| High (30-50€) | "Kuni 45 euro" | Quality items within budget |
| Price adjectives | "Odav" / "Kallis" | Interpret as budget signal |
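A minimal sketch of how such signals might be detected is shown below. The `BudgetSignal` shape, the `extractBudget` name and the regular expressions are assumptions for illustration and are not taken from the production code.

```typescript
interface BudgetSignal {
  max?: number;           // explicit ceiling in euros
  level?: 'low' | 'high'; // inferred from a price adjective
}

function extractBudget(query: string): BudgetSignal {
  const text = query.toLowerCase();
  const signal: BudgetSignal = {};

  // Matches "10 euro", "15 eurot", "45 eur", "30€" and "30 €".
  const amount = text.match(/(\d+(?:[.,]\d+)?)\s*(?:€|eur)/);
  if (amount) signal.max = parseFloat(amount[1].replace(',', '.'));

  // Price adjectives. Note that "kallis" can also mean "dear", so a real
  // detector would need more context than this sketch uses.
  if (/\bodav/.test(text)) signal.level = 'low';
  else if (/\bkallis\b/.test(text)) signal.level = 'high';

  return signal;
}

console.log(extractBudget('Odav kingitus alla 10 euro'));
console.log(extractBudget('Kuni 45 euro'));
```

Requiring a currency marker after the number keeps ages such as "15-aastasele" from being read as a budget.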
4. Gender-Aware Recommendations
Tests gender detection from Estonian morphology.
| Gender | Indicators | Product Type Tendency |
|---|---|---|
| Male (poiss/mees) | "-le poisile", "-le mehele" | Tehnika, Mängud |
| Female (tüdruk/naine) | "-le tüdrukule", "-le naisele" | Ilu ja stiil, Kingitused |
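The indicator forms in the table could be matched with a simple token lookup, sketched below. The `GENDER_FORMS` map and `detectGender` name are hypothetical, and only the four forms listed above are covered.

```typescript
type Gender = 'male' | 'female';

// Allative ("-le") forms from the table above; a real detector would need more nouns.
const GENDER_FORMS: Record<string, Gender> = {
  poisile: 'male',
  mehele: 'male',
  tüdrukule: 'female',
  naisele: 'female',
};

function detectGender(query: string): Gender | null {
  // Split on anything that is not a lowercase Estonian letter.
  const tokens = query.toLowerCase().split(/[^a-zäöüõšž]+/);
  for (const token of tokens) {
    // hasOwnProperty avoids accidental hits on inherited keys such as "constructor".
    if (Object.prototype.hasOwnProperty.call(GENDER_FORMS, token)) {
      return GENDER_FORMS[token];
    }
  }
  return null;
}

console.log(detectGender('Kingitus 12-aastasele poisile'));
```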
5. Recipient/Relationship Context
Tests relationship-based product selection.
| Relationship | Estonian Term | Product Type |
|---|---|---|
| Friend | sõber/sõbrale | Kingitused |
| Colleague | kolleeg/kolleegile | Kontorikaup |
| Boss | ülemus/ülemusele | Kontorikaup |
| Sibling | vend/õde | Kingitused |
| Partner | kallim | Kodu ja aed, Ilu ja stiil |
| Teacher | õpetaja | Kontorikaup |
6. Hobby/Interest Detection
Tests hobby keyword recognition and mapping.
const HOBBY_MAPPINGS = {
'lugeda': 'Raamat', // Reading
'fotograafia': 'Tehnika', // Photography
'treenimas': 'Tehnika', // Fitness
'kohvi': 'Joodav ja söödav', // Coffee
'lauamäng': 'Mängud', // Board games
'reisida': 'Kingitused', // Travel
'tee': 'Kodu ja aed', // Tea
'käsitöö': 'Kontorikaup', // Crafts
'jooga': 'Sport ja harrastused' // Yoga
};
7. Negation/Constraint Handling
Tests the unified NegationService — a comprehensive negation detection system that identifies product types users want to exclude from search results.
NegationService Architecture
Negation Test Suite Location
qa-surface/test-negation-service.ts
Run Commands:
# Run all negation tests
npx tsx qa-surface/test-negation-service.ts
# Run specific scenario
npx tsx qa-surface/test-negation-service.ts --scenario=11
# Run by tag (estonian, english, reverse, regression)
npx tsx qa-surface/test-negation-service.ts --tag=estonian
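The `--scenario` and `--tag` flags suggest a filtering step before the run. One way this could work is sketched below; `parseFilter` and `selectScenarios` are hypothetical names, and the actual script may parse its arguments differently.

```typescript
interface Scenario {
  scenarioNumber: number;
  tags: string[];
}

interface RunFilter {
  scenario?: number;
  tag?: string;
}

// Parse "--scenario=11" and "--tag=estonian" style flags; other arguments are ignored.
function parseFilter(argv: string[]): RunFilter {
  const filter: RunFilter = {};
  for (const arg of argv) {
    const m = arg.match(/^--(scenario|tag)=(.+)$/);
    if (!m) continue;
    if (m[1] === 'scenario') filter.scenario = parseInt(m[2], 10);
    else filter.tag = m[2];
  }
  return filter;
}

// Keep only the scenarios that satisfy every filter that was given.
function selectScenarios<T extends Scenario>(all: T[], filter: RunFilter): T[] {
  return all.filter(s =>
    (filter.scenario === undefined || s.scenarioNumber === filter.scenario) &&
    (filter.tag === undefined || s.tags.includes(filter.tag))
  );
}

// In a real script the flags would come from process.argv.slice(2).
const demo: Scenario[] = [
  { scenarioNumber: 11, tags: ['estonian', 'ei-taha', 'reverse'] },
  { scenarioNumber: 15, tags: ['english'] },
];
console.log(selectScenarios(demo, parseFilter(['--tag=estonian'])));
```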
Estonian Negation Patterns
The deterministic detector supports three primary Estonian negation constructs:
| Pattern | Trigger Words | Example | Detected Type |
|---|---|---|---|
| mitte | mitte, kindlasti mitte | "mitte raamatuid" | Raamat |
| ei taha | ei taha, ei soovi | "ei taha mänge" | Mängud |
| ilma | ilma...ta | "ilma tehnikata" | Tehnika |
Standard Word Order:
"Kingitus emale, mitte raamatuid" → Excludes: Raamat
"Otsin kingitust, ei taha tehnikat" → Excludes: Tehnika
Reverse Word Order (Estonian-specific):
"Raamatuid ei taha" → Excludes: Raamat
"Mänge ei soovi" → Excludes: Mängud
English Negation Patterns
| Pattern | Example | Detected Type |
|---|---|---|
| no | "no books please" | Raamat |
| not | "not electronics" | Tehnika |
| without | "without games" | Mängud |
| don't want | "don't want books" | Raamat |
| avoid | "avoid cosmetics" | Ilu ja stiil |
All 12 Product Types Supported
| Product Type | Estonian Aliases | English Aliases |
|---|---|---|
| Raamat | raamat, raamatu, raamatuid | book, books |
| Mängud | mäng, mängu, mänge | game, games, toys |
| Tehnika | tehnika, tehnikat, elektroonikat | electronics, tech |
| Ilu ja stiil | kosmeetika, ilu, iluasi | beauty, cosmetics |
| Joodav ja söödav | söök, jook, toitu | food, drink |
| Kontorikaup | kontor, kontoritarbed | office, stationery |
| Sport ja harrastused | sport, sporditarbed | sports, fitness |
| Film | film, filme, dvd | film, movie, dvd |
| Muusika | muusika, plaat, vinüül | music, vinyl |
| Kinkekaart | kinkekaart, kinkekaarte | gift card, voucher |
| Kodu ja aed | kodu, aed, kodutarbed | home, garden, decor |
| Kingitused | kingitus, kingitusi | gift, gifts |
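To make the Estonian patterns and the alias lookup concrete, here is a minimal sketch of a deterministic detector. It is not the NegationService implementation: the names, the regular expressions and the prefix-matching shortcut are assumptions, and only three of the twelve product types are included.

```typescript
// Subset of the alias table; prefixes stand in for inflected forms
// (for example "raamat" also covers "raamatuid").
const TYPE_PREFIXES: Array<[string, string]> = [
  ['raamat', 'Raamat'],
  ['mäng', 'Mängud'],
  ['tehnika', 'Tehnika'],
];

const WORD = '[a-zäöüõšž]+';
// Trigger followed by the rejected noun: "mitte X", "ei taha X", "ilma Xta".
// The leading \b keeps "ilma" from matching inside words such as "silma".
const FORWARD = new RegExp(`\\b(?:mitte|ei\\s+taha|ei\\s+soovi|ilma)\\s+(${WORD})`, 'g');
// Noun followed by the trigger (reverse word order): "X ei taha", "X ei soovi".
const REVERSE = new RegExp(`(${WORD})\\s+ei\\s+(?:taha|soovi)`, 'g');

function resolveType(word: string): string | null {
  for (const [prefix, type] of TYPE_PREFIXES) {
    if (word.startsWith(prefix)) return type;
  }
  return null;
}

function detectExcludedTypes(query: string): string[] {
  const text = query.toLowerCase();
  const found: string[] = [];
  for (const pattern of [FORWARD, REVERSE]) {
    pattern.lastIndex = 0; // global regexes keep state between calls
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(text)) !== null) {
      const type = resolveType(m[1]);
      if (type && !found.includes(type)) found.push(type);
    }
  }
  return found;
}

console.log(detectExcludedTypes('Kingitus emale, mitte raamatuid'));
console.log(detectExcludedTypes('Raamatuid ei taha, otsin kingitust sõbrale'));
```

Because both patterns require whitespace between the trigger and the noun, a comma acts as a clause boundary: in "Otsin kingitust, ei taha tehnikat" the reverse pattern does not pick up "kingitust".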
Negation Scenario Coverage
| Category | Scenarios | Description |
|---|---|---|
| Estonian "mitte" | 4 | Standard "not X" patterns |
| Estonian "ei taha" | 4 | "Don't want X" patterns |
| Estonian "ilma" | 2 | "Without X" patterns |
| English patterns | 4 | no/not/without/don't want |
| Reverse word order | 2 | "Raamatuid ei taha" style |
| Complex (multi-context) | 3 | Age + negation, occasion + negation |
| No negation | 2 | Verify no false positives |
| Regression | 2 | Known problematic patterns |
| Total | 21 | 100% pass rate |
The category counts sum to 23 rather than 21, which suggests that some scenarios are counted in more than one category; scenario 11, for example, is tagged both ei-taha and reverse.
Validation Criteria
Each negation scenario validates:
- Exclusion Detection — Product type correctly identified for exclusion
- Search Filtering — Excluded types do NOT appear in results
- Constraint Generation — exclude_product_type:X constraints created
- Minimum Products — Alternative products still returned
// Validation logic
if (scenario.expectedNegation.mustNotIncludeTypes) {
  for (const excludedType of scenario.expectedNegation.mustNotIncludeTypes) {
    const found = productTypes.some(pt =>
      pt.toLowerCase() === excludedType.toLowerCase()
    );
    if (found) {
      // FAIL: Excluded type found in results
      allPassed = false;
    }
  }
}
Example Scenario Structure
{
  scenarioNumber: 11,
  description: 'ET: raamatuid ei taha (reverse)',
  userMessage: 'Raamatuid ei taha, otsin kingitust sõbrale',
  expectedNegation: {
    shouldDetectNegation: true,
    expectedExcludedTypes: ['Raamat'],
    mustNotIncludeTypes: ['Raamat'],
    minProducts: 1
  },
  tags: ['estonian', 'ei-taha', 'reverse']
}
Performance
| Metric | Target | Actual |
|---|---|---|
| Deterministic detection | < 10ms | ~5ms |
| LLM timeout | 500ms max | Non-blocking |
| Full detection | < 550ms | ~5ms (deterministic) |
Backward Compatibility
The NegationService maintains compatibility with legacy constraints:
// Legacy constraint (still supported)
constraints.includes('EXCLUDE_BOOKS')
// New granular constraints (generated by NegationService)
'exclude_product_type:Raamat'
'exclude_product_type:Mängud'
'exclude_category:romance'
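A consumer that has to handle both formats could normalize them first, roughly as sketched here. The mapping of `EXCLUDE_BOOKS` to `Raamat` is an assumption based on the alias table, and `excludedProductTypes` is a hypothetical helper name.

```typescript
// Assumption: the legacy EXCLUDE_BOOKS flag corresponds to the Raamat product type.
const LEGACY_EXCLUSIONS: Record<string, string> = {
  EXCLUDE_BOOKS: 'Raamat',
};

const TYPE_PREFIX = 'exclude_product_type:';

// Collect excluded product types from a mixed list of legacy and granular constraints.
// Other constraints (for example exclude_category:...) are left for other handlers.
function excludedProductTypes(constraints: string[]): string[] {
  const types: string[] = [];
  for (const c of constraints) {
    let type: string | undefined;
    if (c.startsWith(TYPE_PREFIX)) {
      type = c.slice(TYPE_PREFIX.length);
    } else if (Object.prototype.hasOwnProperty.call(LEGACY_EXCLUSIONS, c)) {
      type = LEGACY_EXCLUSIONS[c];
    }
    if (type && !types.includes(type)) types.push(type);
  }
  return types;
}

console.log(excludedProductTypes([
  'EXCLUDE_BOOKS',
  'exclude_product_type:Mängud',
  'exclude_category:romance',
]));
```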
8. Low-Signal/Vague Query Handling
Tests graceful degradation with minimal context.
| Context Level | Example | Expected Behavior |
|---|---|---|
| Age only | "Kingitus 9-aastasele" | Use age for product type |
| Occasion only | "Sünnipäevakingitus" | Use occasion context |
| Generic | "Lihtne kingiidee" | Default to Kingitused |
| Hobby only | "Astronoomia huvilisele" | Map hobby to products |
9. Multi-Language Support
Tests Estonian/English mixed queries.
// Mixed language scenarios
"Gift for traveler, eelarve 60 euro" // English + Estonian budget
"English sci-fi book for 14-year-old boy" // Full English
"Gift for a developer, no books please" // English with constraint
Validation Criteria
Primary Validations
Each scenario validates:
- Product Type Match — Response product type in acceptable list
- Unknown Type Prevention — disallowUnknownProductType: true fails on "unknown"
- Minimum Products — At least N products returned (usually 1-3)
Secondary Validations (Warnings)
- Product count below expected (database coverage issue)
- TTFC exceeding thresholds
Validation Code
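// Product type validation: prefer the type used by the search, fall back to the logged type
const actualProductType = result.logs.productMetadata?.[0]?.searchParams?.product_type
  || result.logs.productType;
// Treat a missing or empty value the same as "unknown", as the failure message states
const isUnknown = !actualProductType || actualProductType === 'unknown';
if (scenario.expectedContext.disallowUnknownProductType && isUnknown) {
  validations.push('❌ ProductType: unknown/empty not allowed');
  allPassed = false;
} else if (scenario.expectedContext.acceptableProductTypes) {
  const match = scenario.expectedContext.acceptableProductTypes.includes(actualProductType);
  if (!match) {
    // Record why the scenario failed instead of failing silently
    validations.push(`❌ ProductType: got ${actualProductType}`);
    allPassed = false;
  }
}
// Product count validation (warning only, does not fail the scenario)
if (scenario.expectedContext.minProducts && productCount < scenario.expectedContext.minProducts) {
  validations.push(`⚠️ Products: expected ≥${scenario.expectedContext.minProducts}, got ${productCount}`);
}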
Performance Metrics
TTFC (Time To First Character)
Measures responsiveness of the AI system.
| Band | Threshold | Rating |
|---|---|---|
| Fast | under 5s | Excellent |
| Normal | 5-8s | Good |
| Slow | 8-12s | Acceptable |
| Very Slow | over 12s | Needs optimization |
Current Performance Distribution
TTFC under 5s: 18.2% of scenarios
TTFC 5-8s: 45.5% of scenarios
TTFC 8-12s: 27.3% of scenarios
TTFC over 12s: 9.1% of scenarios
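A distribution like the one above can be derived from per-scenario TTFC values with a small classifier. The sketch below assumes TTFC is recorded in milliseconds and that a value on a shared boundary (5s, 8s) falls into the slower range; neither detail is stated in this document, and the sample values are made up.

```typescript
type TtfcBand = 'Fast' | 'Normal' | 'Slow' | 'Very Slow';

// Thresholds from the table above. Assumption: 5s counts as Normal, 8s as Slow,
// and exactly 12s still counts as Slow because Very Slow is "over 12s".
function classifyTtfc(ms: number): TtfcBand {
  if (ms < 5000) return 'Fast';
  if (ms < 8000) return 'Normal';
  if (ms <= 12000) return 'Slow';
  return 'Very Slow';
}

// Share of samples per band, formatted with one decimal place.
function ttfcDistribution(samplesMs: number[]): Record<TtfcBand, string> {
  const counts: Record<TtfcBand, number> = { Fast: 0, Normal: 0, Slow: 0, 'Very Slow': 0 };
  for (const ms of samplesMs) counts[classifyTtfc(ms)] += 1;
  const pct = (n: number): string =>
    samplesMs.length === 0 ? '0.0%' : `${((n / samplesMs.length) * 100).toFixed(1)}%`;
  return {
    Fast: pct(counts.Fast),
    Normal: pct(counts.Normal),
    Slow: pct(counts.Slow),
    'Very Slow': pct(counts['Very Slow']),
  };
}

// Illustrative values only, not measured results.
console.log(ttfcDistribution([3000, 6000, 9000, 13000]));
```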
Running the Evaluation
Prerequisites
# Ensure dev server is running
npm run dev
# Server must be accessible at localhost:3000
Execute Test Suite
# Run full evaluation
npx tsx qa-surface/test-gift-context-understanding.ts
# Results saved to test-results/gift-context-{timestamp}/
Output Structure
test-results/gift-context-{timestamp}/
├── summary.json # Overall results
├── scenario-1.json # Individual scenario details
├── scenario-2.json
├── ...
└── scenario-99.json
Summary JSON Schema
interface TestSummary {
  timestamp: string;
  totalScenarios: number;
  passed: number;
  failed: number;
  successRate: string;
  results: Array<{
    scenarioNumber: number;
    description: string;
    tags: string[];
    success: boolean;
    intent: string;
    productType: string;
    productsReturned: number;
    ttfc: number;
    failureReason?: string;
  }>;
}
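For orientation, a summary of this shape could be assembled from the per-scenario results as sketched below. The `buildSummary` helper and the one-decimal percentage format of `successRate` are assumptions; the schema only states that the field is a string.

```typescript
interface ScenarioResult {
  scenarioNumber: number;
  description: string;
  tags: string[];
  success: boolean;
  intent: string;
  productType: string;
  productsReturned: number;
  ttfc: number;
  failureReason?: string;
}

interface Summary {
  timestamp: string;
  totalScenarios: number;
  passed: number;
  failed: number;
  successRate: string;
  results: ScenarioResult[];
}

// Assumption: successRate is a percentage string with one decimal, e.g. "50.0%".
function buildSummary(results: ScenarioResult[], now: Date = new Date()): Summary {
  const total = results.length;
  const passed = results.filter(r => r.success).length;
  const rate = total === 0 ? 0 : (passed / total) * 100;
  return {
    timestamp: now.toISOString(),
    totalScenarios: total,
    passed,
    failed: total - passed,
    successRate: `${rate.toFixed(1)}%`,
    results,
  };
}
```

The result can then be serialized with `JSON.stringify` and written to summary.json in the output directory.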
Results Interpretation
Success Criteria
| Metric | Target | Current |
|---|---|---|
| Overall Pass Rate | 95%+ | 100% |
| Intent Detection | 98%+ | 100% |
| Product Type Accuracy | 95%+ | 100% |
| TTFC under 10s | 80%+ | 90.9% |
Product Type Distribution (99 scenarios)
| Product Type | Count | % |
|---|---|---|
| Kingitused | 24 | 24.2% |
| Raamat | 15 | 15.2% |
| Mängud | 13 | 13.1% |
| Kontorikaup | 13 | 13.1% |
| Kodu ja aed | 12 | 12.1% |
| Tehnika | 10 | 10.1% |
| Joodav ja söödav | 5 | 5.1% |
| Ilu ja stiil | 4 | 4.0% |
| Film | 2 | 2.0% |
| Muusika | 1 | 1.0% |
Intent Distribution
| Intent | Count | % |
|---|---|---|
| general_gift | 44 | 44.4% |
| product_inquiry | 15 | 15.2% |
| birthday_gift | 6 | 6.1% |
| Specific occasion intents | 34 | 34.3% |
Adding New Scenarios
Scenario Structure
interface GiftTestScenario {
  scenarioNumber: number;
  description: string;
  userMessage: string;                      // Estonian or English query
  expectedContext: {
    expectedIntent?: string;                // Exact intent match
    acceptableIntents?: string[];           // Any of these intents OK
    expectedProductType?: string;           // Exact product type
    acceptableProductTypes?: string[];      // Any of these types OK
    expectedOccasion?: string;              // Occasion detection
    expectedAge?: number | { min: number; max: number };
    expectedGender?: string;
    expectedBudget?: { max?: number };
    shouldExcludeBooks?: boolean;           // Book exclusion check
    minProducts?: number;                   // Minimum products returned
    mustIncludeConstraints?: string[];      // Required constraints
    disallowUnknownProductType?: boolean;   // Fail on "unknown"
  };
  tags: string[];                           // Categorization tags
}
Example: Adding a New Scenario
{
  scenarioNumber: 100,
  description: 'Pet lover gift with budget',
  userMessage: 'Kingitus lemmikloomaarmastajale, eelarve 25 eurot',
  expectedContext: {
    acceptableProductTypes: ['Kingitused', 'Kodu ja aed'],
    expectedBudget: { max: 25 },
    minProducts: 1,
    disallowUnknownProductType: true
  },
  tags: ['hobby-pets', 'budget-low', 'new-scenario']
}
Pipeline Components Validated
1. Intent Detection Layer
Validated Intents:
- general_gift, birthday_gift, valentines_day_gift
- holiday_gift, mothers_day_gift, fathers_day_gift
- housewarming_gift, wedding_gift, graduation_gift
- baby_gift, retirement_gift, promotion_gift
- product_search, product_inquiry
2. Context Extraction Layer
Extracted Context Fields:
- productType — Product category
- csv_category — Specific category mapping
- budget — Price constraints
- age — Recipient age
- gender — Recipient gender
- occasion — Gift occasion
- interests — Hobbies and interests
- constraints — Exclusions and requirements
3. Search Layer
Validated Behaviors:
- Product type filtering works correctly
- Budget constraints applied
- Exclusion filters (no books, no games) work
- Minimum product count returned
4. Response Generation
Validated Behaviors:
- SSE streaming functional
- Product metadata events emitted
- Headers contain intent and product type
- TTFC within acceptable range
Continuous Improvement
When to Add Scenarios
- New feature added — Add scenarios covering the feature
- Bug discovered — Add regression scenario
- Edge case found — Add edge case coverage
- New product type — Add type-specific scenarios
- Language expansion — Add language-specific tests
Maintenance Guidelines
- Run before deployment — All 99 scenarios must pass
- Review failures — Investigate any regression
- Update expectations — Adjust if business logic changes
- Track TTFC trends — Monitor performance over time
Related Documentation
- Testing Strategy — Overall testing approach
- Performance Monitoring — Metrics and benchmarks
- Context Guardrails — Context extraction rules
- Estonian Best Practices — Language handling
Conclusion
The evaluation framework provides comprehensive coverage of the gift recommendation pipeline:
Test Suites
| Suite | Scenarios | Pass Rate | Focus |
|---|---|---|---|
| Gift Context Understanding | 99 | 100% | Full pipeline validation |
| Negation Service | 21 | 100% | Exclusion detection |
| Total | 120 | 100% | End-to-end coverage |
Key Capabilities Validated
- ✅ 100% pass rate across all scenarios
- ✅ Full pipeline validation from query to response
- ✅ Multi-dimensional testing (age, gender, occasion, budget, interest, constraints)
- ✅ Estonian-first query handling with English fallback
- ✅ Negation/exclusion handling for all 12 product types
- ✅ Reverse word order Estonian pattern support
- ✅ Performance monitoring via TTFC metrics
This evaluation framework ensures production-quality AI behavior and catches regressions before deployment.