AI Evaluation Scenarios

The Kingisoovitaja system uses a comprehensive scenario-based evaluation framework to validate the entire AI pipeline from user input to product recommendations. This document explains the evaluation methodology and how it ensures production-quality AI behavior.

Overview

Why Scenario-Based Evaluation?

Traditional unit tests validate individual functions in isolation. However, AI systems require end-to-end validation because:

  1. Pipeline Interdependencies — Intent detection affects context extraction, which affects search, which affects AI response
  2. Language Complexity — Estonian morphology and compound words require real-world query testing
  3. Context Accumulation — Multi-dimensional queries (age + occasion + budget + interest) must work together
  4. Edge Case Coverage — Negations, constraints, and low-signal queries need explicit validation

Evaluation Architecture

Test Suite Location

qa-surface/test-gift-context-understanding.ts

Pipeline Coverage

Each test scenario:

  1. Sends a real HTTP POST request to /api/chat
  2. Parses SSE stream responses
  3. Extracts intent and product type from headers
  4. Counts returned products from metadata events
  5. Measures Time To First Character (TTFC)
  6. Validates against expected outcomes
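
The stream-handling steps above can be sketched roughly as follows. This is a minimal illustration, not the suite's actual code: the SseEvent shape, the "metadata" event name, and the products field are assumptions, since the payload format is not documented here.

```typescript
// Hypothetical event shape: the source does not document the SSE payload format.
interface SseEvent {
  event: string; // value of the "event:" field, "message" when absent
  data: string;  // "data:" lines joined with newlines
}

// Splits buffered SSE text into complete events. The trailing partial event
// is returned as `rest` so the caller can prepend it to the next chunk.
function parseSseBuffer(buffer: string): { events: SseEvent[]; rest: string } {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = blocks.pop() ?? "";
  const events: SseEvent[] = [];
  for (const block of blocks) {
    let event = "message";
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) {
        event = line.slice("event:".length).trim();
      } else if (line.startsWith("data:")) {
        dataLines.push(line.slice("data:".length).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) events.push({ event, data: dataLines.join("\n") });
  }
  return { events, rest };
}

// Counts products across metadata events. The event name "metadata" and the
// `products` array field are assumptions made for this sketch.
function countProducts(events: SseEvent[]): number {
  let count = 0;
  for (const e of events) {
    if (e.event !== "metadata") continue;
    try {
      const payload = JSON.parse(e.data) as { products?: unknown[] };
      if (Array.isArray(payload.products)) count += payload.products.length;
    } catch {
      // Skip malformed JSON instead of failing the whole scenario.
    }
  }
  return count;
}
```

A runner built on this would record a timestamp before sending the POST request and another when the first non-metadata event with non-empty data arrives; the difference is the TTFC for that scenario.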

Scenario Categories

1. Demographics Testing (Age-Based)

Tests the system's ability to understand recipient age and map to appropriate products.

Age Group | Example Query | Expected Behavior
Infant (0-1) | "Kingitus 6-kuusele beebile" | Mängud (toys)
Toddler (2-3) | "Kingitus 3-aastasele lapsele" | Mängud (toys)
Child (4-9) | "Kingitus 7-aastasele" | Mängud or Raamat
Pre-teen (10-12) | "Kingitus 12-aastasele poisile" | Tehnika or Mängud
Teenager (13-17) | "Kingitus 15-aastasele" | Tehnika or Mängud
Young Adult (18-25) | "Kingitus 25-aastasele" | Kingitused
Adult (26-59) | "Kingitus 35-aastasele" | Kingitused, Kodu ja aed
Senior (60+) | "Kingitus 70-aastasele vanaemale" | Kingitused, Kodu ja aed

Pipeline Components Tested:

  • Deterministic demographics extraction
  • Age-to-product-type mapping
  • LLM query routing based on age context
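
As an illustration of what deterministic demographics extraction could look like, the sketch below pulls an age out of forms such as "7-aastasele" or "6-kuusele" and maps it to the age groups in the table above. The function names, the regular expressions, and the treatment of ages between the listed ranges (for example 1.5 years) are assumptions, not the system's actual implementation.

```typescript
type AgeGroup =
  | "infant" | "toddler" | "child" | "pre-teen"
  | "teenager" | "young-adult" | "adult" | "senior";

// Returns the recipient age in years, or null when no age is mentioned.
// Handles "N-aastasele" (years) and "N-kuusele" (months) style forms.
function extractAgeYears(query: string): number | null {
  const years = /(\d{1,3})\s*-?\s*aasta/i.exec(query);
  if (years) return Number(years[1]);
  const months = /(\d{1,2})\s*-?\s*kuu/i.exec(query);
  if (months) return Number(months[1]) / 12;
  return null;
}

// Boundaries follow the age-group table; fractional ages fall into the
// group whose upper bound they have not yet reached (an assumption).
function toAgeGroup(age: number): AgeGroup {
  if (age < 2) return "infant";
  if (age < 4) return "toddler";
  if (age < 10) return "child";
  if (age < 13) return "pre-teen";
  if (age < 18) return "teenager";
  if (age < 26) return "young-adult";
  if (age < 60) return "adult";
  return "senior";
}

// Acceptable product types per age group, copied from the table above.
const AGE_GROUP_PRODUCT_TYPES: Record<AgeGroup, string[]> = {
  infant: ["Mängud"],
  toddler: ["Mängud"],
  child: ["Mängud", "Raamat"],
  "pre-teen": ["Tehnika", "Mängud"],
  teenager: ["Tehnika", "Mängud"],
  "young-adult": ["Kingitused"],
  adult: ["Kingitused", "Kodu ja aed"],
  senior: ["Kingitused", "Kodu ja aed"],
};
```

Keeping the extraction as plain regular expressions makes it cheap to run before any LLM call, which is the usual reason for a deterministic first pass.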

2. Occasion Recognition

Tests detection of Estonian holidays and gift-giving occasions.

// Example scenario structure
{
  scenarioNumber: 10,
  description: "Valentine's Day romantic",
  userMessage: "Valentinipäeva kingitus kallimale",
  expectedContext: {
    acceptableProductTypes: ['Kingitused', 'Ilu ja stiil', 'Joodav ja söödav'],
    expectedOccasion: 'valentinipäev',
    shouldExcludeBooks: true,
    minProducts: 1
  },
  tags: ['occasion-valentines', 'relationship-partner']
}

Occasions Covered:

  • Valentinipäev (Valentine's Day)
  • Jõulud (Christmas)
  • Emadepäev (Mother's Day)
  • Isadepäev (Father's Day)
  • Sünnipäev (Birthday)
  • Sissekolimine (Housewarming)
  • Lõpetamine (Graduation)
  • Pulmad (Wedding)
  • Aastapäev (Anniversary)
  • Pension (Retirement)

3. Budget Constraint Handling

Tests price signal detection and budget filtering.

Budget Signal | Example | Expected Behavior
Low (under 10€) | "Odav kingitus alla 10 euro" | Filter to budget range
Mid (10-30€) | "Eelarve 15 eurot" | Apply budget constraint
High (30-50€) | "Kuni 45 euro" | Quality items within budget
Price adjectives | "Odav" / "Kallis" | Interpret as budget signal
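
A minimal sketch of price signal detection, assuming every explicit amount is treated as an upper bound and price adjectives are kept as a separate hint. The BudgetSignal type, the function name, and the regular expressions are illustrative only.

```typescript
// Hypothetical result shape: an explicit ceiling in euros, or a vague hint.
interface BudgetSignal {
  max?: number;
  hint?: "cheap" | "premium";
}

function extractBudget(query: string): BudgetSignal | null {
  const q = query.toLowerCase();
  // Matches "alla 10 euro", "kuni 45 euro", "eelarve 15 eurot", "25€".
  const amount = /(\d+(?:[.,]\d+)?)\s*(?:€|eur\w*)/.exec(q);
  if (amount) return { max: Number(amount[1].replace(",", ".")) };
  // Price adjectives only count when no explicit amount is present.
  if (/\bodav/.test(q)) return { hint: "cheap" };
  // Whole word only, so "kallimale" (to a loved one) is not misread as "expensive".
  if (/\bkallis\b/.test(q)) return { hint: "premium" };
  return null;
}
```

The whole-word check on "kallis" matters because the Valentine's Day scenarios use "kallimale", which shares the stem but carries no price meaning.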

4. Gender-Aware Recommendations

Tests gender detection from Estonian morphology.

Gender | Indicators | Product Type Tendency
Male (poiss/mees) | "-le poisile", "-le mehele" | Tehnika, Mängud
Female (tüdruk/naine) | "-le tüdrukule", "-le naisele" | Ilu ja stiil, Kingitused
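
One simple way to detect these indicators is prefix matching on word stems, so that inflected forms such as "poisile" or "naisele" are caught without a full morphological analyzer. The stem list and function below are assumptions for illustration; prefix matching is crude and can misfire on unrelated words that share a stem.

```typescript
type Gender = "male" | "female";

// Hypothetical stem list derived from the indicators in the table above.
const GENDER_STEMS: Record<Gender, string[]> = {
  male: ["pois", "mehe", "mees"],
  female: ["tüdruk", "nais", "naine"],
};

// Returns the first gender whose stem prefixes any token, or null.
function detectGender(query: string): Gender | null {
  const tokens = query.toLowerCase().split(/[\s,.;:!?"'()]+/).filter(Boolean);
  const genders: Gender[] = ["male", "female"];
  for (const token of tokens) {
    for (const gender of genders) {
      if (GENDER_STEMS[gender].some((stem) => token.startsWith(stem))) return gender;
    }
  }
  return null;
}
```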

5. Recipient/Relationship Context

Tests relationship-based product selection.

Relationship | Estonian Term | Product Type
Friend | sõber/sõbrale | Kingitused
Colleague | kolleeg/kolleegile | Kontorikaup
Boss | ülemus/ülemusele | Kontorikaup
Sibling | vend/õde | Kingitused
Partner | kallim | Kodu ja aed, Ilu ja stiil
Teacher | õpetaja | Kontorikaup

6. Hobby/Interest Detection

Tests hobby keyword recognition and mapping.

const HOBBY_MAPPINGS = {
  'lugeda': 'Raamat',              // Reading
  'fotograafia': 'Tehnika',        // Photography
  'treenimas': 'Tehnika',          // Fitness
  'kohvi': 'Joodav ja söödav',     // Coffee
  'lauamäng': 'Mängud',            // Board games
  'reisida': 'Kingitused',         // Travel
  'tee': 'Kodu ja aed',            // Tea
  'käsitöö': 'Kontorikaup',        // Crafts
  'jooga': 'Sport ja harrastused'  // Yoga
};
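
A lookup over such a mapping might work as sketched below. The matching strategy is an assumption: longer keywords match as token prefixes so inflected forms are caught, while very short keywords such as "tee" must match the whole token to avoid hits inside unrelated words. The mapping is repeated so the example is self-contained.

```typescript
const HOBBY_MAPPINGS: Record<string, string> = {
  lugeda: "Raamat",
  fotograafia: "Tehnika",
  treenimas: "Tehnika",
  kohvi: "Joodav ja söödav",
  lauamäng: "Mängud",
  reisida: "Kingitused",
  tee: "Kodu ja aed",
  käsitöö: "Kontorikaup",
  jooga: "Sport ja harrastused",
};

// Returns the distinct product types suggested by hobby keywords in the query,
// in order of first appearance.
function detectHobbyProductTypes(query: string): string[] {
  const tokens = query.toLowerCase().split(/[\s,.;:!?"'()]+/).filter(Boolean);
  const types: string[] = [];
  for (const token of tokens) {
    for (const keyword of Object.keys(HOBBY_MAPPINGS)) {
      // Keywords of three letters or fewer require an exact token match.
      const hit = keyword.length <= 3 ? token === keyword : token.startsWith(keyword);
      const type = HOBBY_MAPPINGS[keyword];
      if (hit && types.indexOf(type) === -1) types.push(type);
    }
  }
  return types;
}
```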

7. Negation/Constraint Handling

Tests the unified NegationService — a comprehensive negation detection system that identifies product types users want to exclude from search results.

NegationService Architecture

Negation Test Suite Location

qa-surface/test-negation-service.ts

Run Commands:

# Run all negation tests
npx tsx qa-surface/test-negation-service.ts

# Run specific scenario
npx tsx qa-surface/test-negation-service.ts --scenario=11

# Run by tag (estonian, english, reverse, regression)
npx tsx qa-surface/test-negation-service.ts --tag=estonian
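
The --scenario and --tag flags could be handled along these lines. This is a sketch under assumptions: the TaggedScenario and ScenarioFilter types and both function names are hypothetical, and the actual test file may parse its arguments differently.

```typescript
interface TaggedScenario {
  scenarioNumber: number;
  tags: string[];
}

interface ScenarioFilter {
  scenario?: number;
  tag?: string;
}

// Reads --scenario=<n> and --tag=<name> from an argument list.
function parseFilter(argv: string[]): ScenarioFilter {
  const filter: ScenarioFilter = {};
  for (const arg of argv) {
    if (arg.startsWith("--scenario=")) {
      const raw = arg.slice("--scenario=".length);
      const n = Number(raw);
      if (raw !== "" && Number.isInteger(n)) filter.scenario = n;
    } else if (arg.startsWith("--tag=")) {
      filter.tag = arg.slice("--tag=".length);
    }
  }
  return filter;
}

// Keeps scenarios that satisfy every filter that was provided.
function selectScenarios<T extends TaggedScenario>(all: T[], filter: ScenarioFilter): T[] {
  return all.filter(
    (s) =>
      (filter.scenario === undefined || s.scenarioNumber === filter.scenario) &&
      (filter.tag === undefined || s.tags.includes(filter.tag))
  );
}
```

In a script, the filter would typically be built with parseFilter(process.argv.slice(2)); with no flags, every scenario is selected.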

Estonian Negation Patterns

The deterministic detector supports three primary Estonian negation constructs:

Pattern | Trigger Words | Example | Detected Type
mitte | mitte, kindlasti mitte | "mitte raamatuid" | Raamat
ei taha | ei taha, ei soovi | "ei taha mänge" | Mängud
ilma | ilma ... -ta | "ilma tehnikata" | Tehnika

Standard Word Order:

"Kingitus emale, mitte raamatuid"     → Excludes: Raamat
"Otsin kingitust, ei taha tehnikat" → Excludes: Tehnika

Reverse Word Order (Estonian-specific):

"Raamatuid ei taha"                   → Excludes: Raamat
"Mänge ei soovi" → Excludes: Mängud

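The Estonian constructs above, including reverse word order, can be approximated with a small token-based detector. The sketch below is an assumption about how such a detector might work, not the NegationService itself: it covers only three product types, lists inflected aliases explicitly, and checks the word after the trigger before falling back to the word in front of it.

```typescript
// Subset of the alias table, for illustration only.
const ET_ALIASES = new Map<string, string>([
  ["raamat", "Raamat"], ["raamatu", "Raamat"], ["raamatuid", "Raamat"],
  ["mäng", "Mängud"], ["mängu", "Mängud"], ["mänge", "Mängud"],
  ["tehnika", "Tehnika"], ["tehnikat", "Tehnika"],
]);

function detectEstonianNegations(query: string): string[] {
  const tokens = query.toLowerCase().split(/[\s,.;:!?"'()]+/).filter(Boolean);
  const excluded: string[] = [];

  // Records the product type for a word; returns false if the word is no alias.
  const add = (word: string | undefined): boolean => {
    const type = word === undefined ? undefined : ET_ALIASES.get(word);
    if (type === undefined) return false;
    if (excluded.indexOf(type) === -1) excluded.push(type);
    return true;
  };

  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    const next = tokens[i + 1];
    if (token === "mitte") {
      add(next); // "mitte raamatuid"
    } else if (token === "ilma" && next !== undefined && next.endsWith("ta")) {
      add(next.slice(0, -2)); // "ilma tehnikata": strip the -ta ending
    } else if (token === "ei" && (next === "taha" || next === "soovi")) {
      // Standard order puts the object after the verb ("ei taha tehnikat");
      // reverse order puts it before "ei" ("raamatuid ei taha").
      if (!add(tokens[i + 2])) add(tokens[i - 1]);
    }
  }
  return excluded;
}
```

Each detected type would then be turned into an exclude_product_type:X constraint. A production detector would need the full alias table and a way to resolve sentences where an alias appears on both sides of the trigger.
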
English Negation Patterns

Pattern | Example | Detected Type
no | "no books please" | Raamat
not | "not electronics" | Tehnika
without | "without games" | Mängud
don't want | "don't want books" | Raamat
avoid | "avoid cosmetics" | Ilu ja stiil
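
For English, a single regular expression over the trigger words is often enough. The sketch below is illustrative and rests on assumptions: the alias subset, the optional "any" after the trigger, and the extra "do not want" variant are not taken from the service.

```typescript
// Subset of the English alias table, for illustration only.
const EN_ALIASES = new Map<string, string>([
  ["book", "Raamat"], ["books", "Raamat"],
  ["electronics", "Tehnika"], ["tech", "Tehnika"],
  ["game", "Mängud"], ["games", "Mängud"], ["toys", "Mängud"],
  ["beauty", "Ilu ja stiil"], ["cosmetics", "Ilu ja stiil"],
]);

function detectEnglishNegations(query: string): string[] {
  // Longer triggers come first so "do not want" wins over the bare "not".
  // The regex is created per call because the g flag makes it stateful.
  const trigger = /\b(?:don'?t want|do not want|without|avoid|not|no)\s+(?:any\s+)?([a-z]+)/g;
  const excluded: string[] = [];
  const q = query.toLowerCase();
  let match: RegExpExecArray | null;
  while ((match = trigger.exec(q)) !== null) {
    const type = EN_ALIASES.get(match[1]);
    if (type !== undefined && excluded.indexOf(type) === -1) excluded.push(type);
  }
  return excluded;
}
```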

All 12 Product Types Supported

Product Type | Estonian Aliases | English Aliases
Raamat | raamat, raamatu, raamatuid | book, books
Mängud | mäng, mängu, mänge | game, games, toys
Tehnika | tehnika, tehnikat, elektroonikat | electronics, tech
Ilu ja stiil | kosmeetika, ilu, iluasi | beauty, cosmetics
Joodav ja söödav | söök, jook, toitu | food, drink
Kontorikaup | kontor, kontoritarbed | office, stationery
Sport ja harrastused | sport, sporditarbed | sports, fitness
Film | film, filme, dvd | film, movie, dvd
Muusika | muusika, plaat, vinüül | music, vinyl
Kinkekaart | kinkekaart, kinkekaarte | gift card, voucher
Kodu ja aed | kodu, aed, kodutarbed | home, garden, decor
Kingitused | kingitus, kingitusi | gift, gifts

Negation Scenario Coverage

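Category | Scenarios | Description
Estonian "mitte" | 4 | Standard "not X" patterns
Estonian "ei taha" | 4 | "Don't want X" patterns
Estonian "ilma" | 2 | "Without X" patterns
English patterns | 4 | no/not/without/don't want
Reverse word order | 2 | "Raamatuid ei taha" style
Complex (multi-context) | 3 | Age + negation, occasion + negation
No negation | 2 | Verify no false positives
Regression | 2 | Known problematic patterns
Total | 21 | 100% pass rate

The per-category counts add up to 23 rather than 21, presumably because a scenario can carry several tags and is then counted in more than one category (scenario 11, for example, is tagged both ei-taha and reverse).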

Validation Criteria

Each negation scenario validates:

  1. Exclusion Detection — Product type correctly identified for exclusion
  2. Search Filtering — Excluded types do NOT appear in results
  3. Constraint Generation — exclude_product_type:X constraints created
  4. Minimum Products — Alternative products still returned

// Validation logic
if (scenario.expectedNegation.mustNotIncludeTypes) {
  for (const excludedType of scenario.expectedNegation.mustNotIncludeTypes) {
    // Case-insensitive comparison against every product type in the results
    const found = productTypes.some(pt =>
      pt.toLowerCase() === excludedType.toLowerCase()
    );
    if (found) {
      // FAIL: Excluded type found in results
      allPassed = false;
    }
  }
}

Example Scenario Structure

{
  scenarioNumber: 11,
  description: 'ET: raamatuid ei taha (reverse)',
  userMessage: 'Raamatuid ei taha, otsin kingitust sõbrale',
  expectedNegation: {
    shouldDetectNegation: true,
    expectedExcludedTypes: ['Raamat'],
    mustNotIncludeTypes: ['Raamat'],
    minProducts: 1
  },
  tags: ['estonian', 'ei-taha', 'reverse']
}

Performance

Metric | Target | Actual
Deterministic detection | < 10ms | ~5ms
LLM timeout | 500ms max | Non-blocking
Full detection | < 550ms | ~5ms (deterministic)

Backward Compatibility

The NegationService maintains compatibility with legacy constraints:

// Legacy constraint (still supported)
constraints.includes('EXCLUDE_BOOKS')

// New granular constraints (generated by NegationService)
'exclude_product_type:Raamat'
'exclude_product_type:Mängud'
'exclude_category:romance'
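
One way to keep both constraint styles working is to normalize legacy constraints into the granular form before filtering. The sketch below assumes that EXCLUDE_BOOKS corresponds to the Raamat product type; that mapping and both function names are hypothetical.

```typescript
// Assumption: the legacy EXCLUDE_BOOKS flag means "exclude the Raamat type".
const LEGACY_TO_GRANULAR: Record<string, string> = {
  EXCLUDE_BOOKS: "exclude_product_type:Raamat",
};

// Rewrites legacy constraints and removes duplicates, preserving order.
function normalizeConstraints(constraints: string[]): string[] {
  const normalized: string[] = [];
  for (const constraint of constraints) {
    const isLegacy = Object.prototype.hasOwnProperty.call(LEGACY_TO_GRANULAR, constraint);
    const mapped = isLegacy ? LEGACY_TO_GRANULAR[constraint] : constraint;
    if (normalized.indexOf(mapped) === -1) normalized.push(mapped);
  }
  return normalized;
}

// Extracts the product types named by exclude_product_type:X constraints.
function excludedProductTypes(constraints: string[]): string[] {
  const prefix = "exclude_product_type:";
  return normalizeConstraints(constraints)
    .filter((c) => c.startsWith(prefix))
    .map((c) => c.slice(prefix.length));
}
```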

8. Low-Signal/Vague Query Handling

Tests graceful degradation with minimal context.

Context Level | Example | Expected Behavior
Age only | "Kingitus 9-aastasele" | Use age for product type
Occasion only | "Sünnipäevakingitus" | Use occasion context
Generic | "Lihtne kingiidee" | Default to Kingitused
Hobby only | "Astronoomia huvilisele" | Map hobby to products

9. Multi-Language Support

Tests Estonian/English mixed queries.

// Mixed language scenarios
"Gift for traveler, eelarve 60 euro" // English + Estonian budget
"English sci-fi book for 14-year-old boy" // Full English
"Gift for a developer, no books please" // English with constraint

Validation Criteria

Primary Validations

Each scenario validates:

  1. Product Type Match — Response product type in acceptable list
  2. Unknown Type Prevention — disallowUnknownProductType: true fails on "unknown"
  3. Minimum Products — At least N products returned (usually 1-3)

Secondary Validations (Warnings)

  • Product count below expected (database coverage issue)
  • TTFC exceeding thresholds

Validation Code

// Product type validation: prefer the type used by the search, fall back to the logged type
const actualProductType = result.logs.productMetadata?.[0]?.searchParams?.product_type
  || result.logs.productType;

if (scenario.expectedContext.disallowUnknownProductType
    && (!actualProductType || actualProductType === 'unknown')) {
  validations.push('❌ ProductType: unknown/empty not allowed');
  allPassed = false;
} else if (scenario.expectedContext.acceptableProductTypes) {
  const match = scenario.expectedContext.acceptableProductTypes.includes(actualProductType);
  if (!match) {
    validations.push(`❌ ProductType: "${actualProductType}" not in acceptable list`);
    allPassed = false;
  }
}

// Product count validation (warning only: does not fail the scenario)
if (scenario.expectedContext.minProducts && productCount < scenario.expectedContext.minProducts) {
  validations.push(`⚠️ Products: expected ≥${scenario.expectedContext.minProducts}, got ${productCount}`);
}

Performance Metrics

TTFC (Time To First Character)

Measures responsiveness of the AI system.

Tier | Threshold | Status
Fast | under 5s | Excellent
Normal | 5-8s | Good
Slow | 8-12s | Acceptable
Very Slow | over 12s | Needs optimization

Current Performance Distribution

TTFC under 5s:  18.2% of scenarios
TTFC 5-8s: 45.5% of scenarios
TTFC 8-12s: 27.3% of scenarios
TTFC over 12s: 9.1% of scenarios
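
The tiers and the distribution can be computed with a small helper like the one below. How the shared boundaries are resolved (exactly 5s counts as Normal, exactly 8s and 12s as Slow) is an assumption, as are the type and function names.

```typescript
type TtfcTier = "fast" | "normal" | "slow" | "very-slow";

// Maps a TTFC measurement in milliseconds to a tier.
function classifyTtfc(ms: number): TtfcTier {
  if (ms < 5000) return "fast";
  if (ms < 8000) return "normal";
  if (ms <= 12000) return "slow";
  return "very-slow";
}

// Returns the share of samples per tier, formatted to one decimal place.
function ttfcDistribution(samplesMs: number[]): Record<TtfcTier, string> {
  const counts: Record<TtfcTier, number> = { fast: 0, normal: 0, slow: 0, "very-slow": 0 };
  for (const ms of samplesMs) counts[classifyTtfc(ms)] += 1;
  const pct = (n: number): string =>
    samplesMs.length === 0 ? "0.0%" : ((100 * n) / samplesMs.length).toFixed(1) + "%";
  return {
    fast: pct(counts.fast),
    normal: pct(counts.normal),
    slow: pct(counts.slow),
    "very-slow": pct(counts["very-slow"]),
  };
}
```

Because each share is rounded independently, the four percentages may not sum to exactly 100%.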

Running the Evaluation

Prerequisites

# Ensure dev server is running
npm run dev

# Server must be accessible at localhost:3000

Execute Test Suite

# Run full evaluation
npx tsx qa-surface/test-gift-context-understanding.ts

# Results saved to test-results/gift-context-{timestamp}/

Output Structure

test-results/gift-context-{timestamp}/
├── summary.json # Overall results
├── scenario-1.json # Individual scenario details
├── scenario-2.json
├── ...
└── scenario-99.json

Summary JSON Schema

interface TestSummary {
  timestamp: string;
  totalScenarios: number;
  passed: number;
  failed: number;
  successRate: string;
  results: Array<{
    scenarioNumber: number;
    description: string;
    tags: string[];
    success: boolean;
    intent: string;
    productType: string;
    productsReturned: number;
    ttfc: number;
    failureReason?: string;
  }>;
}
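
The aggregate fields of such a summary can be derived from the per-scenario results, as in the sketch below. It uses trimmed-down hypothetical types (ScenarioOutcome, SummaryTotals) rather than the full schema, and the "NN.N%" format for successRate is an assumption.

```typescript
// Reduced result shape: only the fields needed for the totals.
interface ScenarioOutcome {
  scenarioNumber: number;
  success: boolean;
  failureReason?: string;
}

interface SummaryTotals {
  timestamp: string;
  totalScenarios: number;
  passed: number;
  failed: number;
  successRate: string;
}

// Computes the aggregate fields from individual scenario outcomes.
function summarize(results: ScenarioOutcome[], now: Date = new Date()): SummaryTotals {
  const total = results.length;
  const passed = results.filter((r) => r.success).length;
  const rate = total === 0 ? 0 : (100 * passed) / total;
  return {
    timestamp: now.toISOString(),
    totalScenarios: total,
    passed,
    failed: total - passed,
    successRate: rate.toFixed(1) + "%",
  };
}

// Lists failing scenario numbers, e.g. to block a deployment when non-empty.
function failedScenarioNumbers(results: ScenarioOutcome[]): number[] {
  return results.filter((r) => !r.success).map((r) => r.scenarioNumber);
}
```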

Results Interpretation

Success Criteria

Metric | Target | Current
Overall Pass Rate | 95%+ | 100%
Intent Detection | 98%+ | 100%
Product Type Accuracy | 95%+ | 100%
TTFC under 10s | 80%+ | 90.9%

Product Type Distribution (99 scenarios)

Product Type | Count | %
Kingitused | 24 | 24.2%
Raamat | 15 | 15.2%
Mängud | 13 | 13.1%
Kontorikaup | 13 | 13.1%
Kodu ja aed | 12 | 12.1%
Tehnika | 10 | 10.1%
Joodav ja söödav | 5 | 5.1%
Ilu ja stiil | 4 | 4.0%
Film | 2 | 2.0%
Muusika | 1 | 1.0%

Intent Distribution

Intent | Count | %
general_gift | 44 | 44.4%
product_inquiry | 15 | 15.2%
birthday_gift | 6 | 6.1%
Specific occasion intents | 34 | 34.3%

Adding New Scenarios

Scenario Structure

interface GiftTestScenario {
  scenarioNumber: number;
  description: string;
  userMessage: string;                     // Estonian or English query
  expectedContext: {
    expectedIntent?: string;               // Exact intent match
    acceptableIntents?: string[];          // Any of these intents OK
    expectedProductType?: string;          // Exact product type
    acceptableProductTypes?: string[];     // Any of these types OK
    expectedOccasion?: string;             // Occasion detection
    expectedAge?: number | { min: number; max: number };
    expectedGender?: string;
    expectedBudget?: { max?: number };
    shouldExcludeBooks?: boolean;          // Book exclusion check
    minProducts?: number;                  // Minimum products returned
    mustIncludeConstraints?: string[];     // Required constraints
    disallowUnknownProductType?: boolean;  // Fail on "unknown"
  };
  tags: string[];                          // Categorization tags
}

Example: Adding a New Scenario

{
  scenarioNumber: 100,
  description: 'Pet lover gift with budget',
  userMessage: 'Kingitus lemmikloomaarmastajale, eelarve 25 eurot',
  expectedContext: {
    acceptableProductTypes: ['Kingitused', 'Kodu ja aed'],
    expectedBudget: { max: 25 },
    minProducts: 1,
    disallowUnknownProductType: true
  },
  tags: ['hobby-pets', 'budget-low', 'new-scenario']
}
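
Before a new scenario is committed, a quick consistency check over the scenario list can catch common mistakes such as a reused scenarioNumber. The checks below are illustrative conventions rather than rules the suite is known to enforce, and the ScenarioLike type is a reduced, hypothetical version of the scenario interface.

```typescript
interface ScenarioLike {
  scenarioNumber: number;
  userMessage: string;
  expectedContext: {
    expectedProductType?: string;
    acceptableProductTypes?: string[];
    minProducts?: number;
  };
  tags: string[];
}

// Returns a list of human-readable problems; an empty list means the set looks sane.
function lintScenarios(scenarios: ScenarioLike[]): string[] {
  const problems: string[] = [];
  const seen = new Set<number>();
  for (const s of scenarios) {
    const label = `scenario ${s.scenarioNumber}`;
    if (seen.has(s.scenarioNumber)) problems.push(`${label}: duplicate scenarioNumber`);
    seen.add(s.scenarioNumber);
    if (s.userMessage.trim() === "") problems.push(`${label}: empty userMessage`);
    const ctx = s.expectedContext;
    const hasTypeExpectation =
      ctx.expectedProductType !== undefined || (ctx.acceptableProductTypes?.length ?? 0) > 0;
    if (!hasTypeExpectation) problems.push(`${label}: no product type expectation`);
    if (s.tags.length === 0) problems.push(`${label}: no tags`);
  }
  return problems;
}
```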

Pipeline Components Validated

1. Intent Detection Layer

Validated Intents:

  • general_gift, birthday_gift, valentines_day_gift
  • holiday_gift, mothers_day_gift, fathers_day_gift
  • housewarming_gift, wedding_gift, graduation_gift
  • baby_gift, retirement_gift, promotion_gift
  • product_search, product_inquiry

2. Context Extraction Layer

Extracted Context Fields:

  • productType — Product category
  • csv_category — Specific category mapping
  • budget — Price constraints
  • age — Recipient age
  • gender — Recipient gender
  • occasion — Gift occasion
  • interests — Hobbies and interests
  • constraints — Exclusions and requirements

3. Search Layer

Validated Behaviors:

  • Product type filtering works correctly
  • Budget constraints applied
  • Exclusion filters (no books, no games) work
  • Minimum product count returned

4. Response Generation

Validated Behaviors:

  • SSE streaming functional
  • Product metadata events emitted
  • Headers contain intent and product type
  • TTFC within acceptable range

Continuous Improvement

When to Add Scenarios

  1. New feature added — Add scenarios covering the feature
  2. Bug discovered — Add regression scenario
  3. Edge case found — Add edge case coverage
  4. New product type — Add type-specific scenarios
  5. Language expansion — Add language-specific tests

Maintenance Guidelines

  1. Run before deployment — All 99 scenarios must pass
  2. Review failures — Investigate any regression
  3. Update expectations — Adjust if business logic changes
  4. Track TTFC trends — Monitor performance over time

Conclusion

The evaluation framework provides comprehensive coverage of the gift recommendation pipeline:

Test Suites

Suite | Scenarios | Pass Rate | Focus
Gift Context Understanding | 99 | 100% | Full pipeline validation
Negation Service | 21 | 100% | Exclusion detection
Total | 120 | 100% | End-to-end coverage

Key Capabilities Validated

  • 100% pass rate across all scenarios
  • Full pipeline validation from query to response
  • Multi-dimensional testing (age, gender, occasion, budget, interest, constraints)
  • Estonian-first query handling with English fallback
  • Negation/exclusion handling for all 12 product types
  • Reverse word order Estonian pattern support
  • Performance monitoring via TTFC metrics

This evaluation framework ensures production-quality AI behavior and catches regressions before deployment.