AI Evaluation Scenarios

The Kingisoovitaja system uses a scenario-based evaluation framework to validate the entire AI pipeline, from user input to product recommendations. This document explains the evaluation methodology and how it is used to catch regressions before they reach production.

Overview

Why Scenario-Based Evaluation?

Traditional unit tests validate individual functions in isolation. However, AI systems require end-to-end validation because:

Pipeline Interdependencies — Intent detection affects context extraction, which affects search, which affects AI response
Language Complexity — Estonian morphology and compound words require real-world query testing
Context Accumulation — Multi-dimensional queries (age + occasion + budget + interest) must work together
Edge Case Coverage — Negations, constraints, and low-signal queries need explicit validation

Evaluation Architecture

Test Suite Location

qa-surface/test-gift-context-understanding.ts

Pipeline Coverage

Each test scenario:

Sends a real HTTP POST request to /api/chat
Parses SSE stream responses
Extracts intent and product type from headers
Counts returned products from metadata events
Measures Time To First Character (TTFC)
Validates against expected outcomes
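
The stream-parsing and product-counting steps above can be sketched as follows. This is a minimal illustration rather than the suite's actual code: the event name `metadata` and the `products` payload field are assumptions, and the real stream format may differ.

```typescript
// Hypothetical shape of one parsed SSE event; the real stream may differ.
interface SseEvent {
  event: string; // value of the "event:" field, "message" if absent
  data: string;  // "data:" lines joined with "\n"
}

// Split a raw SSE body into events. Events are separated by a blank line.
function parseSseEvents(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of raw.split(/\r?\n\r?\n/)) {
    let event = "message";
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.indexOf("event:") === 0) {
        event = line.slice("event:".length).trim();
      } else if (line.indexOf("data:") === 0) {
        // The SSE format allows one optional space after the colon.
        dataLines.push(line.slice("data:".length).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join("\n") });
    }
  }
  return events;
}

// Count products across metadata events.
// Assumption: metadata events carry JSON such as {"products": [...]}.
function countProducts(events: SseEvent[]): number {
  let total = 0;
  for (const e of events) {
    if (e.event !== "metadata") continue;
    try {
      const payload = JSON.parse(e.data);
      if (payload && Array.isArray(payload.products)) {
        total += payload.products.length;
      }
    } catch {
      // Ignore non-JSON payloads rather than failing the whole scenario.
    }
  }
  return total;
}
```

Keeping the parser free of network calls means it can be exercised against recorded stream bodies without a running server.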

Scenario Categories

1. Demographics Testing (Age-Based)

Tests the system's ability to understand recipient age and map it to appropriate product types.

Age Group	Example Query	Expected Product Types
Infant (0-1)	"Kingitus 6-kuusele beebile"	Mängud (toys)
Toddler (2-3)	"Kingitus 3-aastasele lapsele"	Mängud (toys)
Child (4-9)	"Kingitus 7-aastasele"	Mängud or Raamat
Pre-teen (10-12)	"Kingitus 12-aastasele poisile"	Tehnika or Mängud
Teenager (13-17)	"Kingitus 15-aastasele"	Tehnika or Mängud
Young Adult (18-25)	"Kingitus 25-aastasele"	Kingitused
Adult (26-59)	"Kingitus 35-aastasele"	Kingitused, Kodu ja aed
Senior (60+)	"Kingitus 70-aastasele vanaemale"	Kingitused, Kodu ja aed

Pipeline Components Tested:

Deterministic demographics extraction
Age-to-product-type mapping
LLM query routing based on age context
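
The deterministic extraction and age-to-type mapping listed above might look like the sketch below. The function names and the regular expressions are illustrative assumptions; only the age bands and product types come from the table.

```typescript
// Age bands and product types copied from the table above.
// Upper bounds are exclusive; placing fractional ages (e.g. 1.5) this way is an assumption.
const AGE_BANDS: Array<{ below: number; group: string; types: string[] }> = [
  { below: 2, group: "infant", types: ["Mängud"] },
  { below: 4, group: "toddler", types: ["Mängud"] },
  { below: 10, group: "child", types: ["Mängud", "Raamat"] },
  { below: 13, group: "pre-teen", types: ["Tehnika", "Mängud"] },
  { below: 18, group: "teenager", types: ["Tehnika", "Mängud"] },
  { below: 26, group: "young adult", types: ["Kingitused"] },
  { below: 60, group: "adult", types: ["Kingitused", "Kodu ja aed"] },
  { below: Infinity, group: "senior", types: ["Kingitused", "Kodu ja aed"] },
];

// Extract the recipient age in years from an Estonian query.
// Handles "7-aastasele" (years) and "6-kuusele" (months); returns null if no age is found.
function extractAgeYears(query: string): number | null {
  const years = /(\d{1,3})\s*-?\s*aasta/i.exec(query);
  if (years) return parseInt(years[1], 10);
  const months = /(\d{1,2})\s*-?\s*kuu/i.exec(query);
  if (months) return parseInt(months[1], 10) / 12;
  return null;
}

// Look up the acceptable product types for an age in years.
function acceptableTypesForAge(ageYears: number): string[] {
  for (const band of AGE_BANDS) {
    if (ageYears < band.below) return band.types;
  }
  return []; // only reached for NaN input
}
```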

2. Occasion Recognition

Tests detection of Estonian holidays and gift-giving occasions.

// Example scenario structure
{
  scenarioNumber: 10,
  description: "Valentine's Day romantic",
  userMessage: "Valentinipäeva kingitus kallimale",
  expectedContext: {
    acceptableProductTypes: ['Kingitused', 'Ilu ja stiil', 'Joodav ja söödav'],
    expectedOccasion: 'valentinipäev',
    shouldExcludeBooks: true,
    minProducts: 1
  },
  tags: ['occasion-valentines', 'relationship-partner']
}

Occasions Covered:

Valentinipäev (Valentine's Day)
Jõulud (Christmas)
Emadepäev (Mother's Day)
Isadepäev (Father's Day)
Sünnipäev (Birthday)
Sissekolimine (Housewarming)
Lõpetamine (Graduation)
Pulmad (Wedding)
Aastapäev (Anniversary)
Pension (Retirement)

3. Budget Constraint Handling

Tests price signal detection and budget filtering.

Budget Signal	Example	Expected Behavior
Low (under 10€)	"Odav kingitus alla 10 euro"	Filter to budget range
Mid (10-30€)	"Eelarve 15 eurot"	Apply budget constraint
High (30-50€)	"Kuni 45 euro"	Quality items within budget
Price adjectives	"Odav" / "Kallis"	Interpret as budget signal
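
As a rough illustration of how the signals in this table could be detected, the sketch below pulls an upper bound out of phrases such as "alla 10 euro" or "Eelarve 15 eurot" and falls back to the price adjectives. `BudgetSignal` and `extractBudget` are hypothetical names, and treating every stated amount as a maximum is an assumption.

```typescript
interface BudgetSignal {
  max?: number;          // upper bound in EUR, when a number is given
  hint?: "low" | "high"; // from a price adjective when no number is present
}

function extractBudget(query: string): BudgetSignal | null {
  const q = query.toLowerCase();
  // A number followed by a euro marker: "10 euro", "15 eurot", "45€", "25 eur".
  const amount = /(\d+(?:[.,]\d+)?)\s*(?:€|eur)/.exec(q);
  if (amount) return { max: parseFloat(amount[1].replace(",", ".")) };
  // Price adjectives from the table. "kallis" can also mean "dear",
  // so a real detector would need more context than this.
  if (/(^|\s)odav/.test(q)) return { hint: "low" };
  if (/(^|\s)kallis/.test(q)) return { hint: "high" };
  return null;
}
```

An explicit amount takes precedence over an adjective, so "Odav kingitus alla 10 euro" yields a numeric limit rather than a vague hint.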

4. Gender-Aware Recommendations

Tests gender detection from Estonian morphology.

Gender	Indicators	Product Type Tendency
Male (poiss/mees)	"-le poisile", "-le mehele"	Tehnika, Mängud
Female (tüdruk/naine)	"-le tüdrukule", "-le naisele"	Ilu ja stiil, Kingitused
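
A minimal sketch of stem-based gender detection, using only the word forms from the table. The stem list and `detectGender` are illustrative assumptions; prefix matching of this kind will also fire on unrelated compounds (for example "meeskond"), which a production detector would have to guard against.

```typescript
type Gender = "male" | "female";

// Stems derived from the table above; inflected forms share these prefixes.
const GENDER_STEMS: Array<{ stem: string; gender: Gender }> = [
  { stem: "pois", gender: "male" },     // poiss, poisile
  { stem: "mees", gender: "male" },     // mees
  { stem: "mehe", gender: "male" },     // mehele
  { stem: "tüdruk", gender: "female" }, // tüdruk, tüdrukule
  { stem: "naine", gender: "female" },  // naine
  { stem: "naise", gender: "female" },  // naisele
];

// Return the first gender whose stem starts any word in the query.
function detectGender(query: string): Gender | null {
  const words = query.toLowerCase().split(/[\s,.!?;:]+/);
  for (const word of words) {
    for (const { stem, gender } of GENDER_STEMS) {
      if (word.indexOf(stem) === 0) return gender;
    }
  }
  return null;
}
```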

5. Recipient/Relationship Context

Tests relationship-based product selection.

Relationship	Estonian Term	Product Type
Friend	sõber/sõbrale	Kingitused
Colleague	kolleeg/kolleegile	Kontorikaup
Boss	ülemus/ülemusele	Kontorikaup
Sibling	vend/õde	Kingitused
Partner	kallim	Kodu ja aed, Ilu ja stiil
Teacher	õpetaja	Kontorikaup

6. Hobby/Interest Detection

Tests hobby keyword recognition and mapping.

const HOBBY_MAPPINGS = {
  'lugeda': 'Raamat',           // Reading
  'fotograafia': 'Tehnika',     // Photography
  'treenimas': 'Tehnika',       // Fitness
  'kohvi': 'Joodav ja söödav',  // Coffee
  'lauamäng': 'Mängud',         // Board games
  'reisida': 'Kingitused',      // Travel
  'tee': 'Kodu ja aed',         // Tea
  'käsitöö': 'Kontorikaup',     // Crafts
  'jooga': 'Sport ja harrastused' // Yoga
};
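
One way such a mapping could be applied to a query is sketched below. The lookup function and its matching rule are assumptions, not the system's actual logic: longer keywords match as word prefixes so inflected forms are caught, while short keywords such as 'tee' must match a whole word to avoid firing inside unrelated words.

```typescript
// Same keyword table as above, repeated so the sketch is self-contained.
const HOBBY_MAPPINGS: Record<string, string> = {
  "lugeda": "Raamat",
  "fotograafia": "Tehnika",
  "treenimas": "Tehnika",
  "kohvi": "Joodav ja söödav",
  "lauamäng": "Mängud",
  "reisida": "Kingitused",
  "tee": "Kodu ja aed",
  "käsitöö": "Kontorikaup",
  "jooga": "Sport ja harrastused",
};

// Return the product type for the first hobby keyword found, or null.
function mapHobbyToProductType(query: string): string | null {
  const words = query.toLowerCase().split(/[\s,.!?;:]+/);
  for (const word of words) {
    for (const keyword of Object.keys(HOBBY_MAPPINGS)) {
      const matches = keyword.length <= 3
        ? word === keyword             // short keys: exact word only
        : word.indexOf(keyword) === 0; // longer keys: allow inflected forms
      if (matches) return HOBBY_MAPPINGS[keyword];
    }
  }
  return null;
}
```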

7. Negation/Constraint Handling

Tests the unified NegationService, which detects the product types that users want excluded from search results.

NegationService Architecture

Negation Test Suite Location

qa-surface/test-negation-service.ts

Run Commands:

# Run all negation tests
npx tsx qa-surface/test-negation-service.ts

# Run specific scenario
npx tsx qa-surface/test-negation-service.ts --scenario=11

# Run by tag (estonian, english, reverse, regression)
npx tsx qa-surface/test-negation-service.ts --tag=estonian
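
The `--scenario` and `--tag` flags imply a filtering step along these lines. This is a sketch under assumptions: `selectScenarios` is a hypothetical helper, and the actual script may parse its arguments differently.

```typescript
interface TaggedScenario {
  scenarioNumber: number;
  tags: string[];
}

// Narrow the scenario list according to --scenario=N and --tag=NAME arguments.
// Unrecognised arguments are ignored; with no flags, every scenario is kept.
function selectScenarios<T extends TaggedScenario>(all: T[], argv: string[]): T[] {
  let selected = all;
  for (const arg of argv) {
    const match = /^--(scenario|tag)=(.+)$/.exec(arg);
    if (!match) continue;
    const value = match[2];
    if (match[1] === "scenario") {
      const wanted = parseInt(value, 10);
      selected = selected.filter((s) => s.scenarioNumber === wanted);
    } else {
      selected = selected.filter((s) => s.tags.indexOf(value) !== -1);
    }
  }
  return selected;
}
```

In a Node script this would typically be called as `selectScenarios(scenarios, process.argv.slice(2))`.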

Estonian Negation Patterns

The deterministic detector supports three primary Estonian negation constructs:

Pattern	Trigger Words	Example	Detected Type
mitte	mitte, kindlasti mitte	"mitte raamatuid"	Raamat
ei taha	ei taha, ei soovi	"ei taha mänge"	Mängud
ilma	ilma...ta	"ilma tehnikata"	Tehnika
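
A deterministic detector for the three constructs in this table could be sketched as below. It is not the NegationService implementation: the alias map is a small illustrative subset, and `detectEstonianExclusions` is a hypothetical name. The sketch works clause by clause and also checks the word before "ei taha" / "ei soovi", so object-first phrasings such as "Raamatuid ei taha" are caught.

```typescript
// Illustrative subset of inflected forms mapped to product types.
const ET_ALIASES = new Map<string, string>([
  ["raamatuid", "Raamat"],
  ["mänge", "Mängud"],
  ["tehnikat", "Tehnika"],
  ["tehnika", "Tehnika"],
]);

function detectEstonianExclusions(query: string): string[] {
  const excluded: string[] = [];
  const add = (word: string | undefined): void => {
    const type = word === undefined ? undefined : ET_ALIASES.get(word);
    if (type !== undefined && excluded.indexOf(type) === -1) excluded.push(type);
  };
  // Work clause by clause so a negation never reaches across a comma.
  for (const clause of query.toLowerCase().split(/[,.;!?]+/)) {
    const words = clause.split(/\s+/).filter((w) => w.length > 0);
    for (let i = 0; i < words.length; i++) {
      const next = words[i + 1];
      // "mitte raamatuid"
      if (words[i] === "mitte") add(next);
      if (words[i] === "ei" && (next === "taha" || next === "soovi")) {
        add(words[i + 2]); // object after the verb: "ei taha tehnikat"
        add(words[i - 1]); // object before the verb: "raamatuid ei taha"
      }
      // "ilma tehnikata": strip the "-ta" ending before the lookup
      if (words[i] === "ilma" && next !== undefined && next.slice(-2) === "ta") {
        add(next.slice(0, -2));
      }
    }
  }
  return excluded;
}
```

Splitting on clause boundaries first matters: in "Otsin kingitust, ei taha tehnikat" the word before "ei" belongs to a different clause and must not be treated as an excluded item.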

Standard Word Order:

"Kingitus emale, mitte raamatuid"     → Excludes: Raamat
"Otsin kingitust, ei taha tehnikat"   → Excludes: Tehnika

Reverse Word Order (Estonian-specific):

"Raamatuid ei taha"                   → Excludes: Raamat
"Mänge ei soovi"                      → Excludes: Mängud

English Negation Patterns

Pattern	Example	Detected Type
no	"no books please"	Raamat
not	"not electronics"	Tehnika
without	"without games"	Mängud
don't want	"don't want books"	Raamat
avoid	"avoid cosmetics"	Ilu ja stiil
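
The English patterns lend themselves to a single regular expression. The sketch below is an assumption about how such matching could work, with an alias map limited to the words used in the table. It only inspects the word immediately after the trigger, so a phrasing such as "no more books" would be missed.

```typescript
// Illustrative subset of English aliases mapped to product types.
const EN_ALIASES = new Map<string, string>([
  ["book", "Raamat"],
  ["books", "Raamat"],
  ["electronics", "Tehnika"],
  ["games", "Mängud"],
  ["cosmetics", "Ilu ja stiil"],
]);

function detectEnglishExclusions(query: string): string[] {
  // Trigger word(s) followed by the excluded item.
  const pattern = /\b(?:no|not|without|don't want|avoid)\s+(\w+)/g;
  const excluded: string[] = [];
  const q = query.toLowerCase();
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(q)) !== null) {
    const type = EN_ALIASES.get(m[1]);
    if (type !== undefined && excluded.indexOf(type) === -1) excluded.push(type);
  }
  return excluded;
}
```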

All 12 Product Types Supported

Product Type	Estonian Aliases	English Aliases
Raamat	raamat, raamatu, raamatuid	book, books
Mängud	mäng, mängu, mänge	game, games, toys
Tehnika	tehnika, tehnikat, elektroonikat	electronics, tech
Ilu ja stiil	kosmeetika, ilu, iluasi	beauty, cosmetics
Joodav ja söödav	söök, jook, toitu	food, drink
Kontorikaup	kontor, kontoritarbed	office, stationery
Sport ja harrastused	sport, sporditarbed	sports, fitness
Film	film, filme, dvd	film, movie, dvd
Muusika	muusika, plaat, vinüül	music, vinyl
Kinkekaart	kinkekaart, kinkekaarte	gift card, voucher
Kodu ja aed	kodu, aed, kodutarbed	home, garden, decor
Kingitused	kingitus, kingitusi	gift, gifts

Negation Scenario Coverage

Category	Scenarios	Description
Estonian "mitte"	4	Standard "not X" patterns
Estonian "ei taha"	4	"Don't want X" patterns
Estonian "ilma"	2	"Without X" patterns
English patterns	4	no/not/without/don't want
Reverse word order	2	"Raamatuid ei taha" style
Complex (multi-context)	3	Age + negation, occasion + negation
No negation	2	Verify no false positives
Regression	2	Known problematic patterns
Total	21	100% pass rate

The per-category counts add up to 23 rather than 21, which is consistent with some scenarios being counted in more than one category (scenario 11, for example, is tagged both ei-taha and reverse).

Validation Criteria

Each negation scenario validates:

Exclusion Detection — Product type correctly identified for exclusion
Search Filtering — Excluded types do NOT appear in results
Constraint Generation — exclude_product_type:X constraints created
Minimum Products — Alternative products still returned

// Validation logic
if (scenario.expectedNegation.mustNotIncludeTypes) {
  for (const excludedType of scenario.expectedNegation.mustNotIncludeTypes) {
    const found = productTypes.some(pt => 
      pt.toLowerCase() === excludedType.toLowerCase()
    );
    if (found) {
      // FAIL: Excluded type found in results
      allPassed = false;
    }
  }
}
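
The same check can be written as a small standalone function that also reports which excluded types leaked into the results, which makes failure messages more useful. `findExclusionViolations` is a hypothetical helper, not part of the test suite.

```typescript
// Return every excluded type that still appears among the returned product types.
// Comparison is case-insensitive, matching the validation logic above.
function findExclusionViolations(returnedTypes: string[], mustNotInclude: string[]): string[] {
  const returned = returnedTypes.map((t) => t.toLowerCase());
  return mustNotInclude.filter((t) => returned.indexOf(t.toLowerCase()) !== -1);
}

// Example: a result set that wrongly contains a book.
const violations = findExclusionViolations(["Kingitused", "raamat"], ["Raamat"]);
const scenarioPassed = violations.length === 0; // false here
```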

Example Scenario Structure

{
  scenarioNumber: 11,
  description: 'ET: raamatuid ei taha (reverse)',
  userMessage: 'Raamatuid ei taha, otsin kingitust sõbrale',
  expectedNegation: {
    shouldDetectNegation: true,
    expectedExcludedTypes: ['Raamat'],
    mustNotIncludeTypes: ['Raamat'],
    minProducts: 1
  },
  tags: ['estonian', 'ei-taha', 'reverse']
}

Performance

Metric	Target	Actual
Deterministic detection	< 10ms	~5ms
LLM timeout	500ms max	Non-blocking
Full detection	< 550ms	~5ms (deterministic)

Backward Compatibility

The NegationService maintains compatibility with legacy constraints:

// Legacy constraint (still supported)
constraints.includes('EXCLUDE_BOOKS')

// New granular constraints (generated by NegationService)
'exclude_product_type:Raamat'
'exclude_product_type:Mängud'
'exclude_category:romance'
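
A consumer that has to handle both constraint styles could normalise them first. The sketch below assumes that the legacy `EXCLUDE_BOOKS` flag is equivalent to `exclude_product_type:Raamat`; that equivalence and the helper name are assumptions, not documented behavior.

```typescript
// Assumption: the legacy flag is equivalent to excluding the Raamat (book) type.
const LEGACY_TO_GRANULAR = new Map<string, string>([
  ["EXCLUDE_BOOKS", "exclude_product_type:Raamat"],
]);

// Collect excluded product types from a mixed list of legacy and granular constraints.
function excludedProductTypes(constraints: string[]): string[] {
  const prefix = "exclude_product_type:";
  const types: string[] = [];
  for (const raw of constraints) {
    const mapped = LEGACY_TO_GRANULAR.get(raw);
    const constraint = mapped !== undefined ? mapped : raw;
    // Skip other constraint kinds, e.g. exclude_category:romance.
    if (constraint.indexOf(prefix) !== 0) continue;
    const type = constraint.slice(prefix.length);
    if (type.length > 0 && types.indexOf(type) === -1) types.push(type);
  }
  return types;
}
```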

8. Low-Signal/Vague Query Handling

Tests graceful degradation with minimal context.

Context Level	Example	Expected Behavior
Age only	"Kingitus 9-aastasele"	Use age for product type
Occasion only	"Sünnipäevakingitus"	Use occasion context
Generic	"Lihtne kingiidee"	Default to Kingitused
Hobby only	"Astronoomia huvilisele"	Map hobby to products

9. Multi-Language Support

Tests English and mixed Estonian/English queries.

// Mixed language scenarios
"Gift for traveler, eelarve 60 euro"     // English + Estonian budget
"English sci-fi book for 14-year-old boy" // Full English
"Gift for a developer, no books please"   // English with constraint

Validation Criteria

Primary Validations

Each scenario validates:

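Product Type Match: the response product type is in the scenario's acceptable list
Unknown Type Prevention: with disallowUnknownProductType: true, a product type of "unknown" fails the scenario
Minimum Products: at least N products returned (usually 1-3); the validation code reports a shortfall as a warning rather than a failure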

Secondary Validations (Warnings)

Product count below expected (database coverage issue)
TTFC exceeding thresholds

Validation Code

// Product type validation: prefer the type used in the search, fall back to the logged type
const actualProductType = result.logs.productMetadata?.[0]?.searchParams?.product_type 
  || result.logs.productType;

if (scenario.expectedContext.disallowUnknownProductType
    && (!actualProductType || actualProductType === 'unknown')) {
  // Also catches an empty or missing type, as the message states
  validations.push('❌ ProductType: unknown/empty not allowed');
  allPassed = false;
} else if (scenario.expectedContext.acceptableProductTypes) {
  const match = scenario.expectedContext.acceptableProductTypes.includes(actualProductType);
  if (!match) {
    validations.push(`❌ ProductType: ${actualProductType} not in acceptable list`);
    allPassed = false;
  }
}

// Product count validation (warning only: does not fail the scenario)
if (scenario.expectedContext.minProducts && productCount < scenario.expectedContext.minProducts) {
  validations.push(`⚠️ Products: expected ≥${scenario.expectedContext.minProducts}, got ${productCount}`);
}

Performance Metrics

TTFC (Time To First Character)

Measures responsiveness of the AI system.

Tier	TTFC	Assessment
Fast	under 5s	Excellent
Normal	5-8s	Good
Slow	8-12s	Acceptable
Very Slow	over 12s	Needs optimization

Current Performance Distribution

TTFC under 5s:  18.2% of scenarios
TTFC 5-8s:      45.5% of scenarios
TTFC 8-12s:     27.3% of scenarios
TTFC over 12s:   9.1% of scenarios
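
A distribution like the one above can be derived from raw TTFC samples with a small classifier. The sketch assumes TTFC is recorded in milliseconds and that boundary values fall into the slower tier at 5s and 8s while exactly 12s still counts as "slow"; the source does not specify either detail.

```typescript
type TtfcTier = "fast" | "normal" | "slow" | "verySlow";

// Map a TTFC sample (milliseconds, an assumption) to a tier from the table above.
function classifyTtfc(ttfcMs: number): TtfcTier {
  if (ttfcMs < 5000) return "fast";
  if (ttfcMs < 8000) return "normal";
  if (ttfcMs <= 12000) return "slow";
  return "verySlow";
}

// Percentage of samples per tier, rounded to one decimal place.
function ttfcDistribution(samplesMs: number[]): Record<TtfcTier, number> {
  const counts: Record<TtfcTier, number> = { fast: 0, normal: 0, slow: 0, verySlow: 0 };
  for (const ms of samplesMs) counts[classifyTtfc(ms)] += 1;
  const total = samplesMs.length;
  const pct = (n: number): number =>
    total === 0 ? 0 : Math.round((n / total) * 1000) / 10;
  return {
    fast: pct(counts.fast),
    normal: pct(counts.normal),
    slow: pct(counts.slow),
    verySlow: pct(counts.verySlow),
  };
}
```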

Running the Evaluation

Prerequisites

# Ensure dev server is running
npm run dev

# Server must be accessible at localhost:3000

Execute Test Suite

# Run full evaluation
npx tsx qa-surface/test-gift-context-understanding.ts

# Results saved to test-results/gift-context-{timestamp}/

Output Structure

test-results/gift-context-{timestamp}/
├── summary.json           # Overall results
├── scenario-1.json        # Individual scenario details
├── scenario-2.json
├── ...
└── scenario-99.json

Summary JSON Schema

interface TestSummary {
  timestamp: string;
  totalScenarios: number;
  passed: number;
  failed: number;
  successRate: string;
  results: Array<{
    scenarioNumber: number;
    description: string;
    tags: string[];
    success: boolean;
    intent: string;
    productType: string;
    productsReturned: number;
    ttfc: number;
    failureReason?: string;
  }>;
}
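
For illustration, a summary matching this schema could be assembled from per-scenario results as shown below. `buildSummary` is a hypothetical helper, and the `successRate` format (one decimal place with a percent sign) is an assumption, since the schema only says it is a string.

```typescript
interface ScenarioResult {
  scenarioNumber: number;
  description: string;
  tags: string[];
  success: boolean;
  intent: string;
  productType: string;
  productsReturned: number;
  ttfc: number;
  failureReason?: string;
}

interface TestSummary {
  timestamp: string;
  totalScenarios: number;
  passed: number;
  failed: number;
  successRate: string;
  results: ScenarioResult[];
}

// Aggregate individual results into the summary structure.
function buildSummary(results: ScenarioResult[], now: Date = new Date()): TestSummary {
  const total = results.length;
  const passed = results.filter((r) => r.success).length;
  const rate = total === 0 ? 0 : (passed / total) * 100;
  return {
    timestamp: now.toISOString(),
    totalScenarios: total,
    passed,
    failed: total - passed,
    successRate: rate.toFixed(1) + "%", // assumed format, e.g. "50.0%"
    results,
  };
}
```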

Results Interpretation

Success Criteria

Metric	Target	Current
Overall Pass Rate	95%+	100%
Intent Detection	98%+	100%
Product Type Accuracy	95%+	100%
TTFC under 10s	80%+	90.9%

Product Type Distribution (99 scenarios)

Product Type	Count	%
Kingitused	24	24.2%
Raamat	15	15.2%
Mängud	13	13.1%
Kontorikaup	13	13.1%
Kodu ja aed	12	12.1%
Tehnika	10	10.1%
Joodav ja söödav	5	5.1%
Ilu ja stiil	4	4.0%
Film	2	2.0%
Muusika	1	1.0%

Intent Distribution

Intent	Count	%
general_gift	44	44.4%
product_inquiry	15	15.2%
birthday_gift	6	6.1%
Specific occasion intents	34	34.3%

Adding New Scenarios

Scenario Structure

interface GiftTestScenario {
  scenarioNumber: number;
  description: string;
  userMessage: string;                    // Estonian or English query
  expectedContext: {
    expectedIntent?: string;              // Exact intent match
    acceptableIntents?: string[];         // Any of these intents OK
    expectedProductType?: string;         // Exact product type
    acceptableProductTypes?: string[];    // Any of these types OK
    expectedOccasion?: string;            // Occasion detection
    expectedAge?: number | { min: number; max: number };
    expectedGender?: string;
    expectedBudget?: { max?: number };
    shouldExcludeBooks?: boolean;         // Book exclusion check
    minProducts?: number;                 // Minimum products returned
    mustIncludeConstraints?: string[];    // Required constraints
    disallowUnknownProductType?: boolean; // Fail on "unknown"
  };
  tags: string[];                         // Categorization tags
}

Example: Adding a New Scenario

{
  scenarioNumber: 100,
  description: 'Pet lover gift with budget',
  userMessage: 'Kingitus lemmikloomaarmastajale, eelarve 25 eurot',
  expectedContext: {
    acceptableProductTypes: ['Kingitused', 'Kodu ja aed'],
    expectedBudget: { max: 25 },
    minProducts: 1,
    disallowUnknownProductType: true
  },
  tags: ['hobby-pets', 'budget-low', 'new-scenario']
}

Pipeline Components Validated

1. Intent Detection Layer

Validated Intents:

general_gift, birthday_gift, valentines_day_gift
holiday_gift, mothers_day_gift, fathers_day_gift
housewarming_gift, wedding_gift, graduation_gift
baby_gift, retirement_gift, promotion_gift
product_search, product_inquiry

2. Context Extraction Layer

Extracted Context Fields:

productType — Product category
csv_category — Specific category mapping
budget — Price constraints
age — Recipient age
gender — Recipient gender
occasion — Gift occasion
interests — Hobbies and interests
constraints — Exclusions and requirements

3. Search Layer

Validated Behaviors:

Product type filtering works correctly
Budget constraints applied
Exclusion filters (no books, no games) work
Minimum product count returned

4. Response Generation

Validated Behaviors:

SSE streaming functional
Product metadata events emitted
Headers contain intent and product type
TTFC within acceptable range

Continuous Improvement

When to Add Scenarios

New feature added — Add scenarios covering the feature
Bug discovered — Add regression scenario
Edge case found — Add edge case coverage
New product type — Add type-specific scenarios
Language expansion — Add language-specific tests

Maintenance Guidelines

Run before deployment — All 99 scenarios must pass
Review failures — Investigate any regression
Update expectations — Adjust if business logic changes
Track TTFC trends — Monitor performance over time

Related Documentation

Testing Strategy — Overall testing approach
Performance Monitoring — Metrics and benchmarks
Context Guardrails — Context extraction rules
Estonian Best Practices — Language handling

Conclusion

The evaluation framework provides comprehensive coverage of the gift recommendation pipeline:

Test Suites

Suite	Scenarios	Pass Rate	Focus
Gift Context Understanding	99	100%	Full pipeline validation
Negation Service	21	100%	Exclusion detection
Total	120	100%	End-to-end coverage

Key Capabilities Validated

✅ 100% pass rate across all scenarios
✅ Full pipeline validation from query to response
✅ Multi-dimensional testing (age, gender, occasion, budget, interest, constraints)
✅ Estonian language excellence with English fallback
✅ Negation/exclusion handling for all 12 product types
✅ Reverse word order Estonian pattern support
✅ Performance monitoring via TTFC metrics

This evaluation framework ensures production-quality AI behavior and catches regressions before deployment.
