Deterministic Demographics Extraction
A regex-based safety net that captures demographic data (age, gender, recipient, hobbies) whenever the query states it in a supported pattern, even when the LLM fails to extract it properly.
Why This Approach Is Needed
The Problem
LLM-based context extraction is unreliable for structured demographic data in Estonian.
Example Query:
"Kingitus 50-aastasele naisele, kes armastab aiandust"
(Gift for a 50-year-old woman who loves gardening)
What the LLM returned:
{
  "intent": "author_search",
  "authorName": "Tolkien",
  "productType": "Raamat"
}
What was expected:
{
  "intent": "general_gift",
  "recipientAge": 50,
  "recipientGender": "female",
  "recipient": "naine",
  "hobbies": ["aiandus"]
}
The LLM:
- ❌ Hallucinated "Tolkien" as an author (possibly from "armastab" → love → romance?)
- ❌ Returned author_search intent instead of general_gift
- ❌ Completely ignored explicit demographics: age=50, gender=female, hobby=gardening
Root Causes of LLM Failure
| Issue | Description |
|---|---|
| Estonian Morphology | Complex case system - "naisele" (dative) vs "naine" (nominative) confuses models |
| Semantic Optimization | LLMs optimize for meaning, not structured extraction |
| Speed vs Accuracy | Fast models (Llama) prioritize speed over precision |
| Prompt Limitations | Instructions are suggestions, not guarantees |
Why Regex Instead of Another LLM Call
| Factor | Regex | LLM |
|---|---|---|
| Reliability | 100% deterministic | Probabilistic |
| Speed | ~1-2ms | 200-2000ms |
| Hallucination | Impossible | Common |
| Testability | Easy unit tests | Complex |
| Maintenance | Add patterns | Prompt engineering |
LLM vs Deterministic: Side-by-Side Comparison
Architecture
High-Level Flow
Pattern Matching Pipeline
Integration Sequence
Supported Patterns
Age Extraction (1-120 years)
Estonian Patterns
// Dative: "50-aastasele", "50 aastasele"
/(\d{1,3})\s*-?\s*aastase?l?e?/i
// Nominative: "50-aastane"
/(\d{1,3})\s*-?\s*aastane\b/i
// Descriptive: "50 aastat vana"
/(\d{1,3})\s*-?\s*aasta(?:t|ne)?\s*van/i
// Locative: "vanuses 50"
/vanuse?s?\s*(\d{1,3})/i
English Patterns
// Standard: "50 year old", "50-year-old"
/(\d{1,3})\s*-?\s*years?\s*-?\s*old/i
// Prefix: "age 50", "aged 50" (the leading \b keeps "page 50" from matching)
/\baged?\s*(\d{1,3})/i
// Informal: "50 yo", "50y/o"
/(\d{1,3})\s*y(?:\/)?o\b/i
// Contextual: "turning 50", "who is 50"
/\bturn(?:s|ing)\s*(\d{1,3})/i
/who(?:'s| is)\s*(\d{1,3})/i
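The patterns capture up to three digits, so the 1-120 limit in the heading has to be enforced after matching. A minimal sketch of how that could look, using a few of the patterns above; `AGE_PATTERNS` and `extractAge` are hypothetical names, not the project's actual code:

```typescript
// Hypothetical sketch: try each pattern in order, keep the first plausible age.
const AGE_PATTERNS: RegExp[] = [
  /(\d{1,3})\s*-?\s*aastase?l?e?/i,         // "50-aastasele"
  /(\d{1,3})\s*-?\s*aastane\b/i,            // "50-aastane"
  /(\d{1,3})\s*-?\s*aasta(?:t|ne)?\s*van/i, // "50 aastat vana"
  /(\d{1,3})\s*-?\s*years?\s*-?\s*old/i,    // "50-year-old"
  /\baged?\s*(\d{1,3})/i,                   // "aged 50"
];

function extractAge(text: string): number | undefined {
  for (const pattern of AGE_PATTERNS) {
    const match = pattern.exec(text);
    if (!match) continue;
    const age = Number(match[1]);
    // \d{1,3} also accepts 0 and 121-999, so range-check explicitly.
    if (Number.isInteger(age) && age >= 1 && age <= 120) return age;
  }
  return undefined;
}
```

With this sketch, the query from the problem statement yields 50, while "a 500 year old oak" yields no age at all.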
Age Ranges
// Estonian: "7-9 aastasele", "7 kuni 9 aastasele"
/(\d{1,3})\s*-\s*(\d{1,3})\s*-?\s*aastase/i
/(\d{1,3})\s+kuni\s+(\d{1,3})\s*-?\s*aastase/i
// English: "7-9 years old", "between 7 and 9"
/(\d{1,3})\s*-\s*(\d{1,3})\s*years?\s*old/i
/between\s+(\d{1,3})\s+and\s+(\d{1,3})/i
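Range patterns need to run before the single-age patterns: in "7-9 aastasele" the dative single-age pattern would otherwise match "9 aastasele" and report 9. A hedged sketch of range extraction with a midpoint; the names and the rounding rule are assumptions for illustration:

```typescript
interface AgeRange {
  min: number;
  max: number;
}

const RANGE_PATTERNS: RegExp[] = [
  /(\d{1,3})\s*-\s*(\d{1,3})\s*-?\s*aastase/i,
  /(\d{1,3})\s+kuni\s+(\d{1,3})\s*-?\s*aastase/i,
  /(\d{1,3})\s*-\s*(\d{1,3})\s*years?\s*old/i,
  /between\s+(\d{1,3})\s+and\s+(\d{1,3})/i,
];

function extractAgeRange(text: string): AgeRange | undefined {
  for (const pattern of RANGE_PATTERNS) {
    const match = pattern.exec(text);
    if (!match) continue;
    const a = Number(match[1]);
    const b = Number(match[2]);
    const min = Math.min(a, b); // tolerate reversed input such as "9-7"
    const max = Math.max(a, b);
    if (min >= 1 && max <= 120) return { min, max };
  }
  return undefined;
}

// Assumption: the representative age is the rounded midpoint of the range.
function midpoint(range: AgeRange): number {
  return Math.round((range.min + range.max) / 2);
}
```

A caller would try `extractAgeRange` first and fall back to single-age extraction only when it returns `undefined`.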
Gender Detection
Female Patterns
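// Estonian dative (Unicode-aware boundaries: \b is ASCII-only, so "õele" after a space would not match)
/(?<!\p{L})(naisele|emale|tüdrukule|tüttrele|sõbrannale|vanaemale|tädile|õele)(?!\p{L})/iu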
// English
/\b(woman|female|girl|mother|mom|grandmother|sister|aunt|wife|girlfriend)\b/i
// Contextual
/\b(she|her)\s+(loves?|enjoys?|likes?)\b/i
Male Patterns
// Estonian dative
/\b(mehele|isale|poisile|pojale|sõbrale|vanaisale|onule|vennale)\b/i
// English
/\b(man|male|boy|father|dad|grandfather|brother|uncle|husband|boyfriend)\b/i
// Contextual
/\b(he|him)\s+(loves?|enjoys?|likes?)\b/i
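JavaScript's `\b` only recognizes ASCII word characters, so an Estonian term that begins or ends with a letter such as õ (for example "õele") is not delimited by it. The sketch below shows one way to build whole-word matchers with Unicode property escapes instead. `wholeWord` and `detectGender` are hypothetical helpers, the code assumes a runtime with ES2018 regex support, and returning `undefined` on conflicting matches is an illustrative choice:

```typescript
// Hypothetical helper: whole-word matcher that treats any Unicode letter as
// part of a word, unlike \b, which only knows [A-Za-z0-9_].
function wholeWord(words: string[]): RegExp {
  return new RegExp(`(?<!\\p{L})(?:${words.join("|")})(?!\\p{L})`, "iu");
}

const FEMALE_ET = wholeWord([
  "naisele", "emale", "tüdrukule", "tüttrele",
  "sõbrannale", "vanaemale", "tädile", "õele",
]);
const MALE_ET = wholeWord([
  "mehele", "isale", "poisile", "pojale",
  "sõbrale", "vanaisale", "onule", "vennale",
]);

function detectGender(text: string): "female" | "male" | undefined {
  const female = FEMALE_ET.test(text);
  const male = MALE_ET.test(text);
  // Neither side matched, or both did: leave the decision to other signals.
  if (female === male) return undefined;
  return female ? "female" : "male";
}
```

For "Kingitus õele" this returns "female", whereas `/\b(õele)\b/i` finds no match in the same string.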
Recipient Mapping
Maps relationship terms to normalized recipient strings:
| Estonian | English | Normalized |
|---|---|---|
| emale, mamale | for my mother, for mom | ema |
| isale, papale | for my father, for dad | isa |
| vanaemale | for grandmother, for grandma | vanaema |
| naisele | for my wife, for a woman | naine |
| sõbrale | for a friend, for my buddy | sõber |
| kolleegile | for a colleague, for coworker | kolleeg |
| õpetajale | for a teacher, for professor | õpetaja |
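The table can be read as a lookup from surface forms to normalized strings. A minimal sketch under that reading, with the table rows copied as data; `RECIPIENT_TERMS` and `detectRecipient` are hypothetical names, and literal phrase matching is an assumption (the real extractor evidently matches more flexibly, since "for my 65-year-old father" resolves to isa in the examples):

```typescript
// Rows of the table above: [normalized recipient, surface forms].
const RECIPIENT_TERMS: Array<[string, string[]]> = [
  ["ema", ["emale", "mamale", "for my mother", "for mom"]],
  ["isa", ["isale", "papale", "for my father", "for dad"]],
  ["vanaema", ["vanaemale", "for grandmother", "for grandma"]],
  ["naine", ["naisele", "for my wife", "for a woman"]],
  ["sõber", ["sõbrale", "for a friend", "for my buddy"]],
  ["kolleeg", ["kolleegile", "for a colleague", "for coworker"]],
  ["õpetaja", ["õpetajale", "for a teacher", "for professor"]],
];

function detectRecipient(text: string): string | undefined {
  for (const [normalized, terms] of RECIPIENT_TERMS) {
    // Letter-aware boundaries keep "emale" from matching inside "vanaemale".
    const pattern = new RegExp(
      `(?<!\\p{L})(?:${terms.join("|")})(?!\\p{L})`,
      "iu",
    );
    if (pattern.test(text)) return normalized;
  }
  return undefined;
}
```

The boundary check matters for ordering: without it, "Kingitus vanaemale" would hit the ema row first.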
Hobby Detection
Maps hobby keywords to normalized hobby strings:
| Category | Estonian Keywords | English Keywords | Normalized |
|---|---|---|---|
| Gardening | aiandus, aed, taim, lill | garden, plant, flower | aiandus |
| Cooking | kokandus, toit, küpset | cook, bake, culinary | kokandus |
| Reading | lugemine, raamat | read, book, literature | lugemine |
| Sports | sport, jooks, treening | sports, fitness, gym | sport |
| Music | muusika, laul, pill | music, sing, instrument | muusika |
| Crafts | käsitöö, kudum | craft, knit, sewing | käsitöö |
| Technology | tehnoloogia, arvuti | tech, computer, coding | tehnoloogia |
| Travel | reisi, matka | travel, hiking | reisimine |
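Several keywords in the table look like stems ("küpset", "kudum", "reisi"), which suggests substring matching. The following sketch assumes exactly that; `HOBBY_KEYWORDS` and `detectHobbies` are hypothetical names and the real matching strategy may differ:

```typescript
// Rows of the table above: [normalized hobby, Estonian and English keywords].
const HOBBY_KEYWORDS: Array<[string, string[]]> = [
  ["aiandus", ["aiandus", "aed", "taim", "lill", "garden", "plant", "flower"]],
  ["kokandus", ["kokandus", "toit", "küpset", "cook", "bake", "culinary"]],
  ["lugemine", ["lugemine", "raamat", "read", "book", "literature"]],
  ["sport", ["sport", "jooks", "treening", "sports", "fitness", "gym"]],
  ["muusika", ["muusika", "laul", "pill", "music", "sing", "instrument"]],
  ["käsitöö", ["käsitöö", "kudum", "craft", "knit", "sewing"]],
  ["tehnoloogia", ["tehnoloogia", "arvuti", "tech", "computer", "coding"]],
  ["reisimine", ["reisi", "matka", "travel", "hiking"]],
];

function detectHobbies(text: string): string[] {
  const lower = text.toLowerCase();
  const found: string[] = [];
  for (const [normalized, keywords] of HOBBY_KEYWORDS) {
    // One hit per category is enough; categories stay in table order.
    if (keywords.some((keyword) => lower.includes(keyword))) {
      found.push(normalized);
    }
  }
  return found;
}
```

Substring matching handles inflected forms such as "aiandust" for free, but it also produces false positives: in this sketch "fresh bread" maps to lugemine because "bread" contains "read". Short keywords may therefore need word-boundary checks.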
Integration Points
Main Extractor (main-extractor.ts)
// After LLM extraction, apply deterministic demographics
const deterministicDemographics = extractDeterministicDemographics(userMessage, debug);
if (deterministicDemographics.extracted) {
  mergeDemographicsIntoContext(normalized, deterministicDemographics, { debug });
  // Fix hallucinated author_search intent
  if (normalized.intent === 'author_search' && hasGiftKeyword) {
    normalized.intent = 'general_gift';
    normalized.authorName = undefined;
  }
}
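The snippet does not show how `hasGiftKeyword` is computed. The sketch below isolates the intent correction as a pure function and assumes the flag is a simple keyword test on the user message; `GIFT_KEYWORD`, `IntentContext`, and `fixHallucinatedIntent` are hypothetical, and the keyword list is taken only from the example queries in this document:

```typescript
interface IntentContext {
  intent: string;
  authorName?: string | undefined;
}

// Assumption: "kingitus" / "gift" are enough to signal a gift query.
const GIFT_KEYWORD = /kingitus|gift/i;

function fixHallucinatedIntent(
  context: IntentContext,
  userMessage: string,
  demographicsExtracted: boolean,
): IntentContext {
  const hasGiftKeyword = GIFT_KEYWORD.test(userMessage);
  // Only override the LLM when the query both mentions a gift and
  // carries demographics that the deterministic pass could extract.
  if (demographicsExtracted && context.intent === "author_search" && hasGiftKeyword) {
    return { ...context, intent: "general_gift", authorName: undefined };
  }
  return context;
}
```

Applied to the failing example from the problem statement, the Tolkien author search becomes a general_gift intent, while a genuine author query without demographics passes through unchanged.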
Fast Classifier (classifier-context.ts)
The same pattern is applied to the fast classifier's results, so demographics are captured on the fast path as well.
API Reference
extractDeterministicDemographics()
function extractDeterministicDemographics(
  userMessage: string,
  debug?: boolean
): DemographicsResult
interface DemographicsResult {
  recipientAge?: number;
  recipientAgeRange?: { min: number; max: number };
  ageGroup?: 'child' | 'teen' | 'adult' | 'elderly';
  ageBracket?: AgeBracket;
  recipientGender?: 'male' | 'female' | 'unisex' | 'unknown';
  recipient?: string;
  hobbies?: string[];
  extracted: boolean;
}
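The interface does not say how `ageGroup` is derived from `recipientAge`. As an illustration only, here is a sketch with assumed cut-offs; the thresholds are hypothetical and were chosen merely to agree with the examples in this document (7 and 8 map to child, 65 maps to elderly):

```typescript
type AgeGroup = "child" | "teen" | "adult" | "elderly";

// Hypothetical thresholds: the real boundaries are not documented here.
function toAgeGroup(age: number): AgeGroup {
  if (age < 13) return "child";
  if (age < 20) return "teen";
  if (age < 65) return "adult";
  return "elderly";
}
```

Whatever the actual boundaries are, keeping them in one small function like this makes them straightforward to unit test.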
mergeDemographicsIntoContext()
function mergeDemographicsIntoContext(
  context: GiftContext,
  demographics: DemographicsResult,
  options?: { overwrite?: boolean; debug?: boolean }
): void
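The signature leaves the merge semantics open. One plausible reading, sketched below, is that extracted values fill missing fields and replace existing ones only when `overwrite` is set. The default of `false`, the hobby union, and the names `DemographicFields` and `mergeSketch` are assumptions, not the documented behavior:

```typescript
interface DemographicFields {
  recipientAge?: number;
  recipientGender?: string;
  recipient?: string;
  hobbies?: string[];
}

function mergeSketch(
  context: DemographicFields,
  found: DemographicFields,
  options: { overwrite?: boolean } = {},
): void {
  const overwrite = options.overwrite ?? false; // assumed default
  if (found.recipientAge !== undefined && (overwrite || context.recipientAge === undefined)) {
    context.recipientAge = found.recipientAge;
  }
  if (found.recipientGender !== undefined && (overwrite || context.recipientGender === undefined)) {
    context.recipientGender = found.recipientGender;
  }
  if (found.recipient !== undefined && (overwrite || context.recipient === undefined)) {
    context.recipient = found.recipient;
  }
  if (found.hobbies !== undefined && found.hobbies.length > 0) {
    // Assumption: hobbies are unioned rather than replaced, in first-seen order.
    const combined = [...(context.hobbies ?? []), ...found.hobbies];
    context.hobbies = Array.from(new Set(combined));
  }
}
```

Like the documented function, the sketch mutates the context in place and returns nothing.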
Examples
Example 1: Child Gift (Estonian)
Input: "Kingitus 7-aastasele tüdrukule"
Output: {
  recipientAge: 7,
  ageGroup: 'child',
  ageBracket: 'school_age',
  recipientGender: 'female',
  recipient: 'tütar'
}
Example 2: Adult Gift (English)
Input: "Gift for my 65-year-old father who loves fishing"
Output: {
  recipientAge: 65,
  ageGroup: 'elderly',
  recipientGender: 'male',
  recipient: 'isa',
  hobbies: ['kalapüük']
}
Example 3: Age Range (Estonian)
Input: "Kingitus 7-9 aastasele lapsele"
Output: {
  recipientAge: 8, // midpoint
  recipientAgeRange: { min: 7, max: 9 },
  ageGroup: 'child',
  recipient: 'laps'
}
Related Documentation
- Context Extraction - LLM-based extraction
- Fast Classifier - Quick intent classification
- Routing Pipeline - Query routing with gift context guard