Deterministic Demographics Extraction

A regex-based safety net that captures demographic data (age, gender, recipient, hobbies) whenever the message matches a known pattern, even when the LLM fails to extract it.

Why This Approach Is Needed

The Problem

LLM-based context extraction is unreliable for structured demographic data in Estonian.

Example Query:

"Kingitus 50-aastasele naisele, kes armastab aiandust"
(Gift for a 50-year-old woman who loves gardening)

What the LLM returned:

{
  "intent": "author_search",
  "authorName": "Tolkien",
  "productType": "Raamat"
}

What was expected:

{
  "intent": "general_gift",
  "recipientAge": 50,
  "recipientGender": "female",
  "recipient": "naine",
  "hobbies": ["aiandus"]
}

The LLM:

  • ❌ Hallucinated "Tolkien" as an author (possibly from "armastab" → love → romance?)
  • ❌ Returned author_search intent instead of general_gift
  • ❌ Completely ignored explicit demographics: age=50, gender=female, hobby=gardening

Root Causes of LLM Failure

| Issue | Description |
| --- | --- |
| Estonian Morphology | Complex case system: "naisele" (dative) vs "naine" (nominative) confuses models |
| Semantic Optimization | LLMs optimize for meaning, not structured extraction |
| Speed vs Accuracy | Fast models (Llama) prioritize speed over precision |
| Prompt Limitations | Instructions are suggestions, not guarantees |

Why Regex Instead of Another LLM Call

| Factor | Regex | LLM |
| --- | --- | --- |
| Reliability | 100% deterministic | Probabilistic |
| Speed | ~1-2ms | 200-2000ms |
| Hallucination | Impossible | Common |
| Testability | Easy unit tests | Complex |
| Maintenance | Add patterns | Prompt engineering |

LLM vs Deterministic: Side-by-Side Comparison

Architecture

High-Level Flow

Pattern Matching Pipeline

Integration Sequence

Supported Patterns

Age Extraction (1-120 years)

Estonian Patterns

// Dative: "50-aastasele", "50 aastasele"
/(\d{1,3})\s*-?\s*aastase?l?e?/i

// Nominative: "50-aastane"
/(\d{1,3})\s*-?\s*aastane\b/i

// Descriptive: "50 aastat vana"
/(\d{1,3})\s*-?\s*aasta(?:t|ne)?\s*van/i

// Locative: "vanuses 50"
/vanuse?s?\s*(\d{1,3})/i

English Patterns

// Standard: "50 year old", "50-year-old"
/(\d{1,3})\s*-?\s*years?\s*-?\s*old/i

// Prefix: "age 50", "aged 50" (\b keeps "page 50" from matching)
/\baged?\s*(\d{1,3})/i

// Informal: "50 yo", "50y/o"
/(\d{1,3})\s*y\/?o\b/i

// Contextual: "turning 50", "who is 50" (\b keeps "returns 50" from matching)
/\bturn(?:s|ing)\s*(\d{1,3})/i
/\bwho(?:'s| is)\s*(\d{1,3})/i
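
One way to apply the single-age patterns is to try them in order and keep the first capture that falls inside the documented 1-120 range. The sketch below is illustrative only: `AGE_PATTERNS` and `extractAge` are hypothetical names rather than the project's implementation, and the pattern order is an assumption. The `\b` anchors on the prefix forms keep words such as "page 50" from being read as an age.

```typescript
// Hypothetical helper, not the project's code: first in-range capture wins.
const AGE_PATTERNS: RegExp[] = [
  /(\d{1,3})\s*-?\s*aastase?l?e?/i,         // "50-aastasele"
  /(\d{1,3})\s*-?\s*aastane\b/i,            // "50-aastane"
  /(\d{1,3})\s*-?\s*aasta(?:t|ne)?\s*van/i, // "50 aastat vana"
  /vanuse?s?\s*(\d{1,3})/i,                 // "vanuses 50"
  /(\d{1,3})\s*-?\s*years?\s*-?\s*old/i,    // "50-year-old"
  /\baged?\s*(\d{1,3})/i,                   // "age 50", "aged 50"
  /(\d{1,3})\s*y\/?o\b/i,                   // "50 yo", "50y/o"
  /\bturn(?:s|ing)\s*(\d{1,3})/i,           // "turning 50"
  /\bwho(?:'s| is)\s*(\d{1,3})/i,           // "who is 50"
];

function extractAge(message: string): number | undefined {
  for (const pattern of AGE_PATTERNS) {
    const match = pattern.exec(message);
    if (!match) continue;
    const age = parseInt(match[1], 10);
    // Reject captures outside the supported range (for example "500-year-old").
    if (age >= 1 && age <= 120) return age;
  }
  return undefined;
}
```

The range check matters because `\d{1,3}` happily captures values such as 500; without it, "a 500-year-old tree" would yield a recipient age.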

Age Ranges

// Estonian: "7-9 aastasele", "7 kuni 9 aastasele"
/(\d{1,3})\s*-\s*(\d{1,3})\s*-?\s*aastase/i
/(\d{1,3})\s+kuni\s+(\d{1,3})\s*-?\s*aastase/i

// English: "7-9 years old", "between 7 and 9"
/(\d{1,3})\s*-\s*(\d{1,3})\s*years?\s*old/i
/between\s+(\d{1,3})\s+and\s+(\d{1,3})/i
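
The range patterns can be applied the same way, returning both bounds plus a representative age. The following is a sketch under assumptions: `extractAgeRange` is a hypothetical name, and rounding the midpoint with `Math.round` is a guess that happens to agree with the "7-9" example on this page (midpoint 8).

```typescript
interface AgeRangeResult {
  min: number;
  max: number;
  midpoint: number;
}

const RANGE_PATTERNS: RegExp[] = [
  /(\d{1,3})\s*-\s*(\d{1,3})\s*-?\s*aastase/i,    // "7-9 aastasele"
  /(\d{1,3})\s+kuni\s+(\d{1,3})\s*-?\s*aastase/i, // "7 kuni 9 aastasele"
  /(\d{1,3})\s*-\s*(\d{1,3})\s*years?\s*old/i,    // "7-9 years old"
  /between\s+(\d{1,3})\s+and\s+(\d{1,3})/i,       // "between 7 and 9"
];

// Hypothetical helper: returns the first range whose bounds are both in 1-120.
function extractAgeRange(message: string): AgeRangeResult | undefined {
  for (const pattern of RANGE_PATTERNS) {
    const match = pattern.exec(message);
    if (!match) continue;
    const a = parseInt(match[1], 10);
    const b = parseInt(match[2], 10);
    const min = Math.min(a, b);
    const max = Math.max(a, b);
    if (min < 1 || max > 120) continue;
    // ASSUMPTION: the midpoint is rounded to the nearest whole year.
    return { min, max, midpoint: Math.round((min + max) / 2) };
  }
  return undefined;
}
```

Ranges need to be tried before the single-age patterns: "7-9 aastasele" also satisfies the single-age dative pattern as "9 aastasele", which would otherwise report an age of 9. Note too that "between 20 and 50" can just as easily describe a budget, so a real implementation may want extra context around that pattern.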

Gender Detection

Female Patterns

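// Estonian dative. JavaScript's \b only recognizes ASCII word characters, so it
// fails at the start of "õele"; Unicode-aware lookarounds (u flag) are used instead.
/(?<![\p{L}\p{N}])(naisele|emale|tüdrukule|tütrele|sõbrannale|vanaemale|tädile|õele)(?![\p{L}\p{N}])/iu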

// English
/\b(woman|female|girl|mother|mom|grandmother|sister|aunt|wife|girlfriend)\b/i

// Contextual
/\b(she|her)\s+(loves?|enjoys?|likes?)\b/i

Male Patterns

// Estonian dative
/\b(mehele|isale|poisile|pojale|sõbrale|vanaisale|onule|vennale)\b/i

// English
/\b(man|male|boy|father|dad|grandfather|brother|uncle|husband|boyfriend)\b/i

// Contextual
/\b(he|him)\s+(loves?|enjoys?|likes?)\b/i
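
Combining the two keyword lists into a single decision could look like the sketch below. It is a minimal illustration, not the project's code: `wordRegex` and `detectGender` are hypothetical names, the contextual "she loves" / "he likes" patterns are left out for brevity, and returning `"unknown"` when both lists match is an assumption. Because JavaScript's `\b` treats only ASCII letters, digits and underscore as word characters, the sketch uses Unicode property lookarounds as the word boundary so that a form such as "õele" is found.

```typescript
type Gender = "male" | "female" | "unknown";

// Builds a case-insensitive matcher with Unicode-aware word boundaries:
// no letter or digit may sit directly before or after the keyword.
function wordRegex(words: string[]): RegExp {
  return new RegExp(
    "(?<![\\p{L}\\p{N}])(?:" + words.join("|") + ")(?![\\p{L}\\p{N}])",
    "iu"
  );
}

const FEMALE_WORDS = wordRegex([
  "naisele", "emale", "tüdrukule", "tütrele", "sõbrannale", "vanaemale", "tädile", "õele",
  "woman", "female", "girl", "mother", "mom", "grandmother", "sister", "aunt", "wife", "girlfriend",
]);

const MALE_WORDS = wordRegex([
  "mehele", "isale", "poisile", "pojale", "sõbrale", "vanaisale", "onule", "vennale",
  "man", "male", "boy", "father", "dad", "grandfather", "brother", "uncle", "husband", "boyfriend",
]);

function detectGender(message: string): Gender {
  const female = FEMALE_WORDS.test(message);
  const male = MALE_WORDS.test(message);
  if (female && !male) return "female";
  if (male && !female) return "male";
  // ASSUMPTION: no signal and conflicting signals are both reported as "unknown".
  return "unknown";
}
```

The lookarounds also keep "male" from matching inside "female" and "emale" from matching inside "female" or "vanaemale", which is the job `\b` does for the ASCII-only keywords.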

Recipient Mapping

Maps relationship terms to normalized recipient strings:

| Estonian | English | Normalized |
| --- | --- | --- |
| emale, mamale | for my mother, for mom | ema |
| isale, papale | for my father, for dad | isa |
| vanaemale | for grandmother, for grandma | vanaema |
| naisele | for my wife, for a woman | naine |
| sõbrale | for a friend, for my buddy | sõber |
| kolleegile | for a colleague, for coworker | kolleeg |
| õpetajale | for a teacher, for professor | õpetaja |
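
The mapping above lends itself to a table-driven lookup. The sketch below is one possible shape, with `RECIPIENT_FORMS` and `normalizeRecipient` as hypothetical names. It matches the listed surface forms literally, which is an assumption: a phrase with words in between, such as "for my 65-year-old father", would need a looser pattern than the one shown here.

```typescript
// Surface form -> normalized recipient, taken from the table rows.
const RECIPIENT_FORMS: Array<[string, string]> = [
  ["emale", "ema"], ["mamale", "ema"], ["for my mother", "ema"], ["for mom", "ema"],
  ["isale", "isa"], ["papale", "isa"], ["for my father", "isa"], ["for dad", "isa"],
  ["vanaemale", "vanaema"], ["for grandmother", "vanaema"], ["for grandma", "vanaema"],
  ["naisele", "naine"], ["for my wife", "naine"], ["for a woman", "naine"],
  ["sõbrale", "sõber"], ["for a friend", "sõber"], ["for my buddy", "sõber"],
  ["kolleegile", "kolleeg"], ["for a colleague", "kolleeg"], ["for coworker", "kolleeg"],
  ["õpetajale", "õpetaja"], ["for a teacher", "õpetaja"], ["for professor", "õpetaja"],
];

// Hypothetical helper: returns the normalized recipient for the first form found.
function normalizeRecipient(message: string): string | undefined {
  for (const [form, normalized] of RECIPIENT_FORMS) {
    // Unicode-aware boundaries so "emale" does not match inside "vanaemale"
    // and "õpetajale" matches even though it starts with a non-ASCII letter.
    const re = new RegExp("(?<![\\p{L}\\p{N}])" + form + "(?![\\p{L}\\p{N}])", "iu");
    if (re.test(message)) return normalized;
  }
  return undefined;
}
```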

Hobby Detection

Maps hobby keywords to normalized hobby strings:

| Category | Estonian Keywords | English Keywords | Normalized |
| --- | --- | --- | --- |
| Gardening | aiandus, aed, taim, lill | garden, plant, flower | aiandus |
| Cooking | kokandus, toit, küpset | cook, bake, culinary | kokandus |
| Reading | lugemine, raamat | read, book, literature | lugemine |
| Sports | sport, jooks, treening | sports, fitness, gym | sport |
| Music | muusika, laul, pill | music, sing, instrument | muusika |
| Crafts | käsitöö, kudum | craft, knit, sewing | käsitöö |
| Technology | tehnoloogia, arvuti | tech, computer, coding | tehnoloogia |
| Travel | reisi, matka | travel, hiking | reisimine |
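
Several of the keywords above are stems ("küpset", "kudum", "reisi"), which suggests prefix matching rather than whole-word matching. The sketch below assumes that reading: `HOBBY_KEYWORDS` and `detectHobbies` are hypothetical names, and anchoring each stem to the start of a word is a design guess, chosen so that "read" does not fire on "bread".

```typescript
interface HobbyRule {
  normalized: string;
  stems: string[];
}

const HOBBY_KEYWORDS: HobbyRule[] = [
  { normalized: "aiandus", stems: ["aiandus", "aed", "taim", "lill", "garden", "plant", "flower"] },
  { normalized: "kokandus", stems: ["kokandus", "toit", "küpset", "cook", "bake", "culinary"] },
  { normalized: "lugemine", stems: ["lugemine", "raamat", "read", "book", "literature"] },
  { normalized: "sport", stems: ["sport", "jooks", "treening", "fitness", "gym"] },
  { normalized: "muusika", stems: ["muusika", "laul", "pill", "music", "sing", "instrument"] },
  { normalized: "käsitöö", stems: ["käsitöö", "kudum", "craft", "knit", "sewing"] },
  { normalized: "tehnoloogia", stems: ["tehnoloogia", "arvuti", "tech", "computer", "coding"] },
  { normalized: "reisimine", stems: ["reisi", "matka", "travel", "hiking"] },
];

// Hypothetical helper: a hobby is reported when any of its stems starts a word.
function detectHobbies(message: string): string[] {
  const found: string[] = [];
  for (const rule of HOBBY_KEYWORDS) {
    // Boundary only on the left, so inflected forms such as "aiandust" still match.
    const re = new RegExp("(?<![\\p{L}\\p{N}])(?:" + rule.stems.join("|") + ")", "iu");
    if (re.test(message)) found.push(rule.normalized);
  }
  return found;
}
```

Prefix matching still produces false positives ("ready" starts with "read", "pillow" with "pill"), so short stems are the first place to look when a hobby is detected unexpectedly.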

Integration Points

Main Extractor (main-extractor.ts)

// After LLM extraction, apply deterministic demographics
const deterministicDemographics = extractDeterministicDemographics(userMessage, debug);
if (deterministicDemographics.extracted) {
  mergeDemographicsIntoContext(normalized, deterministicDemographics, { debug });

  // Fix hallucinated author_search intent
  if (normalized.intent === 'author_search' && hasGiftKeyword) {
    normalized.intent = 'general_gift';
    normalized.authorName = undefined;
  }
}
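
The snippet above does not show how `hasGiftKeyword` is computed. A minimal sketch of the intent correction as a pure function follows; the stem list (`kingi`, `kink`, `gift`) and the names `IntentContext` and `fixHallucinatedAuthorIntent` are assumptions made for illustration, not the project's actual keyword set. In the integration code this check only runs inside the `extracted` branch.

```typescript
interface IntentContext {
  intent: string;
  authorName?: string;
}

// ASSUMPTION: illustrative stems only. "kingi" covers "kingitus" and "kingiks".
const GIFT_STEMS = ["kingi", "kink", "gift"];
const GIFT_KEYWORD = new RegExp(
  "(?<![\\p{L}\\p{N}])(?:" + GIFT_STEMS.join("|") + ")",
  "iu"
);

function hasGiftKeyword(message: string): boolean {
  return GIFT_KEYWORD.test(message);
}

// Returns a corrected copy when an author_search intent contradicts a gift query.
function fixHallucinatedAuthorIntent(context: IntentContext, message: string): IntentContext {
  if (context.intent === "author_search" && hasGiftKeyword(message)) {
    return { ...context, intent: "general_gift", authorName: undefined };
  }
  return context;
}
```

Applied to the failing query from the problem statement, the hallucinated "Tolkien" author search becomes a `general_gift` intent, while a query with no gift wording keeps its `author_search` intent.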

Fast Classifier (classifier-context.ts)

The same deterministic merge runs after the fast classifier returns its result, so demographics are captured on the fast path as well.

API Reference

extractDeterministicDemographics()

function extractDeterministicDemographics(
  userMessage: string,
  debug?: boolean
): DemographicsResult

interface DemographicsResult {
  recipientAge?: number;
  recipientAgeRange?: { min: number; max: number };
  ageGroup?: 'child' | 'teen' | 'adult' | 'elderly';
  ageBracket?: AgeBracket;
  recipientGender?: 'male' | 'female' | 'unisex' | 'unknown';
  recipient?: string;
  hobbies?: string[];
  extracted: boolean;
}
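
The `ageGroup` field is presumably derived from `recipientAge`. The cut-offs are not documented here, so the sketch below uses hypothetical boundaries chosen only to agree with the examples on this page (7 is a child, 65 is elderly); treat the numbers and the `ageGroupFor` name as placeholders.

```typescript
type AgeGroup = "child" | "teen" | "adult" | "elderly";

// ASSUMPTION: illustrative cut-offs; the real boundaries may differ.
function ageGroupFor(age: number): AgeGroup {
  if (age < 13) return "child";
  if (age < 20) return "teen";
  if (age < 65) return "adult";
  return "elderly";
}
```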

mergeDemographicsIntoContext()

function mergeDemographicsIntoContext(
  context: GiftContext,
  demographics: DemographicsResult,
  options?: { overwrite?: boolean; debug?: boolean }
): void
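
The signature leaves the merge semantics open. One plausible reading, sketched below, is that deterministic values fill fields the LLM left empty and replace existing values only when `overwrite` is set. This is an assumption, as are the simplified `ContextSketch` and `DemographicsSketch` types and the choice to union hobbies instead of replacing them.

```typescript
interface ContextSketch {
  recipientAge?: number;
  recipientGender?: string;
  recipient?: string;
  hobbies?: string[];
}

interface DemographicsSketch extends ContextSketch {
  extracted: boolean;
}

// Hypothetical merge: mutates `context` in place, like the documented signature.
function mergeDemographicsSketch(
  context: ContextSketch,
  demographics: DemographicsSketch,
  options: { overwrite?: boolean } = {}
): void {
  if (!demographics.extracted) return;
  const overwrite = options.overwrite === true;

  if (demographics.recipientAge !== undefined && (overwrite || context.recipientAge === undefined)) {
    context.recipientAge = demographics.recipientAge;
  }
  if (demographics.recipientGender !== undefined && (overwrite || context.recipientGender === undefined)) {
    context.recipientGender = demographics.recipientGender;
  }
  if (demographics.recipient !== undefined && (overwrite || context.recipient === undefined)) {
    context.recipient = demographics.recipient;
  }
  // ASSUMPTION: hobbies are combined as a de-duplicated union.
  if (demographics.hobbies !== undefined) {
    const merged = (context.hobbies || []).slice();
    for (const hobby of demographics.hobbies) {
      if (merged.indexOf(hobby) === -1) merged.push(hobby);
    }
    context.hobbies = merged;
  }
}
```

With the default options, a gender already set by the LLM is kept and only the missing age is filled in; passing `{ overwrite: true }` lets the deterministic value win.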

Examples

Example 1: Child Gift (Estonian)

Input: "Kingitus 7-aastasele tüdrukule"
Output: {
  recipientAge: 7,
  ageGroup: 'child',
  ageBracket: 'school_age',
  recipientGender: 'female',
  recipient: 'tütar'
}

Example 2: Adult Gift (English)

Input: "Gift for my 65-year-old father who loves fishing"
Output: {
  recipientAge: 65,
  ageGroup: 'elderly',
  recipientGender: 'male',
  recipient: 'isa',
  hobbies: ['kalapüük']
}

Example 3: Age Range (Estonian)

Input: "Kingitus 7-9 aastasele lapsele"
Output: {
  recipientAge: 8, // midpoint
  recipientAgeRange: { min: 7, max: 9 },
  ageGroup: 'child',
  recipient: 'laps'
}