Estonian Language Best Practices

Lessons learned and recommendations from building Kingisoovitaja, applicable to any Estonian LLM project.

Key Lessons

Lesson 1: Agglutinative Languages Need Special Handling

What we learned:

  • Keyword lists explode in size for Estonian (14 cases × compounds)
  • Simple substring matching is not enough
  • LLM understanding is often better than keyword matching

Recommendation:

  • Rely MORE on LLM, less on keywords for Estonian/Finnish/Turkish
  • Use keywords as validation, not primary detection
  • Maintain smaller, higher-confidence keyword lists
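
A minimal sketch of this approach (BOOK_KEYWORDS and validateBookSignal are illustrative names, not the project's code): a small, high-confidence list covering only the three main cases of "raamat" (book), used to confirm the LLM's reading rather than to detect on its own.

// Small, high-confidence list: nominative, genitive, partitive of "raamat"
const BOOK_KEYWORDS = ['raamat', 'raamatu', 'raamatut'];

function validateBookSignal(llmSaysBook: boolean, userMessage: string): boolean {
  const text = userMessage.toLowerCase();
  const keywordHit = BOOK_KEYWORDS.some(k => text.includes(k));
  // Keywords confirm or flag; the LLM's semantic reading stays primary
  if (llmSaysBook && !keywordHit) {
    console.warn('LLM book signal without keyword support: review, do not override');
  }
  return llmSaysBook || keywordHit;
}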

Lesson 2: LLM Understanding ≠ System Behavior

What we learned:

  • LLM correctly understood "luuleraamatut" = poetry book
  • But downstream rules overrode this correct understanding
  • The problem was NOT the LLM; it was our rigid rules

Recommendation:

  • Trust LLM's semantic understanding
  • Use rules for validation, not override
  • If LLM and rules disagree, investigate WHY (often LLM is right)
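
A sketch of "validate, don't override" with a hypothetical reconcile helper: when a rule disagrees with the LLM, log the conflict for investigation and keep the LLM's value unless a hard constraint applies.

function reconcile<T>(llmValue: T, ruleValue: T, field: string): T {
  if (llmValue !== ruleValue) {
    // Surface the disagreement instead of silently overriding
    console.warn('[CONFLICT]', { field, llmValue, ruleValue });
    return llmValue; // prefer the semantic reading; rules veto only hard constraints
  }
  return ruleValue;
}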

Lesson 3: Data Truncation Kills LLM Quality

What we learned:

  • 80-character truncation caused text repetition
  • LLM tried to complete thoughts with insufficient context
  • Estonian was hit harder (longer words mean less content fits in 80 characters)

Recommendation:

  • Never starve the LLM of context
  • Err on the side of too much data rather than too little
  • If token limits are tight, reduce output length, not input context
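
One way to apply the last point, sketched with a hypothetical planTokenBudget helper: the output allowance is computed from whatever budget remains after the full input, so the context is never what gets cut.

function planTokenBudget(inputTokens: number, totalBudget: number, desiredOutput: number) {
  // The input context passes through untouched; only the output allowance shrinks
  const maxOutputTokens = Math.max(64, Math.min(desiredOutput, totalBudget - inputTokens));
  return { maxInputTokens: inputTokens, maxOutputTokens };
}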

Lesson 4: Contradictory Instructions = Degenerate Behavior

What we learned:

  • "Be detailed" + "Be brief" + "200 words max" = confusion
  • LLM tries to satisfy contradictory requirements → fails
  • Dynamic config made it worse

Recommendation:

  • Provide ONE clear, unambiguous instruction
  • Remove contradictory requirements
  • Hardcode critical AI behavior (don't make it configurable)
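
For example, the contradictory requirements above can collapse into one hardcoded length instruction (wording is illustrative):

// Before: pulls the model in opposite directions
const contradictory =
  'Be detailed. Be brief. Maximum 200 words. Explain your reasoning fully.';

// After: one unambiguous, hardcoded instruction
const lengthInstruction =
  'Write 2-3 sentences per recommendation, at most 200 words in total.';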

Lesson 5: Mixed Language is Common

What we learned:

  • Estonian users naturally code-switch: "Otsin sci-fi raamatut"
  • Technical terms often stay in English
  • Proper nouns add language ambiguity

Recommendation:

  • Plan for mixed-language queries from day one
  • Don't assume single-language per query
  • Default to primary market language for ambiguous cases
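
A simple illustration of that default (detectLanguage is a hypothetical helper, not the project's detector): Estonian-specific letters or function words win, clear English-only signals switch to English, and anything ambiguous falls back to the primary market language.

function detectLanguage(query: string): 'et' | 'en' {
  const estonianLetters = /[õäöüšž]/i.test(query);
  const estonianWords = /\b(otsin|soovin|kingitus|raamat|palun)\b/i.test(query);
  const englishWords = /\b(looking|book|gift|want|please)\b/i.test(query);
  if (estonianLetters || estonianWords) return 'et';
  if (englishWords) return 'en';
  return 'et'; // ambiguous → primary market language
}

// "Otsin sci-fi raamatut" → 'et', even though "sci-fi" is English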

Lesson 6: Cultural Context Can't Be Hardcoded

What we learned:

  • Valentine's Day is not only a romantic date; in Estonia it is also sõbrapäev ("Friend's Day")
  • Occasions have different dates (Mother's Day: May 9, not floating Sunday)
  • Gift budgets vary by culture

Recommendation:

  • Build culture-specific calendars
  • Make cultural rules configurable per market
  • Budget expectations should be data-driven
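
A sketch of what a configurable, per-market calendar could look like (the Occasion interface and entries are illustrative; verify dates for each market):

interface Occasion {
  id: string;
  date: { month: number; day: number };
  giftGiving: boolean;
  notes?: string;
}

const ESTONIAN_CALENDAR: Occasion[] = [
  { id: 'sobrapaev', date: { month: 2, day: 14 }, giftGiving: true,
    notes: "Friend's Day: small gifts for friends, not only romantic partners" },
  { id: 'joululaupaev', date: { month: 12, day: 24 }, giftGiving: true,
    notes: 'Christmas Eve, the main gift-giving evening in Estonia' },
];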

Lesson 7: Testing Must Include Actual Language

What we learned:

  • English tests passed, Estonian failed
  • Compound words revealed flaws English tests missed
  • Mixed-language queries broke assumptions

Recommendation:

  • Test in target language(s) from day one
  • Include compound words, case endings, mixed-language
  • Don't rely on English-only test suite

Architectural Patterns

Pattern 1: LLM + Rule Hybrid

Principle: LLM for understanding, rules for precision

User query → LLM extraction (understanding) → Rule-based validation (precision) → Safety override (catch errors) → Final decision
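
A minimal sketch of that pipeline with the stages injected (the HybridPipeline interface and function names are illustrative):

interface HybridPipeline<C> {
  llmExtract: (query: string) => Promise<C>;               // understanding
  validateWithRules: (context: C) => C;                    // precision
  applySafetyOverrides: (validated: C, original: C) => C;  // catch errors
}

async function decide<C>(userQuery: string, p: HybridPipeline<C>): Promise<C> {
  const extracted = await p.llmExtract(userQuery);
  const validated = p.validateWithRules(extracted);
  return p.applySafetyOverrides(validated, extracted);     // final decision
}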

Pattern 2: Signal Preservation

Principle: Preserve initial LLM signals before modifications

// Capture initial state
const initialBookSignal = isBookSignal(context.productType);

// Allow modifications
context = sanitize(context);
context = applyRules(context);

// Safety check
if (initialBookSignal && context.productType !== 'Raamat') {
  console.warn('Signal lost!');
  // Restore or override
}

Pattern 3: Multilayer Fallbacks

Principle: Multiple independent checks

const wantsBooks =
  checkLLMSignals(context) ||            // Layer 1
  checkKeywords(userMessage) ||          // Layer 2
  checkConversationHistory(history);     // Layer 3

// Only exclude if ALL layers say no
if (!wantsBooks) {
  excludeBooks();
}

Pattern 4: Comprehensive Telemetry

Principle: Log every decision point

console.log('[TELEMETRY] wantsBooksFromContext:', {
  decision: wantsBooks,
  signals: {
    productTypeIsBook,
    messageHasBookKeyword,
    negativePhrase
  },
  context: { productType, userMessage },
  timestamp: new Date().toISOString()
});

When to Use LLM vs Rules

Use LLM for:

  • Understanding user intent
  • Handling natural language variations
  • Context-dependent decisions
  • Ambiguity resolution

Use Rules for:

  • Validation of LLM outputs
  • Hard constraints (age appropriateness)
  • Cultural/business logic
  • Edge case handling

Use Hybrid for:

  • Language detection (rules + LLM)
  • Product type detection (LLM + keywords)
  • Occasion inference (LLM + calendar)
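
As one illustration of the last hybrid (all names are hypothetical): the LLM proposes an occasion and a calendar lookup confirms it or falls back to the nearest upcoming entry.

function resolveOccasion(
  llmOccasionId: string | null,
  today: Date,
  calendar: { id: string; month: number; day: number }[]
) {
  const match = calendar.find(o => o.id === llmOccasionId);
  if (match) return match; // LLM hint confirmed by the calendar
  // Otherwise fall back to the nearest upcoming calendar entry
  return calendar
    .map(o => ({ o, when: new Date(today.getFullYear(), o.month - 1, o.day) }))
    .filter(x => x.when.getTime() >= today.getTime())
    .sort((a, b) => a.when.getTime() - b.when.getTime())[0]?.o ?? null;
}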

Recommendations for Estonian NLP

Based on our experience:

1. Architecture

  • Use LLM for understanding, rules for validation
  • Preserve initial LLM signals
  • Implement multilayer fallbacks
  • Comprehensive telemetry

2. Keyword Handling

  • Keep lists small and high-confidence
  • Include the three main cases (nominative, genitive, partitive)
  • Use for validation, not primary detection
  • Apply substring matching with caution
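
Why substring matching needs caution: "raamatupidaja" (accountant) contains "raamatu" but has nothing to do with books. Matching whole words avoids that trap, while genuine compounds such as "luuleraamatut" are left to the LLM, which is exactly why keywords stay a validation layer. A small illustrative check:

function hasBookWord(message: string): boolean {
  const words = message.toLowerCase().split(/[\s.,!?]+/);
  return words.some(w => ['raamat', 'raamatu', 'raamatut'].includes(w));
}

// hasBookWord('Otsin raamatut')         → true
// hasBookWord('Töötan raamatupidajana') → false (accountant, not a book request)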

3. Language Detection

  • Multi-signal approach
  • Default to Estonian for ambiguous queries
  • Handle mixed-language gracefully
  • Use conversation history

4. LLM Prompting

  • Explicit Estonian grammar instructions
  • Examples of correct Estonian
  • Avoid contradictory instructions
  • Hardcode critical behavior
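
A sketch of such an instruction block covering the first two points (wording is illustrative, not the production prompt):

const estonianStyleInstructions = `
Vasta alati korrektses eesti keeles ja kasuta õigeid käändevorme.
Example: for the query "Otsin luuleraamatut", set productType to "Raamat"
and reply in Estonian, e.g. "Soovitan sulle seda luuleraamatut."
`;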

5. Testing

  • Test with actual Estonian text
  • Include compound words
  • Test case variations
  • Mix English technical terms
  • Edge cases: single-word, proper nouns
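
A handful of illustrative test inputs covering those points (the expected values depend on the project's own extraction contract):

const estonianTestCases = [
  { query: 'Otsin luuleraamatut', expectBook: true },      // compound + partitive case
  { query: 'Kingitus raamatusõbrale', expectBook: true },  // compound + case ending
  { query: 'Otsin sci-fi raamatut', expectBook: true },    // mixed language
  { query: 'Töötan raamatupidajana', expectBook: false },  // deceptive substring (accountant)
  { query: 'Raamat', expectBook: true },                   // single-word query
];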

6. Cultural Adaptation

  • Estonian cultural calendar
  • Local gift-giving norms
  • Market-based budget expectations
  • Age/gender appropriateness rules

7. Data Quality

  • Never truncate aggressively
  • Provide full product information
  • UTF-8 encoding everywhere
  • Normalize Unicode early (NFC)
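
Normalizing to NFC early keeps õ/ä/ö/ü in one canonical form, so keyword checks and string comparisons behave predictably. A minimal sketch with a hypothetical normalizeInput helper (String.prototype.normalize is standard):

function normalizeInput(text: string): string {
  return text.normalize('NFC').trim();
}

// A decomposed "o" + combining tilde (U+0303) becomes the single code point "õ"
const same = normalizeInput('o\u0303') === 'õ'; // true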

Broader Implications

These lessons apply to other morphologically rich languages:

  • Finnish (Finno-Ugric, 15 cases)
  • Turkish (Agglutinative, extensive suffixation)
  • Hungarian (Finno-Ugric, ~18 cases)
  • Arabic (Root-based morphology)
  • Japanese (Agglutinative particles)

Universal principle: Rely on LLM semantic understanding, augment with linguistic rules, but don't let rules override LLM insights.

Future Improvements

1. LLM-Based Classification

Current: Keyword lists
Proposed: Add explicit classification field

interface GiftContext {
  bookRequestType: 'explicit' | 'implicit' | 'none';
}

Benefits:

  • Automatically handles all compounds
  • No keyword maintenance
  • Context-aware

2. Morphological Analyzer

Current: Hardcoded dative mappings
Proposed: Integrate Vabamorf

import { analyzeMorphology } from 'vabamorf';

static toEstonianDative(noun: string): string {
  // Estonian has no true dative; the allative case (-le) fills that role
  const analysis = analyzeMorphology(noun);
  return analysis.inflect({ case: 'allative' });
}

Benefits:

  • Handles ANY Estonian noun
  • Correct for all edge cases

3. Conversation-Aware Language Switching

Current: Per-message detection
Proposed: Track conversation language

interface ConversationContext {
  primaryLanguage: 'et' | 'en';
  languageConfidence: number;
}

// Switch only if strong signal
if (currentLanguage !== primaryLanguage && confidence > 0.9) {
  primaryLanguage = currentLanguage;
}