Estonian Language Best Practices

Lessons learned and recommendations from building Kingisoovitaja, applicable to any Estonian LLM project.

Key Lessons

Lesson 1: Agglutinative Languages Need Special Handling

What we learned:

Keyword lists explode in size for Estonian (14 cases × compounds)
Simple substring matching is not enough
LLM understanding is often better than keyword matching

Recommendation:

Rely MORE on LLM, less on keywords for Estonian/Finnish/Turkish
Use keywords as validation, not primary detection
Maintain smaller, higher-confidence keyword lists
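
A minimal sketch of keywords used as validation rather than primary detection. The list contents and the names `BOOK_FORMS`, `keywordSuggestsBook`, and `validateBookSignal` are hypothetical, chosen for illustration only. The LLM signal stays primary, and the keyword check only labels whether the two agree.

```typescript
// Hypothetical high-confidence list: nominative, genitive, and partitive
// singular of "raamat" (book). Kept deliberately small.
const BOOK_FORMS: readonly string[] = ['raamat', 'raamatu', 'raamatut'];

// Split on whitespace and common punctuation; lowercase for comparison.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[\s.,!?;:()"]+/)
    .filter((token) => token.length > 0);
}

// Estonian compounds put the head noun last, so a suffix check on whole
// tokens also catches "luuleraamatut" without listing every compound.
function keywordSuggestsBook(message: string): boolean {
  return tokenize(message).some((token) =>
    BOOK_FORMS.some((form) => token.endsWith(form)),
  );
}

type Verdict = 'confirmed' | 'llm-only' | 'keyword-only' | 'none';

// The LLM signal is never overridden here; the keyword check only
// reports agreement or disagreement for later investigation.
function validateBookSignal(llmSaysBook: boolean, message: string): Verdict {
  const keywordHit = keywordSuggestsBook(message);
  if (llmSaysBook && keywordHit) return 'confirmed';
  if (llmSaysBook) return 'llm-only';
  return keywordHit ? 'keyword-only' : 'none';
}
```

An `llm-only` or `keyword-only` verdict is a prompt to look at the query, not a reason to discard the LLM's reading.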

Lesson 2: LLM Understanding ≠ System Behavior

What we learned:

LLM correctly understood "luuleraamatut" = poetry book
But downstream rules overrode this correct understanding
The problem was NOT the LLM, it was our rigid rules

Recommendation:

Trust LLM's semantic understanding
Use rules for validation, not override
If LLM and rules disagree, investigate WHY (often LLM is right)

Lesson 3: Data Truncation Kills LLM Quality

What we learned:

80-character truncation caused text repetition
LLM tried to complete thoughts with insufficient context
Estonian was hit harder (longer words mean less content fits in 80 characters)

Recommendation:

Never starve the LLM of context
Err on the side of too much data rather than too little
If token limits are tight, reduce output length, not input context
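
One way to apply the last recommendation is to size the output allowance from whatever the full input leaves over, instead of cutting the input. The sketch below assumes token counts are already known. `planTokenBudget` and its parameters are hypothetical names, not part of the project.

```typescript
interface TokenBudget {
  inputTokens: number;
  maxOutputTokens: number;
}

// Shrink the output allowance first; the input context is kept whole.
function planTokenBudget(
  contextWindow: number,
  inputTokens: number,
  desiredOutputTokens: number,
  minOutputTokens: number,
): TokenBudget {
  const available = contextWindow - inputTokens;
  if (available < minOutputTokens) {
    // Fail loudly rather than silently truncating the input.
    throw new Error(
      `Input of ${inputTokens} tokens leaves only ${available} tokens for output`,
    );
  }
  return {
    inputTokens,
    maxOutputTokens: Math.min(desiredOutputTokens, available),
  };
}
```

With a 4000-token window and a 3700-token input, a desired 500-token reply is reduced to 300 tokens while the input stays untouched.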

Lesson 4: Contradictory Instructions = Degenerate Behavior

What we learned:

"Be detailed" + "Be brief" + "200 words max" = confusion
LLM tries to satisfy contradictory requirements → fails
Dynamic config made it worse

Recommendation:

Provide ONE clear, unambiguous instruction
Remove contradictory requirements
Hardcode critical AI behavior (don't make it configurable)
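
As a sketch of hardcoding critical behavior, the prompt builder below keeps the single length instruction as a constant in code and exposes only non-critical values through configuration. The instruction wording, the `MarketConfig` shape, and `buildSystemPrompt` are assumptions made for this example.

```typescript
// Critical behavior lives in code as ONE unambiguous instruction.
// The wording is an assumed example, not the project's actual prompt.
const LENGTH_INSTRUCTION = 'Describe each product in 2-3 sentences.';

// Only non-critical values are configurable (hypothetical shape).
interface MarketConfig {
  storeName: string;
  language: 'et' | 'en';
}

function buildSystemPrompt(config: MarketConfig): string {
  const languageName = config.language === 'et' ? 'Estonian' : 'English';
  return [
    `You are a gift recommendation assistant for ${config.storeName}.`,
    `Reply in ${languageName}.`,
    LENGTH_INSTRUCTION,
  ].join('\n');
}
```

Because `MarketConfig` has no field for length or tone, a configuration change cannot reintroduce a conflicting "be detailed" or "be brief" instruction.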

Lesson 5: Mixed Language is Common

What we learned:

Estonian users naturally code-switch: "Otsin sci-fi raamatut"
Technical terms often stay in English
Proper nouns add language ambiguity

Recommendation:

Plan for mixed-language queries from day one
Don't assume single-language per query
Default to primary market language for ambiguous cases
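
The recommendations above can be sketched as a per-token vote that falls back to the market language on a tie. The hint lists are tiny placeholders and `detectLanguage` is a hypothetical helper. A production system would combine this with LLM signals and conversation history.

```typescript
type Language = 'et' | 'en';

// Tiny illustrative hint lists; real lists would be larger and curated.
const ESTONIAN_HINTS = new Set(['otsin', 'ja', 'ei', 'kas', 'mida', 'raamatut', 'emale']);
const ENGLISH_HINTS = new Set(['the', 'and', 'for', 'looking', 'gift', 'book', 'my']);
// Letters that are common in Estonian and absent from plain English.
const ESTONIAN_LETTERS = /[õäöüšž]/;

function detectLanguage(message: string, marketDefault: Language = 'et'): Language {
  const tokens = message
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0);
  let et = 0;
  let en = 0;
  for (const token of tokens) {
    if (ESTONIAN_HINTS.has(token) || ESTONIAN_LETTERS.test(token)) {
      et += 1;
    } else if (ENGLISH_HINTS.has(token)) {
      en += 1;
    }
    // Unknown tokens (proper nouns, "sci-fi") vote for neither language.
  }
  // Ambiguous input falls back to the primary market language.
  if (et === en) return marketDefault;
  return et > en ? 'et' : 'en';
}
```

In "Otsin sci-fi raamatut" the English loan term casts no vote, so the two Estonian tokens decide the result. A bare proper noun such as "Harry Potter" ties at zero and resolves to the market default.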

Lesson 6: Cultural Context Can't Be Hardcoded

What we learned:

Valentine's Day ≠ only romantic dates (also "Friend's Day")
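Occasion dates differ by market and must be verified per country (in Estonia, Mother's Day is the second Sunday of May and Father's Day the second Sunday of November, not June)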
Gift budgets vary by culture

Recommendation:

Build culture-specific calendars
Make cultural rules configurable per market
Budget expectations should be data-driven
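
A culture-specific calendar can be kept as data, with one small resolver for fixed and floating dates. The sketch below is illustrative: `OccasionRule`, `resolveDate`, and `CALENDARS` are hypothetical names, and the entries cover only Friend's Day (14 February) and Father's Day (second Sunday of November in Estonia, third Sunday of June in the US).

```typescript
type OccasionRule =
  | { kind: 'fixed'; month: number; day: number }
  | { kind: 'nthWeekday'; month: number; weekday: number; n: number };

interface Occasion {
  name: string;
  rule: OccasionRule;
}

// Months are 1-12; weekday 0 is Sunday, matching Date.getUTCDay().
function resolveDate(rule: OccasionRule, year: number): string {
  let day: number;
  if (rule.kind === 'fixed') {
    day = rule.day;
  } else {
    const firstWeekday = new Date(Date.UTC(year, rule.month - 1, 1)).getUTCDay();
    const offset = (rule.weekday - firstWeekday + 7) % 7;
    day = 1 + offset + (rule.n - 1) * 7;
  }
  // Return an ISO date (YYYY-MM-DD) computed in UTC to avoid time zone shifts.
  return new Date(Date.UTC(year, rule.month - 1, day)).toISOString().slice(0, 10);
}

// Per-market calendars are plain data, so adding a market needs no new code.
const CALENDARS: Record<string, Occasion[]> = {
  et: [
    { name: "Friend's Day (sõbrapäev)", rule: { kind: 'fixed', month: 2, day: 14 } },
    { name: "Father's Day (isadepäev)", rule: { kind: 'nthWeekday', month: 11, weekday: 0, n: 2 } },
  ],
  us: [
    { name: "Father's Day", rule: { kind: 'nthWeekday', month: 6, weekday: 0, n: 3 } },
  ],
};
```

Keeping the rules in a per-market table means a date correction is a data change that can be reviewed by someone who knows the market, rather than a code change.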

Lesson 7: Testing Must Include Actual Language

What we learned:

English tests passed, Estonian failed
Compound words revealed flaws English tests missed
Mixed-language queries broke assumptions

Recommendation:

Test in target language(s) from day one
Include compound words, case endings, mixed-language
Don't rely on English-only test suite
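
A table-driven suite makes the target-language coverage explicit. In this sketch the cases, `failingCases`, and `englishOnlyDetector` are hypothetical. The detector stands in for whatever book-detection function the system under test exposes.

```typescript
interface LanguageCase {
  input: string;
  expectBook: boolean;
  covers: string;
}

const ESTONIAN_CASES: LanguageCase[] = [
  { input: 'Otsin raamatut', expectBook: true, covers: 'partitive case ending' },
  { input: 'Otsin luuleraamatut', expectBook: true, covers: 'compound word' },
  { input: 'Otsin sci-fi raamatut', expectBook: true, covers: 'mixed language' },
  { input: 'Otsin lilli emale', expectBook: false, covers: 'negative control' },
];

// Returns the cases a detector gets wrong, so a test can assert on an empty list.
function failingCases(
  detector: (message: string) => boolean,
  cases: LanguageCase[],
): LanguageCase[] {
  return cases.filter((testCase) => detector(testCase.input) !== testCase.expectBook);
}

// A detector that only knows the English keyword passes an English suite
// but misses every Estonian book query in the table above.
const englishOnlyDetector = (message: string): boolean =>
  message.toLowerCase().indexOf('book') !== -1;
```

Running `failingCases(englishOnlyDetector, ESTONIAN_CASES)` returns the three book queries, which is the kind of gap an English-only suite never surfaces.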

Architectural Patterns

Pattern 1: LLM + Rule Hybrid

Principle: LLM for understanding, rules for precision

User query
↓
LLM extraction (understanding) 
↓
Rule-based validation (precision) 
↓
Safety override (catch errors) 
↓
Final decision
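
The flow above could be sketched as a single function with the LLM call injected, so the stages can be exercised without a model. `decide`, `KNOWN_TYPES`, the `'unknown'` marker, and the keyword used in the override step are all assumptions for illustration.

```typescript
interface GiftContext {
  productType: string;
}

interface Decision {
  context: GiftContext;
  trace: string[];
}

// Hypothetical catalog vocabulary used by the validation step.
const KNOWN_TYPES: readonly string[] = ['Raamat', 'Lilled'];

function decide(
  userMessage: string,
  extract: (message: string) => GiftContext, // injected LLM call
): Decision {
  const trace: string[] = [];

  // 1. LLM extraction (understanding)
  const extracted = extract(userMessage);
  trace.push(`llm:${extracted.productType}`);

  // 2. Rule-based validation (precision): accept only catalog types.
  let context: GiftContext =
    KNOWN_TYPES.indexOf(extracted.productType) !== -1
      ? extracted
      : { productType: 'unknown' };
  trace.push(`validated:${context.productType}`);

  // 3. Safety override (catch errors): a plain book keyword in the message
  //    rescues a request that validation could not classify.
  if (context.productType === 'unknown' && /raamat/i.test(userMessage)) {
    context = { productType: 'Raamat' };
    trace.push('override:Raamat');
  }

  // 4. Final decision, with a trace of how it was reached.
  return { context, trace };
}
```

The override only fires when validation has already failed, so a valid LLM extraction is never replaced by a rule.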

Pattern 2: Signal Preservation

Principle: Preserve initial LLM signals before modifications

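// Capture the initial LLM signal before any rule touches the context
const initialBookSignal = isBookSignal(context.productType);

// Allow modifications (context must be declared with `let`)
context = sanitize(context);
context = applyRules(context);

// Safety check: a book signal must survive sanitizing and rules
if (initialBookSignal && context.productType !== 'Raamat') {
  console.warn('Signal lost: restoring productType "Raamat"');
  context = { ...context, productType: 'Raamat' };
}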

Pattern 3: Multilayer Fallbacks

Principle: Multiple independent checks

const wantsBooks = 
  checkLLMSignals(context) ||      // Layer 1
  checkKeywords(userMessage) ||    // Layer 2
  checkConversationHistory(history); // Layer 3

// Only exclude if ALL layers say no
if (!wantsBooks) {
  excludeBooks();
}

Pattern 4: Comprehensive Telemetry

Principle: Log every decision point

console.log('[TELEMETRY] wantsBooksFromContext:', {
  decision: wantsBooks,
  signals: {
    productTypeIsBook,
    messageHasBookKeyword,
    negativePhrase
  },
  context: { productType, userMessage },
  timestamp: new Date().toISOString()
});

When to Use LLM vs Rules

Use LLM for:

Understanding user intent
Handling natural language variations
Context-dependent decisions
Ambiguity resolution

Use Rules for:

Validation of LLM outputs
Hard constraints (age appropriateness)
Cultural/business logic
Edge case handling

Use Hybrid for:

Language detection (rules + LLM)
Product type detection (LLM + keywords)
Occasion inference (LLM + calendar)

Recommendations for Estonian NLP

Based on our experience:

1. Architecture

Use LLM for understanding, rules for validation
Preserve initial LLM signals
Implement multilayer fallbacks
Comprehensive telemetry

2. Keyword Handling

Keep lists small and high-confidence
Include 3 main cases (nom/gen/part)
Use for validation, not primary detection
Use substring matching with caution

3. Language Detection

Multi-signal approach
Default to Estonian for ambiguous input
Handle mixed-language gracefully
Use conversation history

4. LLM Prompting

Explicit Estonian grammar instructions
Examples of correct Estonian
Avoid contradictory instructions
Hardcode critical behavior

5. Testing

Test with actual Estonian text
Include compound words
Test case variations
Mix English technical terms
Edge cases: single-word, proper nouns

6. Cultural Adaptation

Estonian cultural calendar
Local gift-giving norms
Market-based budget expectations
Age/gender appropriateness rules

7. Data Quality

Never truncate aggressively
Provide full product information
UTF-8 encoding everywhere
Normalize Unicode early (NFC)
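
To illustrate why early NFC normalization matters: the Estonian letter "õ" can arrive either precomposed (U+00F5) or as "o" followed by a combining tilde (U+0303), and the two forms do not compare equal until normalized. `normalizeInput` is a hypothetical helper name.

```typescript
// "kõik" with a precomposed õ versus "o" + combining tilde.
const composed: string = 'k\u00F5ik';
const decomposed: string = 'ko\u0303ik';

// Normalize once at the system boundary so later comparisons,
// keyword lookups, and length checks all see the same form.
function normalizeInput(text: string): string {
  return text.normalize('NFC').trim();
}

const sameBefore = composed === decomposed;
const sameAfter = normalizeInput(composed) === normalizeInput(decomposed);
```

Here `sameBefore` is `false` and `sameAfter` is `true`. The decomposed string is also one code unit longer, which skews any character-based limit.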

Broader Implications

These lessons apply to other morphologically rich languages:

Finnish (Finno-Ugric, 15 cases)
Turkish (Agglutinative, extensive suffixation)
Hungarian (Finno-Ugric, ~18 cases)
Arabic (Root-based morphology)
Japanese (Agglutinative, with postpositional particles)

Universal principle: Rely on LLM semantic understanding, augment with linguistic rules, but don't let rules silently override LLM insights; when the two disagree, investigate why.

Future Improvements

1. LLM-Based Classification

Current: Keyword lists
Proposed: Add explicit classification field

interface GiftContext {
  bookRequestType: 'explicit' | 'implicit' | 'none';
}

Benefits:

Automatically handles all compounds
No keyword maintenance
Context-aware
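
Because the proposed field would be filled in by the model, a runtime guard is a sensible companion to the compile-time type. The sketch below assumes the LLM replies with a JSON object containing `bookRequestType`. `parseBookRequestType` is a hypothetical helper, and returning `null` lets the caller fall back to other detection layers.

```typescript
type BookRequestType = 'explicit' | 'implicit' | 'none';

const BOOK_REQUEST_TYPES: readonly string[] = ['explicit', 'implicit', 'none'];

// Type guard: narrows an unknown value to the allowed union.
function isBookRequestType(value: unknown): value is BookRequestType {
  return typeof value === 'string' && BOOK_REQUEST_TYPES.indexOf(value) !== -1;
}

// Parses the model's JSON reply; malformed or unexpected output yields
// null instead of a guessed classification.
function parseBookRequestType(llmJson: string): BookRequestType | null {
  try {
    const parsed: unknown = JSON.parse(llmJson);
    if (typeof parsed !== 'object' || parsed === null) return null;
    const value = (parsed as { bookRequestType?: unknown }).bookRequestType;
    return isBookRequestType(value) ? value : null;
  } catch {
    return null;
  }
}
```

For example, `parseBookRequestType('{"bookRequestType":"implicit"}')` returns `'implicit'`, while an unlisted value or invalid JSON returns `null`.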

2. Morphological Analyzer

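Current: Hardcoded "dative" mappings (Estonian has no dative case; the allative ending -le fills that role)
Proposed: Integrate Vabamorf

// Hypothetical binding: the 'vabamorf' module name and the
// analyzeMorphology / inflect API are illustrative assumptions.
import { analyzeMorphology } from 'vabamorf';

// Returns the allative form, e.g. "ema" -> "emale" (to/for mother).
function toEstonianDative(noun: string): string {
  const analysis = analyzeMorphology(noun);
  return analysis.inflect({ case: 'allative' });
}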

Benefits:

Handles ANY Estonian noun
Correct for all edge cases

3. Conversation-Aware Language Switching

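Current: Per-message detection
Proposed: Track conversation language

interface ConversationContext {
  primaryLanguage: 'et' | 'en';
  languageConfidence: number;
}

// Switch only on a strong signal from the current message
function updateLanguage(
  conversation: ConversationContext,
  detected: 'et' | 'en',
  confidence: number
): ConversationContext {
  if (detected !== conversation.primaryLanguage && confidence > 0.9) {
    return { primaryLanguage: detected, languageConfidence: confidence };
  }
  return conversation;
}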

Related Documentation

Estonian Overview - Challenge overview
Compound Words - Compound handling
Morphology - Case system
Mixed Language - Code-switching
Text Quality - Repetition prevention
Cultural Context - Localization
