Estonian Language Challenges with LLMs
This section documents the challenges of working with the Estonian language in LLM-powered systems and the solutions implemented in Kingisoovitaja.
Executive Summary
Estonian is difficult for LLM-powered systems because of its agglutinative morphology, extensive case system, and productive compound word formation.
Key Finding: Modern LLMs (GPT-4, GPT-5.1) understand Estonian reasonably well. The most significant issues come from the interaction between LLM outputs and downstream processing logic.
Estonian Language Characteristics
Estonian is a Finno-Ugric language with properties that challenge typical NLP systems:
1. Agglutinative Morphology
Words are built by combining morphemes:
luule (poetry) + raamat (book) + -ut (partitive) = luuleraamatut
2. Extensive Case System
14 grammatical cases, each with distinct suffixes:
| Case | Example | Usage |
|---|---|---|
| Nominative | raamat | Subject: "the book" |
| Genitive | raamatu | Possession: "book's" |
| Partitive | raamatut | Object: "a book (obj.)" |
| Illative | raamatusse | Into: "into the book" |
| Inessive | raamatus | In: "in the book" |
| Elative | raamatust | From: "from the book" |
...and 8 more cases
Impact: A single English keyword like "book" corresponds to 14 singular forms in Estonian, and to a further 14 in the plural.
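To make the matching burden concrete, the sketch below generates the 14 singular forms of a regular noun from its three principal parts (nominative, genitive, partitive). The function name `singular_forms` and the suffix table are illustrative assumptions, not Kingisoovitaja code. The approach relies on the fact that eleven of the cases attach a fixed suffix to the genitive stem; it ignores short illative forms and the plural.

```python
from typing import Dict

# Illustrative sketch: not taken from the Kingisoovitaja codebase.
# Eleven of the 14 cases attach a fixed suffix to the genitive stem.
GENITIVE_STEM_SUFFIXES = {
    "illative": "sse",
    "inessive": "s",
    "elative": "st",
    "allative": "le",
    "adessive": "l",
    "ablative": "lt",
    "translative": "ks",
    "terminative": "ni",
    "essive": "na",
    "abessive": "ta",
    "comitative": "ga",
}


def singular_forms(nominative: str, genitive: str, partitive: str) -> Dict[str, str]:
    """Return all 14 singular case forms given the three principal parts."""
    forms = {"nominative": nominative, "genitive": genitive, "partitive": partitive}
    for case, suffix in GENITIVE_STEM_SUFFIXES.items():
        forms[case] = genitive + suffix
    return forms


forms = singular_forms("raamat", "raamatu", "raamatut")
keyword_variants = set(forms.values())

print(len(keyword_variants))             # 14
print("raamatusse" in keyword_variants)  # True
```

A keyword matcher built on `keyword_variants` covers every singular form of one noun, but the set has to be rebuilt for each noun and still misses irregular stems.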
3. Compound Word Formation
Estonian freely creates compound words:
luuleraamat = poetry book
krimiraamat = mystery book
ulmeraamat = sci-fi book
fantaasiaraamat = fantasy book
kokaraamat = cookbook
Each compound can take all 14 case endings:
kokaraamat → kokaraamatu, kokaraamatut, kokaraamatusse...
Result: Multiplicative growth in word variants (number of compounds × number of case forms).
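One way to keep this growth manageable is to match the head noun at the end of a token instead of enumerating every compound. The sketch below assumes a small, hand-maintained table of inflected forms for the head noun `raamat`; the names `HEAD_FORMS` and `split_compound` are hypothetical. A token counts as a compound when it ends in one of those forms and has a non-empty modifier in front.

```python
from typing import Optional, Tuple

# Illustrative sketch: hypothetical names, not the project's implementation.
# Inflected forms of the head noun "raamat" (book), mapped to their case.
HEAD_FORMS = {
    "raamat": "nominative",
    "raamatu": "genitive",
    "raamatut": "partitive",
    "raamatusse": "illative",
    "raamatus": "inessive",
    "raamatust": "elative",
}


def split_compound(token: str) -> Optional[Tuple[str, str, str]]:
    """Split a token into (modifier, head form, case), or return None."""
    token = token.lower()
    for form, case in HEAD_FORMS.items():
        modifier = token[: -len(form)]
        # Require a modifier so the bare head noun is not reported as a compound.
        if token.endswith(form) and modifier:
            return modifier, form, case
    return None


print(split_compound("kokaraamatut"))     # ('koka', 'raamatut', 'partitive')
print(split_compound("Luuleraamatusse"))  # ('luule', 'raamatusse', 'illative')
print(split_compound("raamatu"))          # None
```

With this design the lookup table grows with the number of head-noun forms only, not with the number of modifiers that can precede them.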
4. Stem Changes
Words change internally based on grammatical rules:
laps (child) → lapse (gen.) → lapsele (all.) → laste (pl.gen.)
This is NOT just suffix addition: the word stem itself changes.
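Because the stem itself changes, stripping a suffix does not recover the dictionary form. The sketch below contrasts a naive suffix stripper with an explicit form-to-lemma table. Both `naive_strip` and `LEMMA_TABLE` are hypothetical illustrations; a production system would more likely rely on a morphological analyzer than on a hand-written table.

```python
# Illustrative sketch: hypothetical helpers, not the project's code.
CASE_SUFFIXES = ("le", "st", "ga", "ta")  # a few singular case endings

# Explicit mapping for forms whose stem differs from the dictionary form.
LEMMA_TABLE = {
    "lapse": "laps",    # genitive singular
    "lapsele": "laps",  # allative singular
    "laste": "laps",    # genitive plural
}


def naive_strip(token: str) -> str:
    """Remove a known case suffix without any knowledge of stem changes."""
    for suffix in CASE_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token: str) -> str:
    """Prefer the explicit table; fall back to suffix stripping."""
    token = token.lower()
    return LEMMA_TABLE.get(token, naive_strip(token))


print(naive_strip("lapsele"))  # lapse (the genitive stem, not the lemma)
print(naive_strip("laste"))    # laste (no suffix recognized)
print(lemmatize("lapsele"))    # laps
print(lemmatize("laste"))      # laps
```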
Problem Categories
Success Metrics
Before Comprehensive Fixes
| Metric | Before |
|---|---|
| Estonian compound word detection | 20% |
| Language switching errors | 15-20% |
| Text repetition incidents | ~5% |
| User satisfaction (Estonian) | 3.2/5 |
After Comprehensive Fixes
| Metric | After | Improvement |
|---|---|---|
| Estonian compound word detection | 80-95% | +60 to +75 pp |
| Language switching errors | <5% | -70% |
| Text repetition incidents | <0.1% | -98% |
| User satisfaction (Estonian) | 4.4/5 | +38% |
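The figures in the two tables can be compared either as absolute differences (percentage points) or as relative changes. The sketch below computes both for the values reported above; the helper names are illustrative, and ranges such as 80-95% are represented by their endpoints.

```python
def point_change(before: float, after: float) -> float:
    """Absolute difference, e.g. in percentage points."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Relative change as a percentage of the starting value."""
    return (after - before) / before * 100


# Compound word detection: 20% before, 80-95% after.
print(point_change(20, 80), point_change(20, 95))  # 60 75

# Text repetition: ~5% before, 0.1% as the upper bound after.
print(round(relative_change(5, 0.1)))              # -98

# User satisfaction: 3.2/5 before, 4.4/5 after.
print(round(relative_change(3.2, 4.4), 1))         # 37.5
```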
Core Insight
The biggest insight: Modern LLMs handle Estonian quite well semantically. Problems arose from:
- Downstream rules overriding the LLM's correct understanding
- Keyword-based systems failing on morphological variations
- Data truncation starving the LLM of context
- Contradictory instructions causing degenerate behavior
Solution paradigm: Trust the LLM more, use rules less. Rules should validate and enhance LLM outputs, not override them.
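The "validate and enhance, not override" principle can be sketched as a merge step in which the rule-based result only fills gaps or raises a flag. Everything here is hypothetical: the `Interpretation` type, its `category` field, and the `merge` function are stand-ins for whatever structures the pipeline actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interpretation:
    """Hypothetical result type: a category plus any validation warnings."""
    category: Optional[str]
    warnings: List[str] = field(default_factory=list)


def merge(llm_category: Optional[str], rule_category: Optional[str]) -> Interpretation:
    """Prefer the LLM's answer; use the rule result only to fill or flag."""
    if llm_category is None:
        # Enhance: the LLM gave no answer, so the rule fills the gap.
        return Interpretation(category=rule_category)
    result = Interpretation(category=llm_category)
    if rule_category is not None and rule_category != llm_category:
        # Validate: record the disagreement, but do not override the LLM.
        result.warnings.append(
            f"rule suggested {rule_category!r}, kept LLM value {llm_category!r}"
        )
    return result


print(merge("poetry", "cooking").category)  # poetry
print(merge(None, "cooking").category)      # cooking
print(merge("poetry", "cooking").warnings)
```

Keeping disagreements as warnings rather than overrides leaves a trail that can be reviewed later to decide whether a rule or a prompt needs adjusting.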
Documentation Structure
Detailed documentation for each challenge:
- Compound Words - Detection and handling strategies
- Morphological Cases - Dative case mapping and normalization
- Mixed Language - Code-switching and detection
- Text Quality - Repetition prevention and fixes
- Cultural Context - Occasions, budgets, and localization
- Best Practices - Lessons learned and recommendations
Quick Reference
Estonian Language Support Matrix
| Feature | Coverage |
|---|---|
| Compound word detection | ~40 common compounds |
| Case ending support | 3 main cases (nom/gen/part) |
| Dative case conversion | ~20 common recipients |
| Mixed language queries | Pattern-based detection |
| Cultural calendar | 9 Estonian occasions |
| Encoding normalization | UTF-8 + error correction |
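As one possible reading of "UTF-8 + error correction", the sketch below normalizes text to NFC and reverses a common mojibake pattern, in which UTF-8 bytes were decoded as Latin-1 (so `õ` appears as `Ãµ`). The function name `normalize_estonian` is an assumption; the source does not describe the actual correction logic.

```python
import unicodedata


def normalize_estonian(text: str) -> str:
    """Repair Latin-1 mojibake where possible, then normalize to NFC."""
    try:
        # If the text is UTF-8 that was wrongly decoded as Latin-1,
        # this round trip restores the intended characters.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already correct (or damaged in another way): leave unchanged.
        repaired = text
    # Compose base letter + combining mark into a single code point.
    return unicodedata.normalize("NFC", repaired)


print(normalize_estonian("pÃµnev raamat"))             # põnev raamat
print(normalize_estonian("põnev raamat"))              # põnev raamat
print(normalize_estonian("o\u0303un") == "\u00f5un")   # True
```

Normalizing before any keyword or compound matching means that `õ`, `ä`, `ö`, and `ü` compare equal regardless of how the input was encoded or composed.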
Related Documentation
- Validation Guardrails - Anti-hallucination checks
- System Prompts: Estonian - Estonian prompt rules
- Language Service - Language detection details