Estonian Language Challenges with LLMs

This section documents the challenges of working with Estonian in LLM-powered systems and the solutions implemented in Kingisoovitaja.

Executive Summary

Estonian's agglutinative morphology, extensive case system, and productive compound word formation make it harder to handle in LLM-powered systems than English.

Key Finding: Modern LLMs (GPT-4, GPT-5.1) understand Estonian reasonably well. The most significant issues come from the interaction between LLM outputs and downstream processing logic.

Estonian Language Characteristics

Estonian is a Finno-Ugric language with properties that challenge typical NLP systems:

1. Agglutinative Morphology

Words are built by combining morphemes:

luule (poetry) + raamatu- (book, stem) + -t (partitive) = luuleraamatut

2. Extensive Case System

14 grammatical cases, each with distinct suffixes:

Case	Example	Usage
Nominative	raamat	Subject: "the book"
Genitive	raamatu	Possession: "book's"
Partitive	raamatut	Object: "a book (obj.)"
Illative	raamatusse	Into: "into the book"
Inessive	raamatus	In: "in the book"
Elative	raamatust	From: "from the book"

...and 8 more cases

Impact: A single English keyword like "book" must match 14 singular Estonian forms, and as many again in the plural.
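
The matching problem can be sketched in Python. Eleven of the 14 singular cases attach a fixed suffix to the genitive form, so a keyword can be expanded from three principal forms (nominative, genitive, partitive). The function names and data layout below are illustrative assumptions, not Kingisoovitaja's actual implementation, and plural forms and short illatives are not covered.

```python
# Suffixes that attach to the genitive singular form (11 of the 14 cases).
GENITIVE_STEM_SUFFIXES = {
    "illative": "sse",
    "inessive": "s",
    "elative": "st",
    "allative": "le",
    "adessive": "l",
    "ablative": "lt",
    "translative": "ks",
    "terminative": "ni",
    "essive": "na",
    "abessive": "ta",
    "comitative": "ga",
}


def singular_case_forms(nominative: str, genitive: str, partitive: str) -> dict:
    """Expand three principal forms into all 14 singular case forms."""
    forms = {"nominative": nominative, "genitive": genitive, "partitive": partitive}
    for case, suffix in GENITIVE_STEM_SUFFIXES.items():
        forms[case] = genitive + suffix
    return forms


def matches_keyword(token: str, forms: dict) -> bool:
    """Return True if the token is one of the known case forms."""
    return token.lower() in set(forms.values())


book = singular_case_forms("raamat", "raamatu", "raamatut")
print(len(book))                             # 14
print(matches_keyword("Raamatusse", book))   # True
print(matches_keyword("raamatukogu", book))  # False
```

Exact-form matching like this avoids false positives such as "raamatukogu" (library), which a plain substring search for "raamat" would accept.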

3. Compound Word Formation

Estonian freely creates compound words:

luuleraamat = poetry book
krimiraamat = mystery book
ulmeraamat = sci-fi book
fantaasiaraamat = fantasy book
kokaraamat = cookbook

Each compound can take all 14 case endings:

kokaraamat → kokaraamatu, kokaraamatut, kokaraamatusse...

Result: Multiplicative growth in word variants, since every new compound brings its own full set of case forms.
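
One way to avoid enumerating every compound variant is to check whether a token ends in a known form of the head word and treat the remainder as the modifier. A minimal sketch, assuming a hypothetical `HEAD_FORMS` tuple that holds only the forms of "raamat" listed in this section:

```python
from typing import Optional, Tuple

# Forms of the head word "raamat" taken from the case table in this section.
HEAD_FORMS = ("raamat", "raamatu", "raamatut", "raamatusse", "raamatus", "raamatust")


def split_compound(token: str) -> Optional[Tuple[str, str]]:
    """Return (modifier, head_form) if the token is modifier + a form of 'raamat'."""
    word = token.lower()
    for form in HEAD_FORMS:
        # Require a non-empty modifier so the bare head word is not a compound.
        if word.endswith(form) and len(word) > len(form):
            return word[: -len(form)], form
    return None


print(split_compound("kokaraamatusse"))  # ('koka', 'raamatusse')
print(split_compound("luuleraamatut"))   # ('luule', 'raamatut')
print(split_compound("raamat"))          # None
```

With this approach the list of head-word forms is maintained once, and new modifiers (krimi, ulme, fantaasia) need no additional entries.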

4. Stem Changes & Consonant Gradation

Words change internally based on grammatical rules:

laps (child) → lapse (gen.) → lapsele (all.) → laste (pl.gen.)

This is NOT just suffix addition — the word stem itself changes.
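
Because the stem changes, stripping suffixes alone cannot recover the dictionary form. The sketch below contrasts naive suffix stripping with a lookup table of known forms. The `KNOWN_FORMS` table, the suffix list, and the function names are hypothetical and contain only the forms from the example above; a production system would more likely rely on a morphological analyzer.

```python
# Forms of "laps" from the example above, mapped back to the dictionary form.
KNOWN_FORMS = {
    "laps": "laps",
    "lapse": "laps",
    "lapsele": "laps",
    "laste": "laps",
}

SUFFIXES = ("le", "e")  # deliberately tiny, for illustration only


def naive_strip(token: str) -> str:
    """Strip the first matching suffix; ignores stem changes."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token: str) -> str:
    """Prefer the lookup table and fall back to naive stripping."""
    word = token.lower()
    return KNOWN_FORMS.get(word, naive_strip(word))


print(naive_strip("laste"))   # last  (not a form of "laps")
print(lemmatize("laste"))     # laps
print(lemmatize("Lapsele"))   # laps
```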

Problem Categories

Success Metrics

Before Comprehensive Fixes

Metric	Before
Estonian compound word detection	20%
Language switching errors	15-20%
Text repetition incidents	~5%
User satisfaction (Estonian)	3.2/5

After Comprehensive Fixes

Metric	After	Improvement
Estonian compound word detection	80-95%	+60 to +75 pp
Language switching errors	<5%	-70%
Text repetition incidents	<0.1%	-98%
User satisfaction (Estonian)	4.4/5	+38%
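
When reading the improvement figures, it helps to distinguish relative change from a change in percentage points. A small sketch using numbers from the two tables above (the helper names are illustrative):

```python
def relative_change(before: float, after: float) -> float:
    """Relative change as a percentage of the starting value."""
    return (after - before) / before * 100


def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


# User satisfaction: 3.2 -> 4.4 on a 5-point scale, roughly +37.5 % relative.
satisfaction = relative_change(3.2, 4.4)

# Compound detection: 20 % -> 80-95 %, i.e. +60 to +75 percentage points.
detection_points = (point_change(20, 80), point_change(20, 95))
```

By this arithmetic, the satisfaction figure reads as a relative change, while the compound-detection figure matches a percentage-point difference.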

Core Insight

Modern LLMs handle Estonian quite well semantically. The problems arose elsewhere:

Downstream rules overriding LLM's correct understanding
Keyword-based systems failing on morphological variations
Data truncation starving the LLM of context
Contradictory instructions causing degenerate behavior

Solution paradigm: Trust the LLM more, use rules less. Rules should validate and enhance LLM outputs, not override them.
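
The "validate and enhance, not override" principle can be sketched as follows. All names here (`LLMResult`, `rule_based_category`, `reconcile`, the keyword table, and the confidence threshold) are hypothetical and only illustrate the control flow, not Kingisoovitaja's actual pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LLMResult:
    category: Optional[str]  # None if the LLM gave no category
    confidence: float        # 0.0 to 1.0, supplied by the caller
    warnings: List[str] = field(default_factory=list)


# Hypothetical keyword rules, keyed by nominative forms that prefix the case forms.
RULE_STEMS = {"luuleraamat": "poetry", "kokaraamat": "cooking"}


def rule_based_category(query: str) -> Optional[str]:
    """Very small keyword rule: substring match on known compound stems."""
    lowered = query.lower()
    for stem, category in RULE_STEMS.items():
        if stem in lowered:
            return category
    return None


def reconcile(query: str, llm: LLMResult, min_confidence: float = 0.5) -> LLMResult:
    """Rules fill gaps and flag disagreement; a confident LLM answer is kept."""
    rule = rule_based_category(query)
    if llm.category is None or llm.confidence < min_confidence:
        if rule is not None:
            return LLMResult(rule, llm.confidence, llm.warnings + ["category filled by rule"])
        return llm
    if rule is not None and rule != llm.category:
        # Record the disagreement for later review instead of overriding.
        return LLMResult(
            llm.category,
            llm.confidence,
            llm.warnings + [f"rule suggested '{rule}', kept LLM category"],
        )
    return llm
```

In this sketch a rule can only supply a category when the LLM abstains or is unsure; otherwise a disagreement is logged as a warning and the LLM's answer stands.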

Documentation Structure

Detailed documentation for each challenge:

Compound Words - Detection and handling strategies
Morphological Cases - Dative case mapping and normalization
Mixed Language - Code-switching and detection
Text Quality - Repetition prevention and fixes
Cultural Context - Occasions, budgets, and localization
Best Practices - Lessons learned and recommendations

Quick Reference

Estonian Language Support Matrix

Feature	Status	Coverage
Compound word detection		~40 common compounds
Case ending support		3 main cases (nom/gen/part)
Recipient ("dative", Estonian allative -le) conversion		~20 common recipients
Mixed language queries		Pattern-based detection
Cultural calendar		9 Estonian occasions
Encoding normalization		UTF-8 + error correction
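
The "UTF-8 + error correction" row could correspond to a step like the following. This is a sketch under assumptions: it composes characters to NFC with the standard library and tries to undo one common failure mode (UTF-8 bytes that were decoded as Latin-1). The function names are illustrative.

```python
import unicodedata


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not Latin-1 mojibake; leave it untouched.
        return text


def normalize_estonian(text: str) -> str:
    """Repair mojibake, then compose sequences such as o + U+0303 into a single character."""
    return unicodedata.normalize("NFC", repair_mojibake(text))
```

Correctly encoded Estonian text passes through unchanged because letters such as õ, ä, ö, and ü do not form valid UTF-8 sequences when re-encoded as Latin-1, so the repair attempt fails and the original string is returned.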

Related Documentation

Validation Guardrails - Anti-hallucination checks
System Prompts: Estonian - Estonian prompt rules
Language Service - Language detection details
