Estonian Language Challenges with LLMs

This section documents the challenges of working with the Estonian language in LLM-powered systems and the solutions implemented in Kingisoovitaja.

Executive Summary

Estonian poses particular difficulties for LLM-powered systems because of its agglutinative morphology, extensive case system, and productive compound word formation.

Key Finding: Modern LLMs (GPT-4, GPT-5.1) understand Estonian reasonably well. The most significant issues come from the interaction between LLM outputs and downstream processing logic.

Estonian Language Characteristics

Estonian is a Finno-Ugric language with properties that challenge typical NLP systems:

1. Agglutinative Morphology

Words are built by combining morphemes:

luule (poetry) + raamat (book) + -ut (partitive) = luuleraamatut

2. Extensive Case System

14 grammatical cases, each with distinct suffixes:

| Case | Example | Usage |
|---|---|---|
| Nominative | raamat | Subject: "the book" |
| Genitive | raamatu | Possession: "book's" |
| Partitive | raamatut | Object: "a book (obj.)" |
| Illative | raamatusse | Into: "into the book" |
| Inessive | raamatus | In: "in the book" |
| Elative | raamatust | From: "from the book" |

...and 8 more cases

Impact: A single English keyword like "book" must match at least 14 Estonian singular forms, and roughly twice that once plural forms are included.
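To make the impact concrete, here is a minimal sketch of expanding one keyword into its singular case forms. It assumes the caller supplies the nominative, genitive, and partitive forms (these cannot be derived reliably by rule) and relies on the fact that the remaining eleven singular cases attach a fixed ending to the genitive form. The function name `singular_forms` is hypothetical and not taken from the Kingisoovitaja codebase.

```python
# Singular case endings that attach to the genitive form of a noun.
GENITIVE_STEM_ENDINGS = {
    "illative": "sse",
    "inessive": "s",
    "elative": "st",
    "allative": "le",
    "adessive": "l",
    "ablative": "lt",
    "translative": "ks",
    "terminative": "ni",
    "essive": "na",
    "abessive": "ta",
    "comitative": "ga",
}


def singular_forms(nominative: str, genitive: str, partitive: str) -> dict:
    """Return all 14 singular case forms, keyed by case name.

    The three principal forms are passed in explicitly; the other
    eleven are built as genitive + ending.
    """
    forms = {
        "nominative": nominative,
        "genitive": genitive,
        "partitive": partitive,
    }
    for case, ending in GENITIVE_STEM_ENDINGS.items():
        forms[case] = genitive + ending
    return forms


forms = singular_forms("raamat", "raamatu", "raamatut")
keyword_variants = set(forms.values())  # what a keyword matcher would need to accept
print(len(forms))
print(forms["illative"], forms["inessive"], forms["elative"])
```

The sketch ignores plural forms and alternative short forms, so a production matcher would more likely rely on a morphological analyzer than on a hand-built table like this.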

3. Compound Word Formation

Estonian freely creates compound words:

luuleraamat = poetry book
krimiraamat = mystery book
ulmeraamat = sci-fi book
fantaasiaraamat = fantasy book
kokaraamat = cookbook

Each compound can take all 14 case endings:

kokaraamat → kokaraamatu, kokaraamatut, kokaraamatusse...

Result: The number of word variants multiplies, because every compound brings its own full set of case forms.
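One way to avoid enumerating every variant is to look for a known head stem inside the token instead of matching whole words. The following is a sketch under stated assumptions: the head-stem list, the `CompoundMatch` type, and `split_compound` are all hypothetical, and the approach only works when the head stem itself stays unchanged across cases (as it does for raamat).

```python
from typing import NamedTuple, Optional, Tuple


class CompoundMatch(NamedTuple):
    modifier: str  # part before the head, e.g. "koka"
    head: str      # known head stem, e.g. "raamat"
    ending: str    # whatever follows the head, e.g. "usse"


def split_compound(
    token: str, head_stems: Tuple[str, ...] = ("raamat",)
) -> Optional[CompoundMatch]:
    """Split a token into modifier + head + ending if it contains a known head.

    Returns None for the bare head word (no modifier) and for tokens
    that do not contain any known head stem.
    """
    word = token.lower()
    for head in head_stems:
        idx = word.rfind(head)
        if idx > 0:  # idx == 0 means there is no modifier in front
            return CompoundMatch(word[:idx], head, word[idx + len(head):])
    return None


compounds = ["luuleraamat", "krimiraamat", "ulmeraamat", "fantaasiaraamat", "kokaraamat"]
print(len(compounds) * 14)  # singular forms an exact-match list would need
print(split_compound("kokaraamatusse"))
print(split_compound("raamatut"))
```

The ending is not validated here, so any token that happens to contain the stem will match. Checking the ending against a list of known case endings would reduce false positives.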

4. Stem Changes

Words change internally based on grammatical rules:

laps (child) → lapse (gen.) → lapsele (all.) → laste (pl.gen.)

This is NOT just suffix addition: the word stem itself changes (laps- becomes las- in laste).
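The laps example shows why prefix matching on the dictionary form is not enough. The sketch below contrasts a naive prefix check with a lookup from inflected form to lemma. The table `LEMMA_BY_FORM` is a hypothetical hand-written stand-in containing only the forms listed above; a real system would more plausibly obtain lemmas from a morphological analyzer.

```python
# Hypothetical lookup table: inflected form -> dictionary form (lemma).
LEMMA_BY_FORM = {
    "laps": "laps",
    "lapse": "laps",
    "lapsele": "laps",
    "laste": "laps",
}


def naive_matches(token: str, keyword: str) -> bool:
    """Prefix match: only works while the stem stays intact."""
    return token.lower().startswith(keyword)


def lemma_matches(token: str, keyword: str) -> bool:
    """Match on the lemma, so stem-changing forms are still found."""
    return LEMMA_BY_FORM.get(token.lower()) == keyword


for form in ("lapsele", "laste"):
    print(form, naive_matches(form, "laps"), lemma_matches(form, "laps"))
```

The prefix check accepts lapsele but misses laste, while the lemma lookup accepts both.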

Problem Categories

Success Metrics

Before Comprehensive Fixes

| Metric | Before |
|---|---|
| Estonian compound word detection | 20% |
| Language switching errors | 15-20% |
| Text repetition incidents | ~5% |
| User satisfaction (Estonian) | 3.2/5 |

After Comprehensive Fixes

| Metric | After | Improvement |
|---|---|---|
| Estonian compound word detection | 80-95% | up to +75 percentage points |
| Language switching errors | <5% | -70% |
| Text repetition incidents | <0.1% | -98% |
| User satisfaction (Estonian) | 4.4/5 | +38% |
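The Improvement figures appear to mix two kinds of change: an absolute difference in percentage points for compound detection, and relative change for the other rows. The arithmetic below is a sketch of how the figures can be reproduced, assuming the upper bounds of the reported ranges are used (for example, <0.1% is treated as 0.1%). The helper names are hypothetical.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages."""
    return after_pct - before_pct


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100


# Compound detection: 20% -> 95% (upper bound of 80-95%)
print(percentage_point_change(20, 95))

# Language switching errors: 15-20% -> 5% (upper bound of <5%)
print(round(relative_change(15, 5), 1), round(relative_change(20, 5), 1))

# Text repetition: 5% -> 0.1% (upper bound of <0.1%)
print(round(relative_change(5, 0.1)))

# User satisfaction: 3.2 -> 4.4
print(round(relative_change(3.2, 4.4), 1))
```

Under these assumptions the language-switching reduction falls between about 67% and 75%, which is consistent with the rounded -70% in the table.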

Core Insight

Modern LLMs handle Estonian semantics quite well. The problems arose elsewhere:

  1. Downstream rules overriding LLM's correct understanding
  2. Keyword-based systems failing on morphological variations
  3. Data truncation starving the LLM of context
  4. Contradictory instructions causing degenerate behavior

Solution paradigm: Trust the LLM more, use rules less. Rules should validate and enhance LLM outputs, not override them.
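As an illustration of "validate and enhance, not override", the sketch below reconciles an LLM-produced label with a rule-produced one. Everything here is hypothetical (the `Resolution` type, the `resolve` function, and the example category names); it is not the Kingisoovitaja implementation. The key property is that a valid LLM answer is never replaced by the rule result: a disagreement is recorded as a warning, and the rule is used only as a fallback.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Resolution:
    value: Optional[str]
    source: str  # "llm", "rule_fallback", or "none"
    warnings: List[str] = field(default_factory=list)


def resolve(llm_value: Optional[str], rule_value: Optional[str], allowed: Set[str]) -> Resolution:
    """Prefer a valid LLM value; use the rule value only when the LLM has none."""
    if llm_value in allowed:
        result = Resolution(llm_value, "llm")
        if rule_value and rule_value != llm_value:
            # The rule disagrees: log it for review, but do not override.
            result.warnings.append(f"rule suggested {rule_value!r}")
        return result
    if rule_value in allowed:
        return Resolution(rule_value, "rule_fallback", [f"invalid LLM value {llm_value!r}"])
    return Resolution(None, "none", ["no valid value from LLM or rules"])


allowed = {"poetry", "cookbook", "mystery"}
print(resolve("poetry", "cookbook", allowed))
print(resolve(None, "cookbook", allowed))
```

Logged disagreements can then be reviewed offline to decide whether a rule is outdated, instead of letting it silently change the LLM's answer.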

Documentation Structure

Detailed documentation for each challenge:

Quick Reference

Estonian Language Support Matrix

| Feature | Status | Coverage |
|---|---|---|
| Compound word detection | | ~40 common compounds |
| Case ending support | | 3 main cases (nom/gen/part) |
| Allative (recipient) case conversion | | ~20 common recipients |
| Mixed language queries | | Pattern-based detection |
| Cultural calendar | | 9 Estonian occasions |
| Encoding normalization | | UTF-8 + error correction |
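For the encoding normalization row, the section does not describe the actual error-correction logic. The following is a minimal sketch of one common approach, assuming the typical failure is UTF-8 text that was mistakenly decoded as Latin-1 (so õ shows up as Ãµ). The function names are hypothetical; only standard-library calls are used.

```python
import unicodedata


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    If the round trip fails, the text was most likely fine already,
    so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def normalize_text(text: str) -> str:
    """Repair mojibake, then compose characters into NFC form."""
    return unicodedata.normalize("NFC", repair_mojibake(text))


print(normalize_text("k\u00c3\u00b5ik"))  # mojibake form of "kõik"
print(normalize_text("o\u0303un"))        # "o" + combining tilde
print(normalize_text("\u00f5un"))         # already correct, left as-is
```

NFC normalization matters for matching because õ, ä, ö, and ü can each arrive either as a single code point or as a base letter plus a combining mark, and the two forms do not compare equal as strings.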