Estonian Language Best Practices
Lessons learned and recommendations from building Kingisoovitaja, applicable to any Estonian LLM project.
Key Lessons
Lesson 1: Agglutinative Languages Need Special Handling
What we learned:
- Keyword lists explode in size for Estonian (14 cases × compounds)
- Simple substring matching is not enough
- LLM understanding is often better than keyword matching
Recommendation:
- Rely more on the LLM and less on keywords for agglutinative languages such as Estonian, Finnish, and Turkish
- Use keywords as validation, not primary detection
- Maintain smaller, higher-confidence keyword lists
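A minimal sketch of keywords used as a validation layer rather than as the primary detector. The names `BOOK_FORMS`, `hasBookKeyword`, and `validateBookSignal` are hypothetical and not taken from the Kingisoovitaja codebase; the list holds only the nominative, genitive, and partitive forms of "raamat" (book).

```typescript
// Hypothetical high-confidence list: nominative, genitive, and partitive of "raamat" (book).
const BOOK_FORMS: string[] = ["raamat", "raamatu", "raamatut"];

type Verdict = "confirmed" | "llm-only" | "keyword-only" | "none";

// Estonian compounds put the head noun last, so match on word endings:
// "luuleraamatut" (poetry book) matches, "raamatukogusse" (to the library) does not.
function hasBookKeyword(message: string): boolean {
  const tokens = message.toLowerCase().split(/[\s.,!?;:"()]+/).filter(Boolean);
  return tokens.some((token) => BOOK_FORMS.some((form) => token.endsWith(form)));
}

// The keyword check only annotates the LLM signal; it never vetoes it.
function validateBookSignal(llmSaysBook: boolean, message: string): Verdict {
  const keyword = hasBookKeyword(message);
  if (llmSaysBook && keyword) return "confirmed";
  if (llmSaysBook) return "llm-only"; // keep the LLM signal, with lower confidence
  if (keyword) return "keyword-only"; // worth logging: the LLM may have missed it
  return "none";
}
```

Matching on word endings instead of raw substrings is one way to keep the list small while still catching compounds; other inflected forms (for example plurals) would still fall through to the LLM.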
Lesson 2: LLM Understanding ≠ System Behavior
What we learned:
- The LLM correctly understood "luuleraamatut" (partitive of "luuleraamat", poetry book)
- Downstream rules then overrode this correct understanding
- The problem was not the LLM but our rigid rules
Recommendation:
- Trust the LLM's semantic understanding
- Use rules for validation, not override
- If the LLM and the rules disagree, investigate why (the LLM is often right)
Lesson 3: Data Truncation Kills LLM Quality
What we learned:
- Truncating input text to 80 characters caused the LLM to repeat text
- The LLM tried to complete thoughts with insufficient context
- Estonian was hit harder (longer words mean less content in 80 characters)
Recommendation:
- Never starve the LLM of context
- Err on the side of too much data rather than too little
- If token limits are tight, reduce output length, not input context
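A minimal sketch of the "shrink the output, not the input" rule as plain arithmetic. The function name `planTokenBudget` and its parameters are hypothetical; the token counts are assumed to come from whatever tokenizer the deployment already uses.

```typescript
interface BudgetPlan {
  maxOutputTokens: number;
  inputFits: boolean; // false means the input itself must be shortened
}

// Reserve the full input first; the output budget absorbs any shortfall.
function planTokenBudget(
  contextWindow: number,
  inputTokens: number,
  desiredOutput: number,
  minOutput: number
): BudgetPlan {
  const room = contextWindow - inputTokens;
  if (room >= desiredOutput) return { maxOutputTokens: desiredOutput, inputFits: true };
  if (room >= minOutput) return { maxOutputTokens: room, inputFits: true };
  // Only when even the minimum output no longer fits is trimming the input justified.
  return { maxOutputTokens: minOutput, inputFits: false };
}
```

With a 4000-token window and 3700 input tokens, this plan keeps the input intact and lowers the output cap to 300 instead of cutting context.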
Lesson 4: Contradictory Instructions = Degenerate Behavior
What we learned:
- "Be detailed" + "Be brief" + "200 words max" = confusion
- The LLM tries to satisfy every contradictory requirement and fails
- Dynamic configuration made it worse
Recommendation:
- Provide ONE clear, unambiguous instruction
- Remove contradictory requirements
- Hardcode critical AI behavior (don't make it configurable)
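One way to enforce "one instruction per concern" is to reject prompts that carry competing instructions before they reach the LLM. This is only a sketch: the `Instruction` shape, the `dimension` labels, and both function names are assumptions made for illustration.

```typescript
interface Instruction {
  dimension: "length" | "tone" | "language";
  text: string;
}

// Returns every dimension that has more than one instruction competing for it.
function findConflicts(instructions: Instruction[]): string[] {
  const counts = new Map<string, number>();
  for (const instruction of instructions) {
    counts.set(instruction.dimension, (counts.get(instruction.dimension) || 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > 1)
    .map(([dimension]) => dimension);
}

// Refuses to build a prompt that contains competing instructions.
function buildPrompt(instructions: Instruction[]): string {
  const conflicts = findConflicts(instructions);
  if (conflicts.length > 0) {
    throw new Error("Conflicting instructions for: " + conflicts.join(", "));
  }
  return instructions.map((instruction) => "- " + instruction.text).join("\n");
}
```

Failing loudly at build time makes a contradiction such as "Be detailed" plus "Be brief" a configuration error instead of degraded output.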
Lesson 5: Mixed Language is Common
What we learned:
- Estonian users naturally code-switch: "Otsin sci-fi raamatut"
- Technical terms often stay in English
- Proper nouns add language ambiguity
Recommendation:
- Plan for mixed-language queries from day one
- Don't assume single-language per query
- Default to the primary market language for ambiguous cases
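A minimal sketch of per-query detection that tolerates code-switching and falls back to the market default. The word lists are tiny and purely illustrative, and `detectLanguage` is a hypothetical helper, not the project's detector.

```typescript
type Lang = "et" | "en";

// Tiny illustrative word lists; a real system would use larger, curated ones.
const ET_WORDS = new Set(["otsin", "ja", "ning", "on", "mulle", "kingitus", "raamatut", "emale"]);
const EN_WORDS = new Set(["the", "for", "and", "looking", "gift", "my", "book", "with"]);

// Counts independent signals per language and falls back to the market default on a tie.
function detectLanguage(message: string, defaultLang: Lang = "et"): Lang {
  const tokens = message.toLowerCase().split(/[\s.,!?;:"()]+/).filter(Boolean);
  let et = 0;
  let en = 0;
  for (const token of tokens) {
    if (ET_WORDS.has(token)) et++;
    if (EN_WORDS.has(token)) en++;
    if (/[õäöü]/.test(token)) et++; // Estonian-specific letters count as an extra signal
  }
  if (et > en) return "et";
  if (en > et) return "en";
  return defaultLang; // ambiguous input such as a bare proper noun
}
```

A query like "Otsin sci-fi raamatut" scores as Estonian even though "sci-fi" carries no signal, and a bare product name falls through to the default.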
Lesson 6: Cultural Context Can't Be Hardcoded
What we learned:
- Valentine's Day is not only for romantic partners (in Estonia it is also "Friend's Day", sõbrapäev)
- Occasions fall on different dates per market (Estonian Father's Day is the second Sunday of November, not in June)
- Gift budgets vary by culture
Recommendation:
- Build culture-specific calendars
- Make cultural rules configurable per market
- Budget expectations should be data-driven
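A minimal sketch of a per-market occasion calendar that supports both fixed and floating dates. The `OccasionRule` shape, the `CALENDAR` keys, and `resolveDate` are hypothetical; Father's Day is used as the example, and the entries should be verified per market before use.

```typescript
type OccasionRule =
  | { kind: "fixed"; month: number; day: number }
  | { kind: "nthWeekday"; month: number; weekday: number; n: number }; // weekday: 0 = Sunday

// Hypothetical per-market configuration (months are 1-based).
const CALENDAR: Record<string, Record<string, OccasionRule>> = {
  EE: {
    sobrapaev: { kind: "fixed", month: 2, day: 14 }, // Valentine's Day, also Friend's Day
    isadepaev: { kind: "nthWeekday", month: 11, weekday: 0, n: 2 }, // Father's Day
  },
  US: {
    fathersDay: { kind: "nthWeekday", month: 6, weekday: 0, n: 3 },
  },
};

// Resolves a rule to a concrete UTC date for the given year.
function resolveDate(rule: OccasionRule, year: number): Date {
  if (rule.kind === "fixed") {
    return new Date(Date.UTC(year, rule.month - 1, rule.day));
  }
  const first = new Date(Date.UTC(year, rule.month - 1, 1));
  const offset = (rule.weekday - first.getUTCDay() + 7) % 7; // days until the first such weekday
  return new Date(Date.UTC(year, rule.month - 1, 1 + offset + 7 * (rule.n - 1)));
}
```

Keeping the rules as data means a new market only needs a new `CALENDAR` entry, not new date logic.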
Lesson 7: Testing Must Include Actual Language
What we learned:
- English tests passed, Estonian failed
- Compound words revealed flaws English tests missed
- Mixed-language queries broke assumptions
Recommendation:
- Test in target language(s) from day one
- Include compound words, case endings, and mixed-language queries
- Don't rely on an English-only test suite
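A minimal sketch of a table-driven test set that covers case endings, compounds, mixed-language input, and single-word queries. The `CASES` table and both detectors are illustrative stand-ins; in practice `detect` would wrap the real keyword or LLM path.

```typescript
interface Case {
  query: string;
  expectBook: boolean;
  note: string;
}

const CASES: Case[] = [
  { query: "Otsin raamatut", expectBook: true, note: "partitive" },
  { query: "Hea raamatu soovitus", expectBook: true, note: "genitive" },
  { query: "Otsin luuleraamatut", expectBook: true, note: "compound + partitive" },
  { query: "Otsin sci-fi raamatut", expectBook: true, note: "mixed language" },
  { query: "Raamat", expectBook: true, note: "single word" },
  { query: "Otsin lilli emale", expectBook: false, note: "no book" },
];

// Returns the failing cases for any detector, so one table can test every detection path.
function failures(detect: (query: string) => boolean, cases: Case[] = CASES): Case[] {
  return cases.filter((c) => detect(c.query) !== c.expectBook);
}

// Stand-in detectors for the demo only.
const englishOnly = (query: string): boolean => /\bbook\b/i.test(query);
const naiveEstonian = (query: string): boolean => query.toLowerCase().includes("raamat");
```

Running `failures(englishOnly)` returns every Estonian book query in the table, which is the kind of gap an English-only suite would never surface.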
Architectural Patterns
Pattern 1: LLM + Rule Hybrid
Principle: LLM for understanding, rules for precision
User query
↓
LLM extraction (understanding)
↓
Rule-based validation (precision)
↓
Safety override (catch errors)
↓
Final decision
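The flow above could be sketched as a chain of small stages applied to the LLM's extraction. Everything here is an assumption for illustration: the `Extraction` shape, the `KNOWN_TYPES` set, and the stage names are hypothetical, and the LLM call itself is represented by the `llmOutput` argument.

```typescript
interface Extraction {
  productType: string | null;
  notes: string[];
}

type Stage = (extraction: Extraction, query: string) => Extraction;

// Hypothetical set of product types the catalogue can serve.
const KNOWN_TYPES = new Set(["Raamat", "Mäng"]);

// Rule-based validation: reject product types the catalogue does not know.
const validate: Stage = (e) =>
  e.productType !== null && !KNOWN_TYPES.has(e.productType)
    ? { productType: null, notes: [...e.notes, "unknown type: " + e.productType] }
    : e;

// Safety override: recover an obvious signal that earlier stages dropped.
const safety: Stage = (e, query) =>
  e.productType === null && query.toLowerCase().includes("raamat")
    ? { productType: "Raamat", notes: [...e.notes, "restored from query"] }
    : e;

// Final decision: run the LLM output through each stage in order.
function decide(llmOutput: Extraction, query: string): Extraction {
  return [validate, safety].reduce((acc, stage) => stage(acc, query), llmOutput);
}
```

Each stage appends to `notes`, so the final decision carries a record of which layer changed it and why.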
Pattern 2: Signal Preservation
Principle: Preserve initial LLM signals before modifications
// Capture the initial state before any rule touches it
const initialBookSignal = isBookSignal(context.productType);

// Allow modifications
context = sanitize(context);
context = applyRules(context);

// Safety check: detect a book signal that was lost along the way
if (initialBookSignal && context.productType !== 'Raamat') {
  console.warn('Signal lost!');
  // Restore or override
}
Pattern 3: Multilayer Fallbacks
Principle: Multiple independent checks
const wantsBooks =
  checkLLMSignals(context) ||          // Layer 1
  checkKeywords(userMessage) ||        // Layer 2
  checkConversationHistory(history);   // Layer 3

// Only exclude if ALL layers say no
if (!wantsBooks) {
  excludeBooks();
}
Pattern 4: Comprehensive Telemetry
Principle: Log every decision point
console.log('[TELEMETRY] wantsBooksFromContext:', {
  decision: wantsBooks,
  signals: {
    productTypeIsBook,
    messageHasBookKeyword,
    negativePhrase
  },
  context: { productType, userMessage },
  timestamp: new Date().toISOString()
});
When to Use LLM vs Rules
Use LLM for:
- Understanding user intent
- Handling natural language variations
- Context-dependent decisions
- Ambiguity resolution
Use Rules for:
- Validation of LLM outputs
- Hard constraints (age appropriateness)
- Cultural/business logic
- Edge case handling
Use Hybrid for:
- Language detection (rules + LLM)
- Product type detection (LLM + keywords)
- Occasion inference (LLM + calendar)
Recommendations for Estonian NLP
Based on our experience:
1. Architecture
- Use LLM for understanding, rules for validation
- Preserve initial LLM signals
- Implement multilayer fallbacks
- Log every decision point (comprehensive telemetry)
2. Keyword Handling
- Keep lists small and high-confidence
- Include the 3 main cases (nominative, genitive, partitive)
- Use for validation, not primary detection
- Use substring matching with caution
3. Language Detection
- Multi-signal approach
- Default to Estonian for ambiguous input
- Handle mixed-language gracefully
- Use conversation history
4. LLM Prompting
- Explicit Estonian grammar instructions
- Examples of correct Estonian
- Avoid contradictory instructions
- Hardcode critical behavior
5. Testing
- Test with actual Estonian text
- Include compound words
- Test case variations
- Mix in English technical terms
- Edge cases: single-word, proper nouns
6. Cultural Adaptation
- Estonian cultural calendar
- Local gift-giving norms
- Market-based budget expectations
- Age/gender appropriateness rules
7. Data Quality
- Never truncate aggressively
- Provide full product information
- UTF-8 encoding everywhere
- Normalize Unicode early (NFC)
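A short illustration of why early NFC normalization matters for Estonian text: "õ" can arrive either precomposed or as "o" plus a combining tilde, and the two forms do not compare equal until normalized. The helper name `normalizeText` is hypothetical; `String.prototype.normalize` is the standard JavaScript API.

```typescript
// "sõber" (friend) with a precomposed õ (U+00F5).
const precomposed = "s\u00F5ber";
// The same word with "o" followed by a combining tilde (U+0303).
const decomposed = "so\u0303ber";

// Normalize once at the system boundary so keyword matching and caching see one form.
function normalizeText(input: string): string {
  return input.normalize("NFC");
}

const equalBefore = precomposed === decomposed;
const equalAfter = normalizeText(precomposed) === normalizeText(decomposed);
```

Here `equalBefore` is false and `equalAfter` is true, which is why a keyword list can silently miss input that looks identical on screen.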
Broader Implications
These lessons apply to other morphologically rich languages:
- Finnish (Finno-Ugric, 15 cases)
- Turkish (Agglutinative, extensive suffixation)
- Hungarian (Finno-Ugric, ~18 cases)
- Arabic (Root-based morphology)
- Japanese (Agglutinative, particle-based)
Universal principle: Rely on the LLM's semantic understanding and augment it with linguistic rules, but don't let rules silently override what the LLM understood correctly.
Future Improvements
1. LLM-Based Classification
Current: Keyword lists
Proposed: Add explicit classification field
interface GiftContext {
  bookRequestType: 'explicit' | 'implicit' | 'none';
}
Benefits:
- Handles compounds without enumerating them
- Far less keyword maintenance
- Context-aware
2. Morphological Analyzer
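Current: Hardcoded dative (allative) mappings
Proposed: Integrate Vabamorf
// Hypothetical binding: Vabamorf is a C++ library, commonly used from Python
// via EstNLTK; the 'vabamorf' module and analyzeMorphology are assumed names.
import { analyzeMorphology } from 'vabamorf';

// Estonian has no dative case; the allative (-le) fills the "to/for someone" role.
function toEstonianDative(noun: string): string {
  const analysis = analyzeMorphology(noun);
  return analysis.inflect({ case: 'allative' });
}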
Benefits:
- Handles any Estonian noun the analyzer knows, not just the hardcoded ones
- Covers irregular stems and edge cases that hand-written mappings miss
3. Conversation-Aware Language Switching
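Current: Per-message detection
Proposed: Track conversation language
interface ConversationContext {
  primaryLanguage: 'et' | 'en';
  languageConfidence: number;
}

// Switch only if the current message gives a strong signal
function updateLanguage(
  ctx: ConversationContext,
  currentLanguage: 'et' | 'en',
  confidence: number
): ConversationContext {
  if (currentLanguage !== ctx.primaryLanguage && confidence > 0.9) {
    return { primaryLanguage: currentLanguage, languageConfidence: confidence };
  }
  return ctx;
}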
Related Documentation
- Estonian Overview - Challenge overview
- Compound Words - Compound handling
- Morphology - Case system
- Mixed Language - Code-switching
- Text Quality - Repetition prevention
- Cultural Context - Localization