# Estonian Text Quality & Repetition Prevention
The text repetition problem ("cover cover cover cover") revealed critical issues with data truncation and contradictory prompt instructions, both of which hit Estonian more severely than English.
## The Problem
Scenario: LLM responses contained repetitive text:

```
AI: "This classic edition offers the iconic black cover cover cover cover"
```
Why critical:
- Destroys user trust
- Makes responses look broken or buggy
- Hits Estonian harder (less training data)
## Root Causes
### Root Cause 1: Severe Data Truncation
Location: `response-orchestrator.ts:230-241`

The problem:
```typescript
// BEFORE (Starving AI)
const titleShort = (p.title || '').slice(0, 60).trim(); // 60 chars
const descShort = descRaw
  ? (descRaw.length > 80 ? `${descRaw.slice(0, 80)}…` : descRaw) // 80 chars
  : '';
```
Why this caused repetition:

Product description:

```
"This classic edition offers the iconic black cover with ring symbol and features..."
```

Truncated to:

```
"This classic edition offers the iconic black cover with ring symbol and featur…"
```

The LLM sees an incomplete thought ending with "featur…" and tries to complete it, falling back on repeating "cover".
Estonian Impact:

Estonian words are longer on average:
- English: "book" (4 chars)
- Estonian: "raamat" (6 chars)
- English: "poetry" (6 chars)
- Estonian: "luulekogu" (9 chars)

→ An 80-character truncation cuts more semantic content in Estonian.
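A quick sketch of the failure mode (both strings are illustrative, and the Estonian one is only a rough translation):

```typescript
// Both strings are illustrative; the Estonian is only a rough translation.
const english =
  'This classic edition offers you the iconic black cover with ring symbol and features gold lettering.';
const estonian =
  'See klassikaline väljaanne pakub ikoonilist musta kaant sõrmusesümboliga ja kuldsete tähtedega.';

// The old 80-character hard cut from response-orchestrator.ts.
const hardCut = (s: string): string =>
  s.length > 80 ? `${s.slice(0, 80)}…` : s;

console.log(hardCut(english));  // "…ring symbol and feat…" - cut mid-word
console.log(hardCut(estonian)); // also cut mid-word, with fewer whole words surviving
```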
### Root Cause 2: Contradictory Instructions
```
// LLM receives ALL of these:
1. "Write detailed, conversion-focused descriptions" (encourages length)
2. "Up to 200 words total"                           (limits length)
3. "Be brief and precise"                            (contradicts #1)
4. "3-4 sentences per product"                       (moderate length)
```
Result: Confusion → Degenerate behavior → Repetition
Estonian Impact:

Estonian packs the same meaning into fewer but longer words, while full sentences still average more words:
- English: "for teacher" (2 words)
- Estonian: "õpetajale" (1 word, but 9 characters)
- English sentence: 10-15 words on average
- Estonian sentence: 12-18 words on average

→ Word-count limits hit Estonian harder.
### Root Cause 3: Token Pressure
```typescript
max_tokens: 2500 // Too tight for Estonian
```
Requirements (rough English estimate):
- Intro: ~50 tokens
- 3 products × 3-4 sentences × ~25 tokens/sentence = 225-300 tokens
- Metadata: ~50 tokens
- Total needed: ~400 tokens
- Comfortable: ~600 tokens
With Estonian:
- Estonian words split into more tokens
- The same semantic content costs more tokens
- The 2500 limit gets tight
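Spelled out as code (the per-item costs are the rough estimates above; the Estonian inflation factor is illustrative, not measured):

```typescript
// Back-of-envelope output budget; per-item costs are the estimates above.
const intro = 50;          // intro tokens
const products = 3;
const sentences = 4;       // upper end of the 3-4 range
const perSentence = 25;    // rough English cost per sentence
const metadata = 50;

const needed = intro + products * sentences * perSentence + metadata;
console.log(needed); // 400 - matches the "~400 tokens" figure above

// Estonian inflates every line item: the same content costs more tokens.
// The factor is illustrative, not a measured value.
const estonianFactor = 1.5;
console.log(Math.round(needed * estonianFactor)); // 600
```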
## Solutions
### Fix 1: Remove Severe Truncation
```typescript
// AFTER (Providing meaningful context)
const titleShort = (p.title || '').slice(0, 150).trim(); // 150 chars (2.5× more)
const descShort = descRaw
  ? (descRaw.length > 300 ? `${descRaw.slice(0, 300)}…` : descRaw) // 300 chars (3.75× more)
  : '';
```
Impact:
- Titles: 60 → 150 chars (2.5× more context)
- Descriptions: 80 → 300 chars (3.75× more context)
- AI sees complete product information
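Because mid-word cuts were the actual trigger, one further hardening step (a sketch, not part of the shipped fix) is truncating at the last word boundary inside the limit, so the model never sees a fragment like "featur…":

```typescript
// Sketch (not part of the shipped fix): cut at the last space within
// maxLen so the model never sees a split word.
function truncateAtWord(text: string, maxLen: number): string {
  if (text.length <= maxLen) return text;
  const slice = text.slice(0, maxLen);
  const lastSpace = slice.lastIndexOf(' ');
  return `${lastSpace > 0 ? slice.slice(0, lastSpace) : slice}…`;
}

console.log(
  truncateAtWord(
    'This classic edition offers the iconic black cover with ring symbol and features...',
    80,
  ),
); // "This classic edition offers the iconic black cover with ring symbol and…"
```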
### Fix 2: Eliminate Contradictions
```typescript
// BEFORE - Config-driven (DANGEROUS)
import { chatConfig } from '@/app/chat/config'; // REMOVED
const sentenceRange = chatConfig.sentencesPerProduct >= 4 ? '3-4' : '2-3';
const userSuffix = `(up to ${config.maxWords} words)`;

// AFTER - Hardcoded (SAFE)
const sentenceRange = '3–4'; // Hardcoded everywhere
// Removed ALL word count limits
// Removed conflicting brevity instructions
```
Critical changes:
- Removed the `chatConfig` dependency ENTIRELY
- Hardcoded `'3–4'` in all locations
- Removed ALL `maxWords` references
- ONE clear instruction
Result: ZERO contradictions, ZERO runtime variations
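A sketch of the hardened shape (identifiers here are illustrative, not the real names in response-orchestrator.ts): one hardcoded constant, one instruction, nothing read from config at runtime.

```typescript
// Illustrative sketch - these identifiers are not the real names in
// response-orchestrator.ts. The point: one hardcoded, non-contradictory voice.
const SENTENCE_RANGE = '3–4'; // never derived from chatConfig

function buildStyleInstruction(): string {
  // A single instruction, stated once: no word caps, no "be brief".
  return `Write detailed, conversion-focused descriptions, ${SENTENCE_RANGE} sentences per product.`;
}
```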
### Fix 3: Increase Token Limit
```typescript
// BEFORE
max_tokens: 2500 // Too tight

// AFTER
max_tokens: 3000 // 20% buffer
```
Calculation (with more realistic per-sentence token costs for Estonian):
- Intro: ~150 tokens
- 3 products × 3-4 sentences × ~80 tokens = 720-960 tokens
- Total needed: ~900-1100 tokens
- New limit: 3000 (comfortable buffer for Estonian)
### Fix 4: Repetition Detection (Safety Net)
```typescript
// Detection thresholds
const CONSECUTIVE_THRESHOLD = 3; // "cover cover cover"
const REPETITION_THRESHOLD = 3;  // Frequent repetition
const REPETITION_WINDOW = 20;    // Last 20 words

// In streaming loop
const lastWords = accumulatedWords.slice(-REPETITION_WINDOW);
const wordCounts = new Map<string, number>();
for (const word of lastWords) {
  wordCounts.set(word, (wordCounts.get(word) || 0) + 1);
  if (wordCounts.get(word)! >= REPETITION_THRESHOLD) {
    console.warn('[REPETITION] Detected:', word);
    break; // Stop streaming
  }
}
```
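For reference, a self-contained version of the detector (a sketch built on the thresholds above; `detectRepetition` is an illustrative name, and this version also applies `CONSECUTIVE_THRESHOLD`, which the excerpt declares but does not use):

```typescript
// Sketch built on the thresholds above; detectRepetition is an
// illustrative name, not the actual identifier in the codebase.
const CONSECUTIVE_THRESHOLD = 3; // "cover cover cover"
const REPETITION_THRESHOLD = 3;  // same word 3× within the window
const REPETITION_WINDOW = 20;    // inspect the last 20 words

// Returns the offending word if the text looks degenerate, else null.
function detectRepetition(accumulated: string): string | null {
  const words = accumulated.toLowerCase().split(/\s+/).filter(Boolean);

  // Check 1: the same word back-to-back CONSECUTIVE_THRESHOLD times.
  let run = 1;
  for (let i = 1; i < words.length; i++) {
    run = words[i] === words[i - 1] ? run + 1 : 1;
    if (run >= CONSECUTIVE_THRESHOLD) return words[i];
  }

  // Check 2: the same word recurring REPETITION_THRESHOLD times within
  // the recent window (common stop words may need an allowlist in practice).
  const counts = new Map<string, number>();
  for (const word of words.slice(-REPETITION_WINDOW)) {
    const n = (counts.get(word) ?? 0) + 1;
    counts.set(word, n);
    if (n >= REPETITION_THRESHOLD) return word;
  }
  return null;
}
```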
## Test Results
Query: "ingliskeelne raamat" ("an English-language book")
Before:

```
AI: "This classic edition offers the iconic black cover cover cover cover"
```

- Repetition
- Cut off mid-sentence
- Poor UX
After:

```
AI: "Great, valisin sulle kaks suurepärast ingliskeelset raamatut:

1. **Czar's Madman**
See on ingliskeelne teos, mis kuulub ilukirjanduse kategooriasse...

2. **Lord Of The Rings II: Two Towers**
See ingliskeelne ilukirjandusteos on tõeline klassik..."
```

(Roughly: "Great, I picked two excellent English-language books for you: … This is an English-language work in the fiction category… This English-language work of fiction is a true classic…")
- NO repetition
- Complete sentences
- Proper 3-4 sentence descriptions
Safety net status: repetition detection was NOT triggered, confirming the root causes were fixed at the source.
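This class of bug is easy to guard against in CI (a sketch assuming the vitest test runner and the illustrative `detectRepetition` from Fix 4):

```typescript
// Sketch of a regression guard; assumes the vitest test runner and the
// illustrative detectRepetition sketch from Fix 4.
import { expect, test } from 'vitest';

test('flags degenerate repetition', () => {
  const buggy = 'This classic edition offers the iconic black cover cover cover cover';
  expect(detectRepetition(buggy)).toBe('cover');
});

test('passes clean Estonian output', () => {
  const clean = 'See on ingliskeelne teos, mis kuulub ilukirjanduse kategooriasse.';
  expect(detectRepetition(clean)).toBeNull();
});
```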
## Why Estonian Was More Affected
1. More tokens per semantic unit: Estonian words split into more tokens (see the sketch after this list)
   - "poetry book" (English) = 2-3 tokens
   - "luuleraamat" (Estonian) = 3-4 tokens
2. Truncation hit harder: the 80-char limit cut more meaning from Estonian text
3. Less training data: GPT models have seen far less Estonian
   - English: robust under pressure
   - Estonian: more fragile, degrades faster
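The token-count claims above can be checked directly (a sketch assuming the js-tiktoken npm package and the cl100k_base encoding; exact counts vary by tokenizer and model):

```typescript
// Sketch: compare real token counts for English vs Estonian phrases.
// Assumes the js-tiktoken npm package; counts vary by encoding and model.
import { getEncoding } from 'js-tiktoken';

const enc = getEncoding('cl100k_base');

for (const phrase of ['poetry book', 'luuleraamat', 'for teacher', 'õpetajale']) {
  console.log(`${phrase} → ${enc.encode(phrase).length} tokens`);
}
// Estonian words fall outside the tokenizer's common merges, so they
// split into more, shorter tokens per semantic unit.
```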
After the fix, Estonian responses are as good as English ones.
## Related Documentation
- Estonian Overview - Overview of challenges
- Best Practices - Implementation guidelines