Estonian Text Quality & Repetition Prevention

The text repetition problem ("cover cover cover cover") revealed critical issues with data truncation and contradictory instructions that affected Estonian more severely.

The Problem

Scenario: LLM responses contained repetitive text:

AI: "This classic edition offers the iconic black cover cover cover cover"

Why critical:

  • Destroys user trust
  • Appears broken/buggy
  • Estonian MORE affected (less training data)

Root Causes

Root Cause 1: Severe Data Truncation

Location: response-orchestrator.ts:230-241

The Problem:

```typescript
// BEFORE (starving the AI of context)
const titleShort = (p.title || '').slice(0, 60).trim(); // 60 chars
const descShort = descRaw
  ? (descRaw.length > 80 ? `${descRaw.slice(0, 80)}…` : descRaw) // 80 chars
  : '';
```

Why this caused repetition:

Product description: 
"This classic edition offers the iconic black cover with ring symbol and features..."

Truncated to:
"This classic edition offers the iconic black cover with ring symbol and featur…"

LLM sees: Incomplete thought ending with "featur…"
LLM tries to: Complete the thought, repeats "cover"

Estonian Impact:

Estonian words are longer on average:

  • English: "book" (4 chars)
  • Estonian: "raamat" (6 chars)
  • English: "poetry" (6 chars)
  • Estonian: "luulekogu" (9 chars)

→ 80-character truncation cuts MORE semantic content in Estonian
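A quick sketch makes this concrete (the word pair and the 15-character budget are illustrative stand-ins for the real 80-character limit):

```typescript
// Sketch: the same character budget keeps fewer whole words in Estonian.
const english = "poetry book for teacher";
const estonian = "luulekogu õpetajale"; // same meaning, longer words

const cutEn = english.slice(0, 15);  // "poetry book for" - ends on a word boundary
const cutEt = estonian.slice(0, 15); // "luulekogu õpeta" - cut mid-word

console.log(cutEn, "|", cutEt);
```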

Root Cause 2: Contradictory Instructions

```
// The LLM receives ALL of these at once:
1. "Write detailed, conversion-focused descriptions" (encourages length)
2. "Up to 200 words total" (limits length)
3. "Be brief and precise" (contradicts #1)
4. "3-4 sentences per product" (moderate length)
```

Result: Confusion → Degenerate behavior → Repetition

Estonian Impact:

Estonian requires more words for same meaning:

  • English: "for teacher" (2 words)
  • Estonian: "õpetajale" (1 word, but 9 characters)
  • English sentence: 10-15 words average
  • Estonian sentence: 12-18 words average

→ Word count limits hit Estonian harder

Root Cause 3: Token Pressure

```typescript
max_tokens: 2500 // too tight for Estonian
```

Requirements:

  • Intro: ~50 tokens
  • 3 products × 3-4 sentences × ~25 tokens/sentence = 225-300 tokens
  • Metadata: ~50 tokens
  • Total needed: ~325-400 tokens
  • Comfortable: ~600 tokens

With Estonian:

  • Estonian tokens are longer
  • Same semantic content = more tokens
  • 2500 limit gets tight
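The budget math above can be sketched as a tiny estimator (the per-sentence cost and the Estonian multiplier are illustrative assumptions, not measured values):

```typescript
// Sketch: rough token budget for a 3-product response.
// Per-item costs are assumptions for illustration, not measurements.
const TOKENS_PER_SENTENCE = 25; // rough average for a short English sentence
const ESTONIAN_FACTOR = 1.5;    // Estonian needs more tokens per semantic unit

function estimateTokens(products: number, sentences: number, estonian: boolean): number {
  const intro = 50;
  const metadata = 50;
  const body = products * sentences * TOKENS_PER_SENTENCE;
  const total = intro + body + metadata;
  return Math.ceil(estonian ? total * ESTONIAN_FACTOR : total);
}

console.log(estimateTokens(3, 4, false)); // 400 - matches the figures above
console.log(estimateTokens(3, 4, true));  // 600 - the same content in Estonian
```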

Solutions

Fix 1: Remove Severe Truncation

```typescript
// AFTER (providing meaningful context)
const titleShort = (p.title || '').slice(0, 150).trim(); // 150 chars (2.5× more)
const descShort = descRaw
  ? (descRaw.length > 300 ? `${descRaw.slice(0, 300)}…` : descRaw) // 300 chars (3.75× more)
  : '';
```

Impact:

  • Titles: 60 → 150 chars (2.5× more context)
  • Descriptions: 80 → 300 chars (3.75× more context)
  • AI sees complete product information
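Even with the longer limits, a hard slice can still land mid-word. One way to avoid that (a sketch, not the project's actual code; `truncateAtWord` is a hypothetical helper) is to cut at the last word boundary and append an explicit ellipsis:

```typescript
// Sketch: truncate at a word boundary so the model never sees a half-word.
// truncateAtWord is a hypothetical helper; maxLen mirrors the new 300-char limit.
function truncateAtWord(text: string, maxLen: number): string {
  if (text.length <= maxLen) return text;
  const cut = text.slice(0, maxLen);
  const lastSpace = cut.lastIndexOf(" ");
  // Fall back to the hard cut if there is no space to break on.
  const safe = lastSpace > 0 ? cut.slice(0, lastSpace) : cut;
  return `${safe}…`;
}

console.log(truncateAtWord("See luulekogu sobib hästi õpetajale", 20)); // "See luulekogu sobib…"
```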

Fix 2: Eliminate Contradictions

```typescript
// BEFORE - Config-driven (DANGEROUS)
import { chatConfig } from '@/app/chat/config'; // REMOVED
const sentenceRange = chatConfig.sentencesPerProduct >= 4 ? '3-4' : '2-3';
const userSuffix = `(up to ${config.maxWords} words)`;

// AFTER - Hardcoded (SAFE)
const sentenceRange = '3–4'; // Hardcoded everywhere
// Removed ALL word count limits
// Removed conflicting brevity instructions
```

Critical changes:

  1. Removed chatConfig dependency ENTIRELY
  2. Hardcoded '3–4' in all locations
  3. Removed ALL maxWords references
  4. ONE clear instruction

Result: ZERO contradictions, ZERO runtime variations
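A minimal sketch of what the hardcoded path looks like (`buildProductInstruction` is an illustrative name, not the project's actual identifier):

```typescript
// Sketch: one fixed instruction, no config-driven variation.
// buildProductInstruction is a hypothetical name for illustration.
function buildProductInstruction(): string {
  const sentenceRange = "3–4"; // hardcoded: the single source of truth
  return `Describe each product in ${sentenceRange} complete sentences. ` +
    `Finish every sentence; never cut a thought short.`;
}

// Every call returns the identical instruction - no runtime variation.
console.log(buildProductInstruction());
```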

Fix 3: Increased Token Limit

```typescript
// BEFORE
max_tokens: 2500 // Too tight

// AFTER
max_tokens: 3000 // 20% buffer
```

Calculation:

  • Intro: ~150 tokens
  • 3 products × 3-4 sentences × ~80 tokens = 720-960 tokens
  • Total needed: ~900-1100 tokens
  • New limit: 3000 (comfortable buffer for Estonian)

Fix 4: Repetition Detection (Safety Net)

```typescript
// Detection thresholds
const CONSECUTIVE_THRESHOLD = 3; // "cover cover cover"
const REPETITION_THRESHOLD = 3;  // frequent repetition within the window
const REPETITION_WINDOW = 20;    // look at the last 20 words

// In the streaming loop
const lastWords = accumulatedWords.slice(-REPETITION_WINDOW);
const wordCounts = new Map<string, number>();

for (const word of lastWords) {
  wordCounts.set(word, (wordCounts.get(word) || 0) + 1);

  if (wordCounts.get(word)! >= REPETITION_THRESHOLD) {
    console.warn('[REPETITION] Detected:', word);
    break; // stop streaming
  }
}
```
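The loop above counts word frequency inside the window; the declared CONSECUTIVE_THRESHOLD suggests a second, run-based check for immediate repeats. A self-contained sketch of that check, assuming simple whitespace tokenization:

```typescript
// Sketch: detect N identical words in a row ("cover cover cover").
function hasConsecutiveRepeat(text: string, threshold = 3): boolean {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  let run = 1;
  for (let i = 1; i < words.length; i++) {
    run = words[i] === words[i - 1] ? run + 1 : 1;
    if (run >= threshold) return true;
  }
  return false;
}

console.log(hasConsecutiveRepeat("the iconic black cover cover cover cover")); // true
console.log(hasConsecutiveRepeat("a complete, well-formed sentence"));         // false
```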

Test Results

Query: "ingliskeelne raamat"

Before:

AI: "This classic edition offers the iconic black cover cover cover cover"
  • Repetition
  • Cutoff
  • Poor UX

After:

AI: "Great, valisin sulle kaks suurepärast ingliskeelset raamatut:

1. **Czar's Madman**
See on ingliskeelne teos, mis kuulub ilukirjanduse kategooriasse...

2. **Lord Of The Rings II: Two Towers**
See ingliskeelne ilukirjandusteos on tõeline klassik..."
(English: "Great, I picked two excellent English-language books for you: 1. Czar's Madman - this is an English-language work in the fiction category... 2. Lord Of The Rings II: Two Towers - this English-language work of fiction is a true classic...")
  • NO repetition
  • Complete sentences
  • Proper 3-4 sentence descriptions

Safety net status: repetition detection was NOT triggered, which confirms the root causes are fixed

Why Estonian Was More Affected

  1. Longer tokens: Estonian words = more tokens per semantic unit

    • "poetry book" (English) = 2-3 tokens
    • "luuleraamat" (Estonian) = 3-4 tokens
  2. Truncation hit harder: 80-char limit cuts more meaning

  3. Less training data: GPT models have less Estonian

    • English: Robust under pressure
    • Estonian: More fragile, degrades faster

After the fixes, Estonian responses are as good as the English ones.