Estonian Text Quality & Repetition Prevention

The text repetition problem ("cover cover cover cover") revealed critical issues with data truncation and contradictory instructions that affected Estonian more severely.

The Problem

Scenario: LLM responses contained repetitive text:

AI: "This classic edition offers the iconic black cover cover cover cover"

Why critical:

  • Destroys user trust
  • Appears broken/buggy
  • Estonian MORE affected (less training data)

Root Causes

Root Cause 1: Severe Data Truncation

Location: response-orchestrator.ts:230-241

The Problem:

```typescript
// BEFORE (starving the AI of context)
const titleShort = (p.title || '').slice(0, 60).trim(); // 60 chars
const descShort = descRaw
  ? (descRaw.length > 80 ? `${descRaw.slice(0, 80)}…` : descRaw) // 80 chars
  : '';
```

Why this caused repetition:

Product description: 
"This classic edition offers the iconic black cover with ring symbol and features..."

Truncated to:
"This classic edition offers the iconic black cover with ring symbol and featur…"

LLM sees: Incomplete thought ending with "featur…"
LLM tries to: Complete the thought, repeats "cover"

Estonian Impact:

Estonian words are longer on average:

  • English: "book" (4 chars)
  • Estonian: "raamat" (6 chars)
  • English: "poetry" (6 chars)
  • Estonian: "luulekogu" (9 chars)

→ 80-character truncation cuts MORE semantic content in Estonian
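A quick sketch makes this concrete (the word pair and the 15-character budget are illustrative stand-ins for the real 80-character limit):

```typescript
// Sketch: the same character budget keeps fewer whole words in Estonian.
const english = "poetry book for teacher";
const estonian = "luulekogu õpetajale"; // same meaning, longer words

const cutEn = english.slice(0, 15);  // "poetry book for" - ends on a word boundary
const cutEt = estonian.slice(0, 15); // "luulekogu õpeta" - cut mid-word

console.log(cutEn, "|", cutEt);
```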

Root Cause 2: Contradictory Instructions

```
// The LLM receives ALL of these at once:
1. "Write detailed, conversion-focused descriptions" (encourages length)
2. "Up to 200 words total" (limits length)
3. "Be brief and precise" (contradicts #1)
4. "3-4 sentences per product" (moderate length)
```

Result: Confusion → Degenerate behavior → Repetition

Estonian Impact:

Estonian requires more words for same meaning:

  • English: "for teacher" (2 words)
  • Estonian: "õpetajale" (1 word, but 9 characters)
  • English sentence: 10-15 words average
  • Estonian sentence: 12-18 words average

→ Word count limits hit Estonian harder

Root Cause 3: Token Pressure

```typescript
max_tokens: 2500 // too tight for Estonian
```

Requirements:

  • Intro: ~50 tokens
  • 3 products × 3-4 sentences × ~25 tokens/sentence = 225-300 tokens
  • Metadata: ~50 tokens
  • Total needed: ~325-400 tokens
  • Comfortable: ~600 tokens

With Estonian:

  • Estonian tokens are longer
  • Same semantic content = more tokens
  • 2500 limit gets tight
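The budget math above can be sketched as a tiny estimator (the per-sentence cost and the Estonian multiplier are illustrative assumptions, not measured values):

```typescript
// Sketch: rough token budget for a 3-product response.
// Per-item costs are assumptions for illustration, not measurements.
const TOKENS_PER_SENTENCE = 25; // rough average for a short English sentence
const ESTONIAN_FACTOR = 1.5;    // Estonian needs more tokens per semantic unit

function estimateTokens(products: number, sentences: number, estonian: boolean): number {
  const intro = 50;
  const metadata = 50;
  const body = products * sentences * TOKENS_PER_SENTENCE;
  const total = intro + body + metadata;
  return Math.ceil(estonian ? total * ESTONIAN_FACTOR : total);
}

console.log(estimateTokens(3, 4, false)); // 400 - matches the figures above
console.log(estimateTokens(3, 4, true));  // 600 - the same content in Estonian
```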

Solutions

Fix 1: Remove Severe Truncation

```typescript
// AFTER (providing meaningful context)
const titleShort = (p.title || '').slice(0, 150).trim(); // 150 chars (2.5× more)
const descShort = descRaw
  ? (descRaw.length > 300 ? `${descRaw.slice(0, 300)}…` : descRaw) // 300 chars (3.75× more)
  : '';
```

Impact:

  • Titles: 60 → 150 chars (2.5× more context)
  • Descriptions: 80 → 300 chars (3.75× more context)
  • AI sees complete product information
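Even with the longer limits, a hard slice can still land mid-word. One way to avoid that (a sketch, not the project's actual code; `truncateAtWord` is a hypothetical helper) is to cut at the last word boundary and append an explicit ellipsis:

```typescript
// Sketch: truncate at a word boundary so the model never sees a half-word.
// truncateAtWord is a hypothetical helper; maxLen mirrors the new 300-char limit.
function truncateAtWord(text: string, maxLen: number): string {
  if (text.length <= maxLen) return text;
  const cut = text.slice(0, maxLen);
  const lastSpace = cut.lastIndexOf(" ");
  // Fall back to the hard cut if there is no space to break on.
  const safe = lastSpace > 0 ? cut.slice(0, lastSpace) : cut;
  return `${safe}…`;
}

console.log(truncateAtWord("See luulekogu sobib hästi õpetajale", 20)); // "See luulekogu sobib…"
```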

Fix 2: Eliminate Contradictions

```typescript
// BEFORE - Config-driven (DANGEROUS)
import { chatConfig } from '@/app/chat/config'; // REMOVED
const sentenceRange = chatConfig.sentencesPerProduct >= 4 ? '3-4' : '2-3';
const userSuffix = `(up to ${config.maxWords} words)`;

// AFTER - Hardcoded (SAFE)
const sentenceRange = '3–4'; // Hardcoded everywhere
// Removed ALL word count limits
// Removed conflicting brevity instructions
```

Critical changes:

  1. Removed chatConfig dependency ENTIRELY
  2. Hardcoded '3–4' in all locations
  3. Removed ALL maxWords references
  4. ONE clear instruction

Result: ZERO contradictions, ZERO runtime variations
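A minimal sketch of what the hardcoded path looks like (`buildProductInstruction` is an illustrative name, not the project's actual identifier):

```typescript
// Sketch: one fixed instruction, no config-driven variation.
// buildProductInstruction is a hypothetical name for illustration.
function buildProductInstruction(): string {
  const sentenceRange = "3–4"; // hardcoded: the single source of truth
  return `Describe each product in ${sentenceRange} complete sentences. ` +
    `Finish every sentence; never cut a thought short.`;
}

// Every call returns the identical instruction - no runtime variation.
console.log(buildProductInstruction());
```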

Fix 3: Increased Token Limit

```typescript
// BEFORE
max_tokens: 2500 // Too tight

// AFTER
max_tokens: 3000 // 20% buffer
```

Calculation:

  • Intro: ~150 tokens
  • 3 products × 3-4 sentences × ~80 tokens = 720-960 tokens
  • Total needed: ~900-1100 tokens
  • New limit: 3000 (comfortable buffer for Estonian)

Fix 4: Repetition Detection (Safety Net)

```typescript
// Detection thresholds
const CONSECUTIVE_THRESHOLD = 3; // "cover cover cover"
const REPETITION_THRESHOLD = 3;  // frequent repetition within the window
const REPETITION_WINDOW = 20;    // look at the last 20 words

// In the streaming loop
const lastWords = accumulatedWords.slice(-REPETITION_WINDOW);
const wordCounts = new Map<string, number>();

for (const word of lastWords) {
  wordCounts.set(word, (wordCounts.get(word) || 0) + 1);

  if (wordCounts.get(word)! >= REPETITION_THRESHOLD) {
    console.warn('[REPETITION] Detected:', word);
    break; // stop streaming
  }
}
```
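The loop above counts word frequency inside the window; the declared CONSECUTIVE_THRESHOLD suggests a second, run-based check for immediate repeats. A self-contained sketch of that check, assuming simple whitespace tokenization:

```typescript
// Sketch: detect N identical words in a row ("cover cover cover").
function hasConsecutiveRepeat(text: string, threshold = 3): boolean {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  let run = 1;
  for (let i = 1; i < words.length; i++) {
    run = words[i] === words[i - 1] ? run + 1 : 1;
    if (run >= threshold) return true;
  }
  return false;
}

console.log(hasConsecutiveRepeat("the iconic black cover cover cover cover")); // true
console.log(hasConsecutiveRepeat("a complete, well-formed sentence"));         // false
```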

Test Results

Query: "ingliskeelne raamat"

Before:

AI: "This classic edition offers the iconic black cover cover cover cover"
  • Repetition
  • Cutoff
  • Poor UX

After:

AI: "Great, valisin sulle kaks suurepärast ingliskeelset raamatut:

1. **Czar's Madman**
See on ingliskeelne teos, mis kuulub ilukirjanduse kategooriasse...

2. **Lord Of The Rings II: Two Towers**
See ingliskeelne ilukirjandusteos on tõeline klassik..."
(English: "Great, I picked two excellent English-language books for you: 1. Czar's Madman - this is an English-language work in the fiction category... 2. Lord Of The Rings II: Two Towers - this English-language work of fiction is a true classic...")
  • NO repetition
  • Complete sentences
  • Proper 3-4 sentence descriptions

Safety net status: repetition detection was NOT triggered, which confirms the root causes are fixed

Why Estonian Was More Affected

  1. Longer tokens: Estonian words = more tokens per semantic unit

    • "poetry book" (English) = 2-3 tokens
    • "luuleraamat" (Estonian) = 3-4 tokens
  2. Truncation hit harder: 80-char limit cuts more meaning

  3. Less training data: GPT models have less Estonian

    • English: Robust under pressure
    • Estonian: More fragile, degrades faster

After the fixes, Estonian responses are as good as the English ones.