Estonian Compound Words Problem

The compound word detection failure was the most critical Estonian-specific bug encountered, causing an 80% failure rate (4 of 5 Valentine's scenarios) in gift queries that explicitly asked for a book.

The Problem

Scenario

User Query:

"Otsin luuleraamatut sõbrapäevakingiks"
Translation: "I'm looking for a poetry book as a Valentine's Day gift"

Expected:

  • Detect "luuleraamatut" as explicit book request
  • Show poetry books
  • Type: Raamat

Actual (Before Fix):

  • LLM correctly detected: productType: "Raamat"
  • System added constraint: "Ei raamatuid" (No books)
  • Showed: Scented candles instead of books
  • Success rate: 20% (1/5 Valentine's scenarios)

Root Cause

The Paradox

The LLM correctly understood the Estonian compound word and set productType: "Raamat", but the system still excluded books.

Why?

Three-part failure:

  1. Keyword-based filtering used a hardcoded list that contained no compound forms
  2. The gift-occasion override ran AFTER LLM extraction and replaced its result
  3. The LLM's book signal was lost in later pipeline steps

Original Keyword List

const BOOK_KEYWORDS = [
  'raamat',
  'raamatu',
  'raamatut',
  'raamatuid',
  'raamatud',
  'book',
  'books',
  'novel',
  'novels'
];
// Only 9 keywords!

Problem: Missing ALL compound words!
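Why a nine-entry list misses compounds depends on how the matching is done, which the source does not show. The sketch below assumes whole-token comparison; `tokenize` and `matchesAnyKeyword` are hypothetical names used only for illustration. Under that assumption, "luuleraamatut" is a single token that equals none of the nine entries:

```typescript
// Hypothetical reconstruction: assumes keywords are compared against whole tokens.
const ORIGINAL_BOOK_KEYWORDS: readonly string[] = [
  'raamat', 'raamatu', 'raamatut', 'raamatuid', 'raamatud',
  'book', 'books', 'novel', 'novels',
];

// Lowercase the message and split it on whitespace and common punctuation.
function tokenize(message: string): string[] {
  return message
    .toLowerCase()
    .split(/[\s.,!?;:"'()]+/)
    .filter((token) => token.length > 0);
}

// True when at least one token is exactly equal to a keyword.
function matchesAnyKeyword(message: string, keywords: readonly string[]): boolean {
  const tokens = tokenize(message);
  return keywords.some((keyword) => tokens.indexOf(keyword) !== -1);
}

// The bare word is found, the compound is not.
const plainQuery = matchesAnyKeyword('Otsin raamatut kingiks', ORIGINAL_BOOK_KEYWORDS);
const compoundQuery = matchesAnyKeyword(
  'Otsin luuleraamatut sõbrapäevakingiks',
  ORIGINAL_BOOK_KEYWORDS
);
```

Here `plainQuery` is `true` and `compoundQuery` is `false`, which matches the failure described above.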

Solution: Expanded Keywords

Location: extract-context.ts:30-73

const BOOK_KEYWORDS = [
  // Original keywords, plus 'romaan' / 'romaani' (novel)
  'raamat', 'raamatu', 'raamatut', 'raamatuid', 'raamatud',
  'book', 'books', 'novel', 'novels', 'romaan', 'romaani',

  // Poetry books (3 cases each)
  'luuleraamat', 'luuleraamatu', 'luuleraamatut',

  // Mystery books
  'krimiraamat', 'krimiraamatu', 'krimiraamatut',

  // Sci-fi books
  'ulmeraamat', 'ulmeraamatu', 'ulmeraamatut',

  // Fantasy books
  'fantaasiaraamat', 'fantaasiaraamatu', 'fantaasiaraamatut',

  // Cookbooks
  'kokaraamat', 'kokaraamatu', 'kokaraamatut',

  // Children's books
  'lasteraamat', 'lasteraamatu', 'lasteraamatut',

  // Travel books
  'reisiraamat', 'reisiraamatu', 'reisiraamatut',

  // Textbooks
  'õpperaamat', 'õpperaamatu', 'õpperaamatut'
] as const;

Result:

  • From 9 keywords to 35 keywords in the list shown above (11 base forms plus 8 compounds × 3 cases)
  • Covers most common compound book types
  • Includes 3 main cases per compound
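Because every compound follows the same prefix + "raamat" pattern, the compound entries could also be generated rather than typed by hand. This is a sketch of that idea, not how extract-context.ts is known to build its list; `buildCompoundKeywords`, `COMPOUND_PREFIXES`, and `BOOK_CASE_FORMS` are hypothetical names.

```typescript
// Hypothetical helper: generates prefix + case-form compounds.
// Nominative, genitive, and partitive singular of "raamat" (book).
const BOOK_CASE_FORMS: readonly string[] = ['raamat', 'raamatu', 'raamatut'];

// The eight prefixes that appear in the expanded list above.
const COMPOUND_PREFIXES: readonly string[] = [
  'luule', 'krimi', 'ulme', 'fantaasia', 'koka', 'laste', 'reisi', 'õppe',
];

function buildCompoundKeywords(
  prefixes: readonly string[],
  caseForms: readonly string[]
): string[] {
  const keywords: string[] = [];
  for (const prefix of prefixes) {
    for (const form of caseForms) {
      keywords.push(prefix + form);
    }
  }
  return keywords;
}

// 8 prefixes x 3 case forms = 24 compound keywords.
const COMPOUND_BOOK_KEYWORDS = buildCompoundKeywords(COMPOUND_PREFIXES, BOOK_CASE_FORMS);
```

Adding a new book type then means adding one prefix instead of three strings, at the cost of the list no longer being a literal `as const` tuple.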

Enhanced Detection Function

Location: extract-context.ts:375-414

const wantsBooksFromContext = (
  context: GiftContext,
  userMessage?: string
): boolean => {
  // Check 1: LLM signals
  const productTypeIsBook = isBookSignal(context.productType);
  const categoryIsBook = isBookSignal(context.category);
  const productTypeHintsHaveBooks = context.productTypeHints?.some(isBookSignal);
  const categoryHintsHaveBooks = context.categoryHints?.some(isBookSignal);

  const wantsBooksFromSignals =
    productTypeIsBook ||
    categoryIsBook ||
    productTypeHintsHaveBooks ||
    categoryHintsHaveBooks;

  // Check 2: CRITICAL - Also check user message
  const normalizedMessage = normalizeMessage(userMessage);
  const messageHasBookKeyword = messageContainsKeyword(
    normalizedMessage,
    BOOK_KEYWORDS
  );

  // Check 3: Negative phrase detection
  const negativeBookPhrase = findNegativeBookPhrase(normalizedMessage);

  // Final decision: Want books if no negative phrase AND (signals OR keywords)
  const wantsBooks = negativeBookPhrase
    ? false
    : (wantsBooksFromSignals || messageHasBookKeyword);

  return wantsBooks;
};

Key Improvement: Now checks BOTH:

  1. LLM-extracted signals (productType, category, hints)
  2. User message keywords (catches compounds LLM might miss)
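The function above calls four helpers whose bodies are not shown in the source. The following is a minimal sketch of what they might look like. Every implementation here is an assumption, including the shortened keyword list and the negative phrases in `NEGATIVE_BOOK_PHRASES`.

```typescript
// Hypothetical implementations: the source names these helpers but does not show them.
const SKETCH_BOOK_KEYWORDS: readonly string[] = [
  'raamat', 'raamatu', 'raamatut',
  'luuleraamat', 'luuleraamatu', 'luuleraamatut',
];

// Assumed negative phrases; the real list is not given in the source.
const NEGATIVE_BOOK_PHRASES: readonly string[] = ['ei raamatuid', 'no books'];

// Treat an LLM field as a book signal when it names the "Raamat" product type.
function isBookSignal(value?: string): boolean {
  return typeof value === 'string' && value.trim().toLowerCase() === 'raamat';
}

// Lowercase, replace common punctuation with spaces, collapse whitespace.
function normalizeMessage(message?: string): string {
  return (message || '')
    .toLowerCase()
    .replace(/[.,!?;:"'()]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Whole-token match of the normalized message against the keyword list.
function messageContainsKeyword(normalized: string, keywords: readonly string[]): boolean {
  const tokens = normalized.split(' ');
  return keywords.some((keyword) => tokens.indexOf(keyword) !== -1);
}

// Return the first negative phrase contained in the message, or null if none is found.
function findNegativeBookPhrase(normalized: string): string | null {
  for (const phrase of NEGATIVE_BOOK_PHRASES) {
    if (normalized.indexOf(phrase) !== -1) return phrase;
  }
  return null;
}

const normalized = normalizeMessage('Otsin luuleraamatut sõbrapäevakingiks!');
const hasKeyword = messageContainsKeyword(normalized, SKETCH_BOOK_KEYWORDS);
const negative = findNegativeBookPhrase(normalizeMessage('Palun ei raamatuid.'));
```

With these definitions, `hasKeyword` is `true` for the Valentine's query even if every LLM field were empty, and `negative` holds the matched phrase for a message that rejects books.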

Safety Override

Location: extract-context.ts:492-497, 685-697

// Preserve initial book signals
const initialBookSignalsPresent =
  isBookSignal(giftContext.productType) ||
  isBookSignal(giftContext.category) ||
  giftContext.productTypeHints?.some(isBookSignal) ||
  giftContext.categoryHints?.some(isBookSignal);

// ... later in processing ...

// SAFETY OVERRIDE: Prevent false negatives
if (!wantsBooks && (initialBookSignalsPresent || messageHasBookKeyword)) {
  console.warn('[SAFETY] Overriding wantsBooks=false', {
    initialBookSignalsPresent,
    messageHasBookKeyword
  });
  wantsBooks = true; // Force to true
}

Why: Even if intermediate steps corrupt the context, the safety override restores the explicit book signal captured at the start.
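The override condition can also be written as a pure function, which makes it easy to unit-test in isolation. This is a sketch only: `BookDecisionInputs` and `applyBookSafetyOverride` are hypothetical names, and the logic simply mirrors the `if` statement in the snippet above.

```typescript
// Hypothetical wrapper around the override condition shown above.
interface BookDecisionInputs {
  wantsBooks: boolean;                // value after intermediate processing
  initialBookSignalsPresent: boolean; // captured before processing started
  messageHasBookKeyword: boolean;     // keyword hit in the raw user message
}

// Returns the final flag: forced to true when an explicit signal was lost.
function applyBookSafetyOverride(inputs: BookDecisionInputs): boolean {
  const explicitSignal =
    inputs.initialBookSignalsPresent || inputs.messageHasBookKeyword;
  if (!inputs.wantsBooks && explicitSignal) {
    return true;
  }
  return inputs.wantsBooks;
}

// Signal lost mid-pipeline: the override restores it.
const restored = applyBookSafetyOverride({
  wantsBooks: false,
  initialBookSignalsPresent: true,
  messageHasBookKeyword: false,
});

// No signal anywhere: the flag stays false.
const untouched = applyBookSafetyOverride({
  wantsBooks: false,
  initialBookSignalsPresent: false,
  messageHasBookKeyword: false,
});
```

Note that the condition, as shown, does not consult `negativeBookPhrase`. Whether the real code guards against a message that mentions books only to reject them is not visible in the source.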

Removed Problematic Override

What was removed:

// REMOVED: This overrode LLM decisions
if (ROMANTIC_OCCASIONS.includes(occasion)) {
  if (!explicitlyRequestedBooks) {
    queryRewrite.excludeProductTypes = ['Raamat'];
    queryRewrite.forceProductTypes = ['Kingitused'];
  }
}

Why it was wrong:

  • Assumed Valentine's = never books
  • But users DO want poetry books for Valentine's!
  • Overrode LLM's correct understanding

Test Results

Query: "Otsin luuleraamatut sõbrapäevakingiks"

Before Fix:

  • Constraints: "Ei raamatuid"
  • Showed: Scented candles
  • Success: 20%

After Fix:

  • Constraints: "affordable, romantic"
  • Showed: Romantic poetry books
  • Success: 80%

Improvement: 20% → 80% success rate

Remaining Limitations

Rare Compounds Not in List

// Not yet in list:
"biografiaraamat" (biography book)
"ajalooraamat" (history book)
"filosoofiaraamat" (philosophy book)

Mitigation: LLM signal checking still works as backup

Typos in Compounds

"luleraamatut" (missing 'u')
"luuleraamaut" (missing 't')

Mitigation: LLM often corrects these internally
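If relying on the LLM alone proves too fragile, one possible complement is a small edit-distance check against the keyword list. The source does not describe this technique. `editDistance`, `isNearKeyword`, and the length threshold are assumptions made for illustration.

```typescript
// Hypothetical typo tolerance: standard Levenshtein distance between two strings.
function editDistance(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);

  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      current.push(
        Math.min(
          previous[j] + 1,                   // delete a character from a
          current[j - 1] + 1,                // insert a character into a
          previous[j - 1] + substitutionCost // substitute (or keep) a character
        )
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Only long keywords are compared fuzzily, so short words such as
// "book" do not start matching "cook". The threshold of 8 is an assumption.
const MIN_FUZZY_KEYWORD_LENGTH = 8;

function isNearKeyword(token: string, keywords: readonly string[]): boolean {
  return keywords.some(
    (keyword) =>
      keyword.length >= MIN_FUZZY_KEYWORD_LENGTH &&
      editDistance(token, keyword) <= 1
  );
}

const typoMissingU = isNearKeyword('luleraamatut', ['luuleraamatut']);
const typoMissingT = isNearKeyword('luuleraamaut', ['luuleraamatut']);
```

Both typos listed above are one edit away from "luuleraamatut", so both checks return `true` under this sketch.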

Case Endings Beyond Main Three

// We only added nominative/genitive/partitive
// Missing rarer cases like:
"luuleraamatusse" (illative)
"luuleraamatus" (inessive)

Mitigation: These cases are rare in gift queries
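Both the rare-compound gap and the case-ending gap come from enumerating full word forms. A possible alternative, not described in the source, is to test whether a token ends in the stem "raamat" plus an optional case ending. `BOOK_STEM_PATTERN` and `tokenLooksLikeBook` are hypothetical names, and the ending list covers only a selection of singular and plural forms.

```typescript
// Hypothetical stem matcher: token must END in "raamat" plus an optional ending.
// Anchoring at the end avoids words that merely start with the stem,
// such as "raamatupidaja" (bookkeeper).
const BOOK_STEM_PATTERN = /raamat(u|ut|ud|uid|usse|us|ust|ule|ul|ult|uks|uga)?$/;

function tokenLooksLikeBook(token: string): boolean {
  return BOOK_STEM_PATTERN.test(token.toLowerCase());
}

const rareCompound = tokenLooksLikeBook('ajalooraamat');    // compound not in the list
const illativeForm = tokenLooksLikeBook('luuleraamatusse'); // rarer case ending
const bookkeeper = tokenLooksLikeBook('raamatupidaja');     // starts with the stem only
```

Under this sketch `rareCompound` and `illativeForm` are `true` while `bookkeeper` is `false`. A pattern like this trades the explicit, auditable list for broader coverage, so it would need its own false-positive testing before replacing the keyword list.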