
High-Level Architecture

Overview

This document provides a comprehensive overview of how the Agentic RAG Gift Recommendation System processes user messages from HTTP request to personalized product recommendations. The system combines LLM-powered context understanding, intelligent routing, multi-strategy search, and streaming response generation to deliver a ChatGPT-like gift shopping experience.

Core Goal: Transform conversational queries like "I need a gift for my mother's birthday under 50 euro" into personalized, relevant product recommendations in under 1 second.


System Architecture Overview


📋 Request Lifecycle (Happy Path)

Step-by-Step Flow


🔧 Key Components Deep Dive

1. HTTP Entry Point (route.ts)

Purpose: HTTP handler for /api/chat endpoint

Responsibilities:

  • Parse incoming requests
  • Validate API keys and environment
  • Route to appropriate orchestrator (Parallel vs Sequential)
  • Handle CORS and error responses
  • Apply response headers

Flow:

Location: app/api/chat/route.ts
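The responsibilities above reduce to a short decision sequence. A minimal sketch of that logic, with framework wiring (NextRequest/NextResponse) omitted — `decideRoute` and its helpers are illustrative names, though `PARALLEL_EXECUTION_ENABLE` is the real flag from the configuration section:

```typescript
// Sketch of the /api/chat entry point's routing decision. Not the
// actual implementation: validation details and CORS handling omitted.

interface ChatRequestBody {
  messages: { role: string; content: string }[];
  conversationId?: string;
}

interface RouteDecision {
  orchestrator: "parallel" | "sequential";
  error?: string;
}

function decideRoute(
  body: ChatRequestBody | null,
  env: Record<string, string | undefined>
): RouteDecision {
  // 1. Validate the parsed request body
  if (!body || !Array.isArray(body.messages) || body.messages.length === 0) {
    return { orchestrator: "sequential", error: "Invalid request: messages required" };
  }
  // 2. Validate required API keys before doing any work
  if (!env.OPENAI_API_KEY || !env.GROQ_API_KEY) {
    return { orchestrator: "sequential", error: "Server misconfigured: missing API keys" };
  }
  // 3. Route to the parallel orchestrator when the feature flag is on
  const parallel = env.PARALLEL_EXECUTION_ENABLE === "true";
  return { orchestrator: parallel ? "parallel" : "sequential" };
}
```

The real handler additionally applies CORS and response headers before handing off to the chosen orchestrator.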


2. Parallel Orchestrator (ParallelOrchestrator)

Purpose: Optimized request flow with skeleton responses for sub-800ms Time to First Chunk (TTFC)

Key Features:

  • Skeleton Response: Emits initial response in <100ms while processing continues
  • Parallel Execution: Context extraction and response generation run concurrently
  • Smart Routing: Intent-based handler selection
  • Validation Gates: Nonsense detection, query validation

Flow:

Performance Benefits:

  • TTFC: <100ms (skeleton) vs 800-1200ms (sequential)
  • Total Time: ~1200-1500ms (same as sequential, but feels faster)
  • UX: Immediate feedback, progressive loading

Location: app/api/chat/orchestrators/parallel-orchestrator.ts
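The skeleton-first pattern can be sketched as an async generator. In the real orchestrator search depends on the extracted context; the stubbed version below only shows the concurrency shape, with illustrative names throughout:

```typescript
// Sketch: emit an immediate skeleton chunk, run extraction and search
// concurrently, then emit real content when both resolve.

type Chunk = { type: "skeleton" | "text"; content: string };

async function* orchestrate(
  query: string,
  extract: (q: string) => Promise<{ intent: string }>,
  search: (q: string) => Promise<string[]>
): AsyncGenerator<Chunk> {
  // 1. Skeleton chunk goes out in <100ms, before any LLM call returns
  yield { type: "skeleton", content: "…" };

  // 2. Extraction and search start concurrently, not one after the other
  const [ctx, products] = await Promise.all([extract(query), search(query)]);

  // 3. Real content replaces the skeleton as it becomes available
  yield { type: "text", content: `intent=${ctx.intent}, ${products.length} products` };
}
```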


3. Context Understanding System

Purpose: Extract structured GiftContext from natural language queries

Three-Stage Pipeline:

Stage 1: Deterministic Bypass

When: Obvious product keywords detected (e.g., "gift card", "kinkekaart")

Benefits:

  • Zero LLM cost
  • ~0ms latency
  • 100% accuracy for known patterns

Example:

  • Query: "show me gift cards"
  • Bypass: Match kinkekaart pattern
  • Result: productType: "Kinkekaart", confidence: 1.0
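The bypass stage can be sketched as a small pattern table; the regexes here are illustrative and the production list is larger:

```typescript
// Sketch of the deterministic bypass: keyword regexes mapped straight
// to a productType, skipping the LLM entirely.

interface BypassResult { productType: string; confidence: number }

const BYPASS_PATTERNS: [RegExp, string][] = [
  [/\b(kinkekaart|gift\s*card)/i, "Kinkekaart"],
  [/\b(raamat|book)/i, "Raamat"],
  [/\b(mäng|game)/i, "Mängud"],
];

function tryBypass(query: string): BypassResult | null {
  for (const [pattern, productType] of BYPASS_PATTERNS) {
    // Known keyword → fixed productType with full confidence, zero cost
    if (pattern.test(query)) return { productType, confidence: 1.0 };
  }
  return null; // no match → fall through to the fast classifier
}
```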

Stage 2: Fast Classifier

When: No bypass match, no author/pronoun patterns

Model: meta-llama/llama-4-scout-17b-16e-instruct (Groq)

Timeout: 4 seconds

Fast Path Intents:

  • show_more_products
  • greeting
  • question
  • cheaper_alternatives
  • budget_alternatives
  • purchase_confirmation
  • Occasion-specific intents (Valentine's, Mother's Day, etc.)

Confidence Threshold: ≥ 0.1

Flow:
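The accept/fall-through decision can be sketched as follows; the threshold and allowlist follow the values above, while `acceptFastPath` is a hypothetical name:

```typescript
// Sketch: the fast classifier's result is accepted only when the intent
// is on the fast-path allowlist and confidence clears the threshold;
// otherwise the query falls through to the main extractor.

const FAST_PATH_INTENTS = new Set([
  "show_more_products", "greeting", "question",
  "cheaper_alternatives", "budget_alternatives", "purchase_confirmation",
]);

const CONFIDENCE_THRESHOLD = 0.1;

function acceptFastPath(result: { intent: string; confidence: number } | null): boolean {
  if (!result) return false; // classifier timed out or failed
  return FAST_PATH_INTENTS.has(result.intent) && result.confidence >= CONFIDENCE_THRESHOLD;
}
```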


Stage 3: Main Extractor

When: Fast path fails, or author/pronoun detected

Model: meta-llama/llama-4-scout-17b-16e-instruct (Groq)

Timeout: 25 seconds

Features:

  • Enhanced Semantic Prompt: Richer examples, pronoun resolution
  • Conversation State Injection: Resolves "tema" → actual author name
  • Full Context: Complete intent taxonomy, constraints, age detection
  • Hierarchical Refinement: Optional category refinement when low confidence

Output: Complete GiftContext with all fields populated


4. GiftContext Structure

Purpose: Structured representation of user intent

interface GiftContext {
  // Core Intent
  intent: string; // "product_search", "author_search", etc.
  confidence: number; // 0.0 - 1.0

  // Gift Context
  occasion?: string; // "sünnipäev", "valentinipäev", etc.
  recipient?: string; // "ema", "sõber", "kolleeg", etc.
  recipientGender?: 'male' | 'female' | 'unisex' | 'unknown';
  ageGroup?: 'child' | 'teen' | 'adult' | 'elderly' | 'unknown';
  ageBracket?: AgeBracket; // Fine-grained age ranges
  recipientAge?: number;

  // Product Signals
  productType?: string; // "Raamat", "Mängud", "Kinkekaart", etc.
  category?: string; // Specific category
  productTypeHints?: string[]; // Multiple type suggestions
  categoryHints?: string[]; // Multiple category suggestions

  // Budget
  budget?: {
    min?: number;
    max?: number;
    hint?: string; // "affordable", "luxury", etc.
  };

  // Constraints
  constraints?: string[]; // ["MITTE raamat", "eco-friendly", etc.]

  // Author/Book Context
  authorName?: string;
  bookLanguage?: 'et' | 'en';

  // Metadata
  language: 'et' | 'en' | 'mixed';
  isPopularQuery?: boolean;
  timestamp?: number;
  meta?: GiftContextMeta; // Telemetry
}

5. Handler Router

Purpose: Route requests to appropriate handler based on intent and context

Routing Logic:

Handlers:

| Handler | Intent Types | Purpose |
| --- | --- | --- |
| ProductSearchHandler | product_search, author_search, occasion intents | Execute product search |
| ConversationalHandler | greeting, question, thank_you | Non-product responses |
| ClarifyingQuestionHandler | Low-confidence queries | Ask clarifying questions |
| ProductInquiryHandler | product_inquiry | Answer questions about specific products |
| ShowMoreHandler | show_more_products | Pagination, more results |

Location: app/api/chat/handlers/handler-router.ts
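The routing table above can be sketched as a selection function; handler names follow the table, while the exact precedence rules are illustrative:

```typescript
// Sketch of intent → handler selection. Low-confidence queries without
// product signals go to clarification before intent-based dispatch.

function selectHandler(intent: string, confidence: number, hasProductSignals: boolean): string {
  if (confidence < 0.5 && !hasProductSignals) return "ClarifyingQuestionHandler";
  switch (intent) {
    case "greeting":
    case "question":
    case "thank_you":
      return "ConversationalHandler";
    case "show_more_products":
      return "ShowMoreHandler";
    case "product_inquiry":
      return "ProductInquiryHandler";
    default:
      // product_search, author_search, occasion intents
      return "ProductSearchHandler";
  }
}
```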


6. Search Orchestrator

Purpose: Execute multi-strategy product search with quality filters

Search Pipeline:

Query Rewriting

Purpose: Generate multiple search queries to maximize recall

Strategies:

  • Specific queries: Focus on detected product type/category
  • Exploratory queries: Broaden to related categories
  • Occasion-specific: Add occasion context (e.g., "birthday gift")
  • Recipient-specific: Add recipient context (e.g., "for mother")
  • Fallback queries: Generic gift searches as backup

Example:

Input: "Gift for mother's birthday under 50 euro"

Variations:

  1. "sünnipäevakingitus emale" (specific, Estonian)
  2. "kingitus emale" (broader)
  3. "birthday gift for mother" (English)
  4. "emadepäeva kingitus" (occasion alternative)
  5. "naistele kingitus" (gender-based fallback)
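The variation strategies can be sketched as a template-based rewriter. This naive concatenation is purely illustrative; the real rewriter handles Estonian morphology and mixed-language output:

```typescript
// Sketch: build ordered query variations from the extracted context,
// from most specific to generic fallback.

interface RewriteContext { occasion?: string; recipient?: string }

function rewriteQueries(ctx: RewriteContext): string[] {
  const queries: string[] = [];
  // Specific: occasion + recipient
  if (ctx.occasion && ctx.recipient) queries.push(`${ctx.occasion} kingitus ${ctx.recipient}le`);
  // Broader: recipient only
  if (ctx.recipient) queries.push(`kingitus ${ctx.recipient}le`);
  // Generic fallback always comes last
  queries.push("kingitus");
  return queries;
}
```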

Search Execution

Purpose: Execute multiple search strategies in parallel

Search Types:

Vector Search (Convex):

  • Uses embeddings for semantic similarity
  • Good for: "romantic gift", "practical gift", concepts
  • Model: OpenAI text-embedding-3-small

Text Search (Convex):

  • Keyword matching on title, category, product type
  • Good for: Specific products, brands, exact matches

Hybrid Search:

  • Combines vector + text with weighted scoring
  • Vector weight: 0.7, Text weight: 0.3
  • Best of both worlds
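The weighted combination can be sketched as a score merge using the 0.7/0.3 weights above; products found by only one strategy score 0 on the missing side:

```typescript
// Sketch of hybrid score merging: weighted sum of vector and text
// scores per product id, sorted descending.

interface Scored { id: string; score: number }

function hybridMerge(vector: Scored[], text: Scored[]): Scored[] {
  const merged = new Map<string, number>();
  for (const v of vector) merged.set(v.id, 0.7 * v.score);
  for (const t of text) merged.set(t.id, (merged.get(t.id) ?? 0) + 0.3 * t.score);
  return [...merged.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```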

Filtering & Diversity

Filters Applied:

  1. Exclusion Filter: Remove previously shown products
  2. Budget Filter: price >= budget.min && price <= budget.max
  3. Language Filter: For books, filter by bookLanguage
  4. Constraint Filter: Apply negative constraints (MITTE raamat)
  5. Author Filter: For author searches, match author name
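The filter chain can be sketched as successive `Array.filter` passes; field and option names are illustrative:

```typescript
// Sketch of the candidate filter chain: exclusion → budget → language.
// Constraint and author filters follow the same pattern.

interface Candidate { id: string; price: number; language?: string }

function applyFilters(
  candidates: Candidate[],
  opts: {
    excludedIds?: Set<string>;
    budget?: { min?: number; max?: number };
    bookLanguage?: string;
  }
): Candidate[] {
  return candidates
    // 1. Exclusion filter: drop previously shown products
    .filter((c) => !opts.excludedIds?.has(c.id))
    // 2. Budget filter: keep prices inside [min, max]
    .filter((c) => (opts.budget?.min ?? 0) <= c.price && c.price <= (opts.budget?.max ?? Infinity))
    // 3. Language filter (applied to books in the real system)
    .filter((c) => !opts.bookLanguage || !c.language || c.language === opts.bookLanguage);
}
```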

Diversity Enhancements:

  1. Category Diversity: Mix of product types (not all books)
  2. Price Diversity: Spread across budget range (low/mid/high)
  3. Gender Boost: Prioritize gender-appropriate products
  4. Freshness: Include some newer products

Example:

Query: "Valentine gifts under 100 euro for girlfriend"

Without Diversity:

  • 5× Romantic novels (all books, 15-20€)

With Diversity:

  • 1× Romantic novel (18€)
  • 1× Scented candle set (32€)
  • 1× Jewelry (45€)
  • 1× Spa gift set (28€)
  • 1× Chocolate box (15€)
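Category diversity can be sketched as round-robin picking across category buckets; this is an illustrative stand-in for the production diversity layer, which also balances price and freshness:

```typescript
// Sketch: group candidates by category, then take one item per category
// per round so no single product type dominates the final list.

interface Product { id: string; category: string }

function diversify(products: Product[], limit: number): Product[] {
  const byCategory = new Map<string, Product[]>();
  for (const p of products) {
    const bucket = byCategory.get(p.category);
    if (bucket) bucket.push(p);
    else byCategory.set(p.category, [p]);
  }
  const picked: Product[] = [];
  while (picked.length < limit) {
    let added = false;
    for (const bucket of byCategory.values()) {
      const next = bucket.shift();
      if (next !== undefined) {
        picked.push(next);
        added = true;
        if (picked.length === limit) break;
      }
    }
    if (!added) break; // all buckets exhausted
  }
  return picked;
}
```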

Semantic Reranking

Purpose: Final quality ranking based on semantic relevance

Provider: Cohere Rerank API (fallback: OpenAI embeddings)

How It Works:

  1. Take top 20-30 candidates from search
  2. Send query + product titles to reranking API
  3. Get relevance scores (0-1)
  4. Sort by score, return top 5-7
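The four steps can be sketched generically. `scoreRelevance` stands in for the Cohere Rerank call (or the embedding-similarity fallback), so only the pool → score → sort → truncate shape is shown:

```typescript
// Sketch of the rerank step: cap the candidate pool, score titles
// against the query, sort by relevance, return the top k.

async function rerank<T extends { title: string }>(
  query: string,
  candidates: T[],
  scoreRelevance: (query: string, titles: string[]) => Promise<number[]>,
  topK = 5
): Promise<T[]> {
  const pool = candidates.slice(0, 30); // cap the pool sent to the API
  const scores = await scoreRelevance(query, pool.map((c) => c.title));
  return pool
    .map((c, i) => ({ c, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((x) => x.c);
}
```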

Benefits:

  • Better relevance than pure keyword matching
  • Understands nuanced queries ("romantic but practical")
  • Cross-lingual matching (Estonian query → English products)

Location: app/api/chat/services/search/


7. Response Orchestrator

Purpose: Generate streaming chat responses with product card injection

Response Pipeline:

System Prompt Generation

Purpose: Dynamic prompt based on context and products

Components:

  1. Base Persona: Gift recommendation expert, friendly, Estonian/English
  2. Product Context: Injected product details (title, price, category)
  3. User Context: Occasion, recipient, budget from GiftContext
  4. Conversation History: Recent messages for continuity
  5. Constraints: Apply user preferences (e.g., "no books")

Example:

// Generated for "Gift for mother's birthday under 50 euro"
{
  system: `You are a friendly Estonian gift recommendation expert.

User Context:
- Occasion: Birthday (sünnipäev)
- Recipient: Mother (ema)
- Budget: Under 50 euro

Available Products:
1. "Kaunid lillevaas" - 32.99€ (Home & Garden)
2. "Lõhnaküünal lavendel" - 18.50€ (Candles)
3. "Kinkeraamat 'Südamega kokk'" - 24.99€ (Books)
...

Provide personalized recommendations explaining why each gift suits the recipient.`,

  temperature: 0.8,
  model: "gpt-4o"
}

Product Card Injection

Purpose: Insert structured product cards into streaming response

Format:

{
  "type": "product",
  "id": "product_123",
  "title": "Kaunid lillevaas",
  "price": 32.99,
  "category": "Kodu ja aed",
  "image": "https://...",
  "url": "https://...",
  "reasoning": "Perfect for mother who loves flowers..."
}

Injection Point: After initial explanation text, before closing remarks

Benefits:

  • Structured data for frontend rendering
  • Seamless integration with streaming text
  • Progressive loading (products appear as stream progresses)
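The injection pattern can be sketched as a generator that interleaves text and product chunks; the chunk shape follows the format above, while the generator itself is illustrative:

```typescript
// Sketch of mid-stream product injection: intro text first, structured
// product cards in the middle, closing text last.

type StreamChunk =
  | { type: "text"; content: string }
  | { type: "product"; id: string; title: string; price: number };

function* injectProducts(
  intro: string,
  products: { id: string; title: string; price: number }[],
  closing: string
): Generator<StreamChunk> {
  yield { type: "text", content: intro };
  for (const p of products) yield { type: "product", ...p };
  yield { type: "text", content: closing };
}
```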

Smart Suggestions

Purpose: Follow-up question suggestions to continue conversation

Types:

  1. Show More: "Näita veel tooteid" (show more products)
  2. Refine Budget: "Näita odavamaid variante" (cheaper alternatives)
  3. Category Explore: "Näita raamatuid" (explore specific category)
  4. Related Occasions: "Mis sobib emadepäevaks?" (related occasions)

Generation: Dynamic based on context and search results

Example:

{
  "suggestions": [
    { "text": "Näita veel sünnipäevakingitusi", "intent": "show_more" },
    { "text": "Mis sobib alla 30 euro?", "intent": "budget_alternatives" },
    { "text": "Näita kinkeraamatuid", "intent": "category_explore" }
  ]
}

Location: app/api/chat/services/response/


8. State Persistence (Convex)

Purpose: Store conversation state for follow-ups and pronoun resolution

Stored Data:

interface ConversationState {
  conversationId: string;
  userId?: string;

  // Author Context (for pronoun resolution)
  primaryAuthor?: string;
  authors?: string[];

  // Taxonomy Persistence
  lastProductType?: string;
  lastCategory?: string;

  // Exclusions (for "show more")
  excludedProductIds?: string[];

  // Budget Context
  lastBudget?: { min?: number; max?: number };

  // Metadata
  lastUpdated: number;
  messageCount: number;
}

Use Cases:

  1. Pronoun Resolution: "tema teosed" → resolves to last mentioned author
  2. Show More: Excludes previously shown products
  3. Budget Persistence: Remembers budget across turns
  4. Taxonomy Continuity: "näita raamatuid" → uses last recipient/occasion
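Pronoun resolution against the persisted state can be sketched as follows; the regex mirrors the guard in the Routing & Safeguards section, and `resolveAuthor` is a hypothetical helper:

```typescript
// Sketch: when the query contains an author pronoun and the persisted
// state holds a primaryAuthor, the pronoun resolves to that author.

interface StateSlice { primaryAuthor?: string }

const AUTHOR_PRONOUN = /\b(tema|teda|temalt|selle\s+autori|that\s+author)\b/i;

function resolveAuthor(query: string, state: StateSlice): string | undefined {
  if (AUTHOR_PRONOUN.test(query) && state.primaryAuthor) {
    return state.primaryAuthor; // "tema teosed" → last mentioned author
  }
  return undefined;
}
```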

Flow:

Location: convex/actions/setConversationContext.ts, convex/schema.ts


🚦 Routing & Safeguards

1. Deterministic Bypass

Purpose: Skip LLM calls for obvious category keywords

Patterns:

  • kinkekaart, gift card → productType: "Kinkekaart"
  • raamat, book → productType: "Raamat"
  • mäng, game → productType: "Mängud"

Benefits:

  • Zero latency
  • Zero cost
  • 100% accuracy

2. Fast-Path Allowlist

Purpose: Only safe intents can use fast classifier short-circuit

Allowed Intents:

  • show_more_products - Simple pagination
  • greeting - No search needed
  • question - General questions
  • cheaper_alternatives - Budget refinement
  • budget_alternatives - Budget refinement
  • purchase_confirmation - Confirmation
  • Occasion intents - Clear intent, no ambiguity

Blocked Intents:

  • product_search - Needs full context
  • author_search - Needs pronoun resolution
  • product_inquiry - Needs product details

3. Author/Pronoun Guard

Purpose: Prevent fast classifier from hijacking pronoun queries

Detection:

// Pattern 1: Explicit author names
const hasAuthorPattern = /\b[A-ZÕÄÖÜõäöü][a-zõäöü]+...(?:lt|i\s+teosed)/i;

// Pattern 2: Author pronouns
const hasAuthorPronoun = /\b(tema|teda|temalt|selle\s+autori|that\s+author)/i;

Action: Skip fast classifier, force enhanced LLM extraction

Example:

  • Query: "näita veel tema teoseid"
  • Without guard: Fast classifier → intent: show_more_products
  • With guard: Enhanced LLM → intent: author_search, resolve pronoun → authorName: "Tolkien"

4. Confidence-Aware Routing

Thresholds:

  • High (≥ 0.7): Execute search directly
  • Medium (0.5-0.7): Search if has product signals, otherwise clarify
  • Low (< 0.5): Ask clarifying question
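The three bands above can be sketched as a single routing function:

```typescript
// Sketch of confidence-aware routing: high confidence searches
// directly, medium depends on product signals, low always clarifies.

function routeByConfidence(confidence: number, hasProductSignals: boolean): "search" | "clarify" {
  if (confidence >= 0.7) return "search"; // high: execute search directly
  if (confidence >= 0.5) return hasProductSignals ? "search" : "clarify"; // medium
  return "clarify"; // low: ask a clarifying question
}
```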

Signal Detection: See Context Signals Documentation


Configuration & Feature Flags

Environment Variables

# === Context Extraction ===
# Enable parallel race between classifier and main extractor
PARALLEL_CONTEXT_EXTRACTION_ENABLED=true

# Head start for classifier (ms) before main extractor starts
PARALLEL_CLASSIFIER_HEADSTART_MS=200

# Force-skip fast classifier globally
CONTEXT_CLASSIFIER_DISABLED=false

# Enable enhanced semantic prompt with conversation state
ENHANCED_SEMANTIC_PROMPT=true

# Enable hierarchical category refinement
HIERARCHICAL_CATEGORY_ENABLED=true

# === Orchestration ===
# Enable parallel orchestrator for optimized TTFC
PARALLEL_EXECUTION_ENABLE=true

# Enable clarifying questions for vague queries
CLARIFYING_QUESTIONS_ENABLED=true

# === Models ===
# Context extraction model (Groq)
CONTEXT_EXTRACTION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct

# Fast classifier model (Groq)
FAST_CLASSIFIER_MODEL=meta-llama/llama-4-scout-17b-16e-instruct

# Response generation model (OpenAI)
OPENAI_MODEL=gpt-4o

# === Search ===
# Enable search result randomization for diversity
ENABLE_SEARCH_RANDOMIZE=true

# Enable semantic reranking
ENABLE_SEMANTIC_RERANK=true

# === Debugging ===
# Enable verbose debug logging
CHAT_DEBUG_LOGS=true

# Enable search debug logs
SEARCH_DEBUG_LOGS=true

Data Flow Summary

Input Data

// HTTP Request
POST /api/chat
Content-Type: application/json

{
  "messages": [
    { "role": "user", "content": "Kingitus emale sünnipäevaks alla 50 euro" }
  ],
  "conversationId": "conv_123"
}

Intermediate Data

// GiftContext (after extraction)
{
  intent: "birthday_gift",
  occasion: "sünnipäev",
  recipient: "ema",
  productType: "Kingitused",
  productTypeHints: ["Raamat", "Kodu ja aed", "Ilu ja stiil"],
  budget: { max: 50 },
  language: "et",
  confidence: 0.8,
  meta: {
    classifierUsed: true,
    extractionDurationMs: 245
  }
}

Output Data

// Streaming Response
{
  "type": "text",
  "content": "Siin on mõned sobivad sünnipäevakingitused teie emale:\n\n"
}

{
  "type": "product",
  "id": "prod_456",
  "title": "Kaunid lillevaas",
  "price": 32.99,
  "category": "Kodu ja aed",
  "image": "...",
  "reasoning": "Ilus ja praktiline, sobib kodu kaunistamiseks..."
}

{
  "type": "text",
  "content": "\n\nKõik need kingitused mahuvad teie eelarve..."
}

{
  "type": "suggestions",
  "items": [
    "Näita veel sünnipäevakingitusi",
    "Mis sobib alla 30 euro?"
  ]
}
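On the client, the stream is consumed chunk by chunk. Assuming newline-delimited JSON as shown above (the actual wire format may differ), a minimal parser looks like:

```typescript
// Sketch of client-side chunk parsing for the streamed response.

type ChatChunk =
  | { type: "text"; content: string }
  | { type: "product"; id: string; title: string; price: number }
  | { type: "suggestions"; items: string[] };

function parseChunks(ndjson: string): ChatChunk[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank separator lines
    .map((line) => JSON.parse(line) as ChatChunk);
}
```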

Observability & Debugging

Debug Logging

Enable:

export CHAT_DEBUG_LOGS=true
export SEARCH_DEBUG_LOGS=true

Output Locations:

  1. Context Extraction:

     FAST CLASSIFIER CALLED: { query: "kingitus emale...", timestamp: "..." }
     FAST CLASSIFIER RESULT: { intent: "birthday_gift", confidence: 0.8, durationMs: 245 }

  2. Routing Decision:

     ROUTING DECISION: {
       intent: "product_search",
       confidence: 0.8,
       hasProductSignals: true,
       handler: "ProductSearchHandler"
     }

  3. Search Execution:

     MULTI-QUERY SEARCH: 3 variations generated
     SEARCH RESULTS: { vector: 12, text: 8, hybrid: 15, merged: 20 }
     DIVERSITY APPLIED: { before: 20, after: 7, categoryMix: true }

  4. Response Generation:

     RESPONSE STARTED: streaming enabled
     PRODUCT INJECTION: 5 products injected at position 245
     RESPONSE COMPLETE: { totalTokens: 456, durationMs: 1234 }

Telemetry Fields

GiftContext Meta:

{
  classifierUsed: boolean,
  classifierConfidence: number,
  classifierDurationMs: number,
  fallbackTriggered: boolean,
  extractionDurationMs: number,
  parallelMode: boolean,
  hierarchicalUsed: boolean,
  // ... more fields
}

Use Cases:

  • Performance monitoring
  • A/B testing (classifier vs main extractor)
  • Confidence calibration
  • Error tracking

Performance Characteristics

Latency Breakdown (Typical)

| Stage | Sequential | Parallel | Improvement |
| --- | --- | --- | --- |
| TTFC (Time to First Chunk) | 800-1200ms | <100ms | 8-12x faster |
| Context Extraction | 200-400ms | 200-400ms | Same |
| Product Search | 300-500ms | 300-500ms | Same |
| Response Generation | 200-400ms | 200-400ms | Same |
| Total | ~1200-1500ms | ~1200-1500ms | Same |

Key Insight: Parallel mode doesn't reduce total time, but dramatically improves perceived performance by showing immediate feedback.


Cost Optimization

LLM Call Hierarchy (cheapest → most expensive):

  1. Deterministic Bypass: $0 (no LLM)
  2. Fast Classifier: ~$0.0001 per query (Groq Llama, 120 tokens)
  3. Main Extractor: ~$0.0005 per query (Groq Llama, 500 tokens)
  4. Response Generation: ~$0.02 per query (OpenAI GPT-4o, 1000 tokens)

Cost Savings:

  • Deterministic bypass: ~10% of queries (100% cost saving)
  • Fast classifier fast-path: ~30% of queries (40% cost saving on extraction)
  • Total extraction savings: ~15-20% vs always using main extractor

Core Systems

Handlers & Orchestration

Search & Ranking

Response Generation


System Evolution

Current Version

  • Parallel orchestration for sub-100ms TTFC
  • Three-stage context extraction (bypass → classifier → main)
  • Multi-strategy search with semantic reranking
  • Streaming responses with mid-stream product injection

Recent Improvements

  • Added fast classifier for low-latency intent detection
  • Implemented parallel extraction race mode
  • Added author/pronoun skip guards
  • Enhanced diversity layer in search
  • Improved confidence scoring with signal detection

Future Roadmap

  • 🔜 Personalization based on user history
  • 🔜 Multi-turn conversation memory
  • 🔜 Image-based product recommendations
  • 🔜 Voice input support
  • 🔜 Real-time inventory integration

Quick Reference

Key Files

| Component | File Path | Lines |
| --- | --- | --- |
| HTTP Entry | app/api/chat/route.ts | 1-522 |
| Parallel Orchestrator | orchestrators/parallel-orchestrator.ts | 56-899 |
| Context Understanding | services/context-understanding/index.ts | 75-995 |
| Fast Classifier | services/context-understanding/fast-classifier.ts | 27-178 |
| Handler Router | handlers/handler-router.ts | - |
| Search Orchestrator | services/search/ | - |
| Response Orchestrator | services/response/ | - |

Key Concepts

  • TTFC: Time to First Chunk (target: <100ms)
  • GiftContext: Structured intent representation
  • Fast Path: Fast classifier short-circuit
  • Signal Detection: Meaningful vs fallback signals
  • Multi-Query Search: Parallel search strategies
  • Semantic Reranking: Final relevance scoring

Last Updated: 2025-01-17
Version: 2.0
Status: Production Ready