High-Level Architecture
Overview
This document provides a comprehensive overview of how the Agentic RAG Gift Recommendation System processes user messages from HTTP request to personalized product recommendations. The system combines LLM-powered context understanding, intelligent routing, multi-strategy search, and streaming response generation to deliver a ChatGPT-like gift shopping experience.
Core Goal: Transform conversational queries like "I need a gift for my mother's birthday under 50 euro" into personalized, relevant product recommendations, with the first response chunk arriving in well under a second.
System Architecture Overview
📋 Request Lifecycle (Happy Path)
Step-by-Step Flow
🔧 Key Components Deep Dive
1. HTTP Entry Point (route.ts)
Purpose: HTTP handler for /api/chat endpoint
Responsibilities:
- Parse incoming requests
- Validate API keys and environment
- Route to appropriate orchestrator (Parallel vs Sequential)
- Handle CORS and error responses
- Apply response headers
Flow:
Location: app/api/chat/route.ts
2. Parallel Orchestrator (ParallelOrchestrator)
Purpose: Optimized request flow with skeleton responses for sub-800ms Time to First Chunk (TTFC)
Key Features:
- Skeleton Response: Emits initial response in <100ms while processing continues
- Parallel Execution: Context extraction and response generation run concurrently
- Smart Routing: Intent-based handler selection
- Validation Gates: Nonsense detection, query validation
Flow:
Performance Benefits:
- TTFC: <100ms (skeleton) vs 800-1200ms (sequential)
- Total Time: ~1200-1500ms (same as sequential, but feels faster)
- UX: Immediate feedback, progressive loading
Location: app/api/chat/orchestrators/parallel-orchestrator.ts
3. Context Understanding System
Purpose: Extract structured GiftContext from natural language queries
Three-Stage Pipeline:
Stage 1: Deterministic Bypass
When: Obvious product keywords detected (e.g., "gift card", "kinkekaart")
Benefits:
- Zero LLM cost
- ~0ms latency
- 100% accuracy for known patterns
Example:
- Query: "show me gift cards"
- Bypass: Matches the `kinkekaart` pattern
- Result: `productType: "Kinkekaart"`, `confidence: 1.0`
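The bypass step can be sketched as a keyword table checked before any LLM call. This is a minimal illustration; the pattern list, type names, and the `detectBypass` helper are assumptions, not the production code.

```typescript
// Hedged sketch of the deterministic bypass: the pattern table and helper
// name are illustrative, not the production implementation.
interface BypassResult {
  productType: string;
  confidence: number;
}

const BYPASS_PATTERNS: Array<{ pattern: RegExp; productType: string }> = [
  { pattern: /kinkekaart|gift\s*card/i, productType: "Kinkekaart" },
  { pattern: /raamat|book/i, productType: "Raamat" },
  { pattern: /mäng|game/i, productType: "Mängud" },
];

function detectBypass(query: string): BypassResult | null {
  for (const { pattern, productType } of BYPASS_PATTERNS) {
    // A keyword hit is treated as certain: no LLM call, ~0ms, confidence 1.0.
    if (pattern.test(query)) return { productType, confidence: 1.0 };
  }
  return null; // no match: fall through to the fast classifier / main extractor
}
```

Because the table is checked first, every query it catches never reaches a model, which is where the zero-cost, zero-latency claim comes from.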
Stage 2: Fast Classifier
When: No bypass match, no author/pronoun patterns
Model: meta-llama/llama-4-scout-17b-16e-instruct (Groq)
Timeout: 4 seconds
Fast Path Intents:
- `show_more_products`
- `greeting`
- `question`
- `cheaper_alternatives`
- `budget_alternatives`
- `purchase_confirmation`
- Occasion-specific intents (Valentine's, Mother's Day, etc.)
Confidence Threshold: ≥ 0.1
Flow:
Stage 3: Main Extractor
When: Fast path fails, or author/pronoun detected
Model: meta-llama/llama-4-scout-17b-16e-instruct (Groq)
Timeout: 25 seconds
Features:
- Enhanced Semantic Prompt: Richer examples, pronoun resolution
- Conversation State Injection: Resolves "tema" → actual author name
- Full Context: Complete intent taxonomy, constraints, age detection
- Hierarchical Refinement: Optional category refinement when low confidence
Output: Complete GiftContext with all fields populated
4. GiftContext Structure
Purpose: Structured representation of user intent
interface GiftContext {
// Core Intent
intent: string; // "product_search", "author_search", etc.
confidence: number; // 0.0 - 1.0
// Gift Context
occasion?: string; // "sünnipäev", "valentinipäev", etc.
recipient?: string; // "ema", "sõber", "kolleeg", etc.
recipientGender?: 'male' | 'female' | 'unisex' | 'unknown';
ageGroup?: 'child' | 'teen' | 'adult' | 'elderly' | 'unknown';
ageBracket?: AgeBracket; // Fine-grained age ranges
recipientAge?: number;
// Product Signals
productType?: string; // "Raamat", "Mängud", "Kinkekaart", etc.
category?: string; // Specific category
productTypeHints?: string[]; // Multiple type suggestions
categoryHints?: string[]; // Multiple category suggestions
// Budget
budget?: {
min?: number;
max?: number;
hint?: string; // "affordable", "luxury", etc.
};
// Constraints
constraints?: string[]; // ["MITTE raamat", "eco-friendly", etc.]
// Author/Book Context
authorName?: string;
bookLanguage?: 'et' | 'en';
// Metadata
language: 'et' | 'en' | 'mixed';
isPopularQuery?: boolean;
timestamp?: number;
meta?: GiftContextMeta; // Telemetry
}
5. Handler Router
Purpose: Route requests to appropriate handler based on intent and context
Routing Logic:
Handlers:
| Handler | Intent Types | Purpose |
|---|---|---|
| ProductSearchHandler | product_search, author_search, occasion intents | Execute product search |
| ConversationalHandler | greeting, question, thank_you | Non-product responses |
| ClarifyingQuestionHandler | Low confidence queries | Ask clarifying questions |
| ProductInquiryHandler | product_inquiry | Answer questions about specific products |
| ShowMoreHandler | show_more_products | Pagination, more results |
Location: app/api/chat/handlers/handler-router.ts
6. Search Orchestrator
Purpose: Execute multi-strategy product search with quality filters
Search Pipeline:
Query Rewriting
Purpose: Generate multiple search queries to maximize recall
Strategies:
- Specific queries: Focus on detected product type/category
- Exploratory queries: Broaden to related categories
- Occasion-specific: Add occasion context (e.g., "birthday gift")
- Recipient-specific: Add recipient context (e.g., "for mother")
- Fallback queries: Generic gift searches as backup
Example:
Input: "Gift for mother's birthday under 50 euro"
Variations:
- "sünnipäevakingitus emale" (specific, Estonian)
- "kingitus emale" (broader)
- "birthday gift for mother" (English)
- "emadepäeva kingitus" (occasion alternative)
- "naistele kingitus" (gender-based fallback)
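The rewriting strategies above can be sketched as a function that emits query strings from specific to generic. The field names and the simple string templates here are assumptions for illustration; the real rewriter handles Estonian inflection and English variants.

```typescript
// Minimal sketch of query rewriting under assumed field names: build several
// search strings from the extracted context, most specific first.
interface RewriteContext {
  occasion?: string;  // e.g. "sünnipäev" (birthday)
  recipient?: string; // e.g. "emale" (for mother)
}

function rewriteQueries(ctx: RewriteContext): string[] {
  const queries: string[] = [];
  if (ctx.occasion && ctx.recipient) {
    queries.push(`${ctx.occasion} kingitus ${ctx.recipient}`); // specific: occasion + recipient
  }
  if (ctx.recipient) {
    queries.push(`kingitus ${ctx.recipient}`); // broader: recipient only
  }
  queries.push("kingitus"); // generic fallback so search never comes back empty
  return Array.from(new Set(queries)); // dedupe, preserving specific-first order
}
```

Keeping the most specific variation first matters later: when results are merged, specific hits should outrank generic fallback hits.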
Multi-Query Search
Purpose: Execute multiple search strategies in parallel
Search Types:
Vector Search (Convex):
- Uses embeddings for semantic similarity
- Good for: "romantic gift", "practical gift", concepts
- Model: OpenAI `text-embedding-3-small`
Text Search (Convex):
- Keyword matching on title, category, product type
- Good for: Specific products, brands, exact matches
Hybrid Search:
- Combines vector + text with weighted scoring
- Vector weight: 0.7, Text weight: 0.3
- Best of both worlds
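The weighted merge can be sketched as follows, using the 0.7/0.3 weights quoted above. The hit shape and the assumption that both search types return scores normalized to 0..1 are illustrative.

```typescript
// Sketch of hybrid score merging with the weights quoted above (vector 0.7,
// text 0.3); hit shapes and score normalization are assumptions.
interface ScoredHit {
  id: string;
  score: number; // assumed normalized to 0..1 per search type
}

function mergeHybrid(
  vectorHits: ScoredHit[],
  textHits: ScoredHit[],
  vectorWeight = 0.7,
  textWeight = 0.3,
): ScoredHit[] {
  const combined = new Map<string, number>();
  for (const { id, score } of vectorHits) {
    combined.set(id, (combined.get(id) ?? 0) + vectorWeight * score);
  }
  for (const { id, score } of textHits) {
    combined.set(id, (combined.get(id) ?? 0) + textWeight * score);
  }
  return Array.from(combined, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A product found by both searches accumulates both weighted contributions, which is why hybrid results tend to rank "agreed-upon" products first.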
Filtering & Diversity
Filters Applied:
- Exclusion Filter: Remove previously shown products
- Budget Filter: `price >= budget.min && price <= budget.max`
- Language Filter: For books, filter by `bookLanguage`
- Constraint Filter: Apply negative constraints (`MITTE raamat`)
- Author Filter: For author searches, match author name
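The first two filters can be sketched on a minimal product shape. This is a simplification under assumed names; the real chain also covers language, constraints, and author matching.

```typescript
// Sketch of the exclusion + budget filters on an assumed minimal product
// shape; the production chain applies more filters than shown here.
interface Product {
  id: string;
  price: number;
}

function applyFilters(
  products: Product[],
  excludedIds: Set<string>,
  budget: { min?: number; max?: number } = {},
): Product[] {
  const min = budget.min ?? 0;
  const max = budget.max ?? Number.POSITIVE_INFINITY;
  return products.filter(
    (p) => !excludedIds.has(p.id) && p.price >= min && p.price <= max,
  );
}
```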
Diversity Enhancements:
- Category Diversity: Mix of product types (not all books)
- Price Diversity: Spread across budget range (low/mid/high)
- Gender Boost: Prioritize gender-appropriate products
- Freshness: Include some newer products
Example:
Query: "Valentine gifts under 100 euro for girlfriend"
Without Diversity:
- 5× Romantic novels (all books, 15-20€)
With Diversity:
- 1× Romantic novel (18€)
- 1× Scented candle set (32€)
- 1× Jewelry (45€)
- 1× Spa gift set (28€)
- 1× Chocolate box (15€)
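The category-diversity behavior in the example above can be approximated with a round-robin over category buckets. This is a deliberate simplification; the real layer also balances price bands, gender fit, and freshness.

```typescript
// Simplified sketch of category diversity: round-robin one pick per category
// so the final slate is not dominated by a single product type.
interface Candidate {
  id: string;
  category: string;
}

function diversify(candidates: Candidate[], limit: number): Candidate[] {
  const buckets = new Map<string, Candidate[]>();
  for (const c of candidates) {
    const bucket = buckets.get(c.category);
    if (bucket) bucket.push(c);
    else buckets.set(c.category, [c]);
  }
  const picked: Candidate[] = [];
  while (picked.length < limit) {
    let tookAny = false;
    for (const bucket of buckets.values()) {
      const next = bucket.shift();
      if (next) {
        picked.push(next);
        tookAny = true;
        if (picked.length === limit) return picked;
      }
    }
    if (!tookAny) break; // every bucket exhausted
  }
  return picked;
}
```

With three books and two other categories in the pool, the first round already yields one item per category, matching the "not all books" behavior described above.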
Semantic Reranking
Purpose: Final quality ranking based on semantic relevance
Provider: Cohere Rerank API (fallback: OpenAI embeddings)
How It Works:
- Take top 20-30 candidates from search
- Send query + product titles to reranking API
- Get relevance scores (0-1)
- Sort by score, return top 5-7
Benefits:
- Better relevance than pure keyword matching
- Understands nuanced queries ("romantic but practical")
- Cross-lingual matching (Estonian query → English products)
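The "sort by score, return top N" step can be sketched as below. The score objects stand in for a reranking provider's response shape (index into the candidate list plus a 0..1 relevance); the actual API payload is an assumption here.

```typescript
// Sketch of the final rerank step: given relevance scores keyed by candidate
// index (a stand-in for the reranking API's response), sort and keep the top
// slice.
interface RerankScore {
  index: number;     // position in the candidate array sent to the API
  relevance: number; // 0..1, higher is better
}

function topAfterRerank<T>(candidates: T[], scores: RerankScore[], topN: number): T[] {
  return [...scores]
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN)
    .map((s) => candidates[s.index]);
}
```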
Location: app/api/chat/services/search/
7. Response Orchestrator
Purpose: Generate streaming chat responses with product card injection
Response Pipeline:
System Prompt Generation
Purpose: Dynamic prompt based on context and products
Components:
- Base Persona: Gift recommendation expert, friendly, Estonian/English
- Product Context: Injected product details (title, price, category)
- User Context: Occasion, recipient, budget from `GiftContext`
- Conversation History: Recent messages for continuity
- Constraints: Apply user preferences (e.g., "no books")
- Constraints: Apply user preferences (e.g., "no books")
Example:
// Generated for "Gift for mother's birthday under 50 euro"
{
system: `You are a friendly Estonian gift recommendation expert.
User Context:
- Occasion: Birthday (sünnipäev)
- Recipient: Mother (ema)
- Budget: Under 50 euro
Available Products:
1. "Kaunid lillevaas" - 32.99€ (Home & Garden)
2. "Lõhnaküünal lavendel" - 18.50€ (Candles)
3. "Kinkeraamat 'Südamega kokk'" - 24.99€ (Books)
...
Provide personalized recommendations explaining why each gift suits the recipient.`,
temperature: 0.8,
model: "gpt-4o"
}
Product Card Injection
Purpose: Insert structured product cards into streaming response
Format:
{
"type": "product",
"id": "product_123",
"title": "Kaunid lillevaas",
"price": 32.99,
"category": "Kodu ja aed",
"image": "https://...",
"url": "https://...",
"reasoning": "Perfect for mother who loves flowers..."
}
Injection Point: After initial explanation text, before closing remarks
Benefits:
- Structured data for frontend rendering
- Seamless integration with streaming text
- Progressive loading (products appear as stream progresses)
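The streamed chunks shown above can be modeled as a discriminated union, which lets a frontend narrow on `type` safely while rendering the stream. Field names follow this document's examples; treat the exact shape as illustrative.

```typescript
// Discriminated union over the stream chunk types used in this document's
// examples; the exact field set is illustrative.
type StreamEvent =
  | { type: "text"; content: string }
  | { type: "product"; id: string; title: string; price: number; reasoning?: string }
  | { type: "suggestions"; items: string[] };

function renderEvent(event: StreamEvent): string {
  switch (event.type) {
    case "text":
      return event.content;
    case "product":
      return `[${event.title} - ${event.price.toFixed(2)}€]`;
    case "suggestions":
      return event.items.map((s) => `(${s})`).join(" ");
  }
}
```

The exhaustive `switch` means adding a new chunk type later becomes a compile error at every render site, which is the main payoff of the union over a loose `any` payload.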
Smart Suggestions
Purpose: Follow-up question suggestions to continue conversation
Types:
- Show More: "Näita veel tooteid" (show more products)
- Refine Budget: "Näita odavamaid variante" (cheaper alternatives)
- Category Explore: "Näita raamatuid" (explore specific category)
- Related Occasions: "Mis sobib emadepäevaks?" (related occasions)
Generation: Dynamic based on context and search results
Example:
{
"suggestions": [
{ "text": "Näita veel sünnipäevakingitusi", "intent": "show_more" },
{ "text": "Mis sobib alla 30 euro?", "intent": "budget_alternatives" },
{ "text": "Näita kinkeraamatuid", "intent": "category_explore" }
]
}
Location: app/api/chat/services/response/
8. State Persistence (Convex)
Purpose: Store conversation state for follow-ups and pronoun resolution
Stored Data:
interface ConversationState {
conversationId: string;
userId?: string;
// Author Context (for pronoun resolution)
primaryAuthor?: string;
authors?: string[];
// Taxonomy Persistence
lastProductType?: string;
lastCategory?: string;
// Exclusions (for "show more")
excludedProductIds?: string[];
// Budget Context
lastBudget?: { min?: number; max?: number };
// Metadata
lastUpdated: number;
messageCount: number;
}
Use Cases:
- Pronoun Resolution: "tema teosed" → resolves to last mentioned author
- Show More: Excludes previously shown products
- Budget Persistence: Remembers budget across turns
- Taxonomy Continuity: "näita raamatuid" → uses last recipient/occasion
Flow:
Location: convex/actions/setConversationContext.ts, convex/schema.ts
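The "show more" exclusion use case amounts to folding newly shown product IDs into the stored list on every turn. A sketch under assumed names (the state shape mirrors `ConversationState` above; the helper is hypothetical):

```typescript
// Sketch of updating stored state after a turn: fold newly shown products
// into the exclusion list and bump the metadata. Helper name is assumed.
interface StoredState {
  excludedProductIds?: string[];
  messageCount: number;
  lastUpdated: number;
}

function recordShownProducts(state: StoredState, shownIds: string[], now: number): StoredState {
  return {
    ...state,
    // Dedupe so repeated "show more" turns do not grow the list with duplicates.
    excludedProductIds: Array.from(new Set([...(state.excludedProductIds ?? []), ...shownIds])),
    messageCount: state.messageCount + 1,
    lastUpdated: now,
  };
}
```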
🚦 Routing & Safeguards
1. Deterministic Bypass
Purpose: Skip LLM calls for obvious category keywords
Patterns:
- `kinkekaart`, `gift card` → `productType: "Kinkekaart"`
- `raamat`, `book` → `productType: "Raamat"`
- `mäng`, `game` → `productType: "Mängud"`
Benefits:
- Zero latency
- Zero cost
- 100% accuracy
2. Fast-Path Allowlist
Purpose: Only safe intents can use fast classifier short-circuit
Allowed Intents:
- `show_more_products` - simple pagination
- `greeting` - no search needed
- `question` - general questions
- `cheaper_alternatives` - budget refinement
- `budget_alternatives` - budget refinement
- `purchase_confirmation` - confirmation
- Occasion intents - clear intent, no ambiguity
Blocked Intents:
- `product_search` - needs full context
- `author_search` - needs pronoun resolution
- `product_inquiry` - needs product details
3. Author/Pronoun Guard
Purpose: Prevent fast classifier from hijacking pronoun queries
Detection:
// Pattern 1: Explicit author names
const hasAuthorPattern = /\b[A-ZÕÄÖÜõäöü][a-zõäöü]+...(?:lt|i\s+teosed)/i;
// Pattern 2: Author pronouns
const hasAuthorPronoun = /\b(tema|teda|temalt|selle\s+autori|that\s+author)/i;
Action: Skip fast classifier, force enhanced LLM extraction
Example:
- Query: "näita veel tema teoseid"
- Without guard: Fast classifier → `intent: show_more_products`
- With guard: Enhanced LLM → `intent: author_search`, pronoun resolved → `authorName: "Tolkien"`
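The pronoun half of the guard can be sketched directly from the pattern quoted above (the author-name pattern is omitted here since it is shown truncated). The helper name is an assumption.

```typescript
// Sketch of the pronoun guard using the pattern quoted above; a hit forces
// the enhanced extractor instead of the fast classifier.
const AUTHOR_PRONOUN = /\b(tema|teda|temalt|selle\s+autori|that\s+author)/i;

function shouldSkipFastClassifier(query: string): boolean {
  return AUTHOR_PRONOUN.test(query);
}
```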
4. Confidence-Aware Routing
Thresholds:
- High (≥ 0.7): Execute search directly
- Medium (0.5-0.7): Search if has product signals, otherwise clarify
- Low (< 0.5): Ask clarifying question
Signal Detection: See Context Signals Documentation
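The thresholds above reduce to a small routing function; the names here are assumed for illustration.

```typescript
// Sketch of confidence-aware routing per the thresholds above.
type RoutingAction = "search" | "clarify";

function routeByConfidence(confidence: number, hasProductSignals: boolean): RoutingAction {
  if (confidence >= 0.7) return "search";                                 // high: search directly
  if (confidence >= 0.5) return hasProductSignals ? "search" : "clarify"; // medium: depends on signals
  return "clarify";                                                       // low: ask a clarifying question
}
```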
Configuration & Feature Flags
Environment Variables
# === Context Extraction ===
# Enable parallel race between classifier and main extractor
PARALLEL_CONTEXT_EXTRACTION_ENABLED=true
# Head start for classifier (ms) before main extractor starts
PARALLEL_CLASSIFIER_HEADSTART_MS=200
# Force-skip fast classifier globally
CONTEXT_CLASSIFIER_DISABLED=false
# Enable enhanced semantic prompt with conversation state
ENHANCED_SEMANTIC_PROMPT=true
# Enable hierarchical category refinement
HIERARCHICAL_CATEGORY_ENABLED=true
# === Orchestration ===
# Enable parallel orchestrator for optimized TTFC
PARALLEL_EXECUTION_ENABLE=true
# Enable clarifying questions for vague queries
CLARIFYING_QUESTIONS_ENABLED=true
# === Models ===
# Context extraction model (Groq)
CONTEXT_EXTRACTION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
# Fast classifier model (Groq)
FAST_CLASSIFIER_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
# Response generation model (OpenAI)
OPENAI_MODEL=gpt-4o
# === Search ===
# Enable search result randomization for diversity
ENABLE_SEARCH_RANDOMIZE=true
# Enable semantic reranking
ENABLE_SEMANTIC_RERANK=true
# === Debugging ===
# Enable verbose debug logging
CHAT_DEBUG_LOGS=true
# Enable search debug logs
SEARCH_DEBUG_LOGS=true
Data Flow Summary
Input Data
// HTTP Request
POST /api/chat
Content-Type: application/json
{
"messages": [
{ "role": "user", "content": "Kingitus emale sünnipäevaks alla 50 euro" }
],
"conversationId": "conv_123"
}
Intermediate Data
// GiftContext (after extraction)
{
intent: "birthday_gift",
occasion: "sünnipäev",
recipient: "ema",
productType: "Kingitused",
productTypeHints: ["Raamat", "Kodu ja aed", "Ilu ja stiil"],
budget: { max: 50 },
language: "et",
confidence: 0.8,
meta: {
classifierUsed: true,
extractionDurationMs: 245
}
}
Output Data
// Streaming Response
{
"type": "text",
"content": "Siin on mõned sobivad sünnipäevakingitused teie emale:\n\n"
}
{
"type": "product",
"id": "prod_456",
"title": "Kaunid lillevaas",
"price": 32.99,
"category": "Kodu ja aed",
"image": "...",
"reasoning": "Ilus ja praktiline, sobib kodu kaunistamiseks..."
}
{
"type": "text",
"content": "\n\nKõik need kingitused mahuvad teie eelarve..."
}
{
"type": "suggestions",
"items": [
"Näita veel sünnipäevakingitusi",
"Mis sobib alla 30 euro?"
]
}
Observability & Debugging
Debug Logging
Enable:
export CHAT_DEBUG_LOGS=true
export SEARCH_DEBUG_LOGS=true
Output Locations:
- Context Extraction:
FAST CLASSIFIER CALLED: { query: "kingitus emale...", timestamp: "..." }
FAST CLASSIFIER RESULT: { intent: "birthday_gift", confidence: 0.8, durationMs: 245 }
- Routing Decision:
ROUTING DECISION: {
intent: "product_search",
confidence: 0.8,
hasProductSignals: true,
handler: "ProductSearchHandler"
}
- Search Execution:
MULTI-QUERY SEARCH: 3 variations generated
SEARCH RESULTS: { vector: 12, text: 8, hybrid: 15, merged: 20 }
DIVERSITY APPLIED: { before: 20, after: 7, categoryMix: true }
- Response Generation:
RESPONSE STARTED: streaming enabled
PRODUCT INJECTION: 5 products injected at position 245
RESPONSE COMPLETE: { totalTokens: 456, durationMs: 1234 }
Telemetry Fields
GiftContext Meta:
{
classifierUsed: boolean,
classifierConfidence: number,
classifierDurationMs: number,
fallbackTriggered: boolean,
extractionDurationMs: number,
parallelMode: boolean,
hierarchicalUsed: boolean,
// ... more fields
}
Use Cases:
- Performance monitoring
- A/B testing (classifier vs main extractor)
- Confidence calibration
- Error tracking
Performance Characteristics
Latency Breakdown (Typical)
| Stage | Sequential | Parallel | Improvement |
|---|---|---|---|
| TTFC (Time to First Chunk) | 800-1200ms | <100ms | 8-12x faster |
| Context Extraction | 200-400ms | 200-400ms | Same |
| Product Search | 300-500ms | 300-500ms | Same |
| Response Generation | 200-400ms | 200-400ms | Same |
| Total | ~1200-1500ms | ~1200-1500ms | Same |
Key Insight: Parallel mode doesn't reduce total time, but dramatically improves perceived performance by showing immediate feedback.
Cost Optimization
LLM Call Hierarchy (cheapest → most expensive):
- Deterministic Bypass: $0 (no LLM)
- Fast Classifier: ~$0.0001 per query (Groq Llama, 120 tokens)
- Main Extractor: ~$0.0005 per query (Groq Llama, 500 tokens)
- Response Generation: ~$0.02 per query (OpenAI GPT-4o, 1000 tokens)
Cost Savings:
- Deterministic bypass: ~10% of queries (100% cost saving)
- Fast classifier fast-path: ~30% of queries (40% cost saving on extraction)
- Total extraction savings: ~15-20% vs always using main extractor
Related Documentation
Core Systems
- Context Understanding - Context extraction deep dive
- Fast Classifier - Fast classifier implementation
- Query Specificity Detection - Specific vs vague queries
- Context Signals - Signal detection system
Handlers & Orchestration
- Orchestration System - Orchestrator patterns
- Handler Routing - Handler selection logic
Search & Ranking
- Product Search - Search implementation
- Semantic Reranking - Reranking strategy
Response Generation
- Response Streaming - Streaming implementation
- System Prompts - Prompt engineering
System Evolution
Current Version
- Parallel orchestration for sub-100ms TTFC
- Three-stage context extraction (bypass → classifier → main)
- Multi-strategy search with semantic reranking
- Streaming responses with mid-stream product injection
Recent Improvements
- Added fast classifier for low-latency intent detection
- Implemented parallel extraction race mode
- Added author/pronoun skip guards
- Enhanced diversity layer in search
- Improved confidence scoring with signal detection
Future Roadmap
- 🔜 Personalization based on user history
- 🔜 Multi-turn conversation memory
- 🔜 Image-based product recommendations
- 🔜 Voice input support
- 🔜 Real-time inventory integration
Quick Reference
Key Files
| Component | File Path | Lines |
|---|---|---|
| HTTP Entry | app/api/chat/route.ts | 1-522 |
| Parallel Orchestrator | orchestrators/parallel-orchestrator.ts | 56-899 |
| Context Understanding | services/context-understanding/index.ts | 75-995 |
| Fast Classifier | services/context-understanding/fast-classifier.ts | 27-178 |
| Handler Router | handlers/handler-router.ts | - |
| Search Orchestrator | services/search/ | - |
| Response Orchestrator | services/response/ | - |
Key Concepts
- TTFC: Time to First Chunk (target: <100ms)
- GiftContext: Structured intent representation
- Fast Path: Fast classifier short-circuit
- Signal Detection: Meaningful vs fallback signals
- Multi-Query Search: Parallel search strategies
- Semantic Reranking: Final relevance scoring
Last Updated: 2025-01-17
Version: 2.0
Status: Production Ready