High-Level Architecture
Overview
This document provides a comprehensive overview of how the Agentic RAG Gift Recommendation System processes user messages from HTTP request to personalized product recommendations. The system combines LLM-powered context understanding, intelligent routing, multi-strategy search, and streaming response generation to deliver a ChatGPT-like gift shopping experience.
Core Goal: Transform conversational queries like "I need a gift for my mother's birthday under 50 euro" into personalized, relevant product recommendations, with the first response chunk arriving in well under a second.
System Architecture Overview
📋 Request Lifecycle (Happy Path)
Step-by-Step Flow
🔧 Key Components Deep Dive
1. HTTP Entry Point (route.ts)
Purpose: HTTP handler for /api/chat endpoint
Responsibilities:
- Parse incoming requests
- Validate API keys and environment
- Route to appropriate orchestrator (Parallel vs Sequential)
- Handle CORS and error responses
- Apply response headers
Flow:
Location: app/api/chat/route.ts
2. Parallel Orchestrator (ParallelOrchestrator)
Purpose: Optimized request flow with skeleton responses for sub-800ms Time to First Chunk (TTFC)
Key Features:
- Skeleton Response: Emits initial response in <100ms while processing continues
- Parallel Execution: Context extraction and response generation run concurrently
- Smart Routing: Intent-based handler selection
- Validation Gates: Nonsense detection, query validation
Flow:
Performance Benefits:
- TTFC: <100ms (skeleton) vs 800-1200ms (sequential)
- Total Time: ~1200-1500ms (same as sequential, but feels faster)
- UX: Immediate feedback, progressive loading
Location: app/api/chat/orchestrators/parallel-orchestrator.ts
3. Context Understanding System
Purpose: Extract structured GiftContext from natural language queries
Three-Stage Pipeline:
Stage 1: Deterministic Bypass
When: Obvious product keywords detected (e.g., "gift card", "kinkekaart")
Benefits:
- Zero LLM cost
- ~0ms latency
- 100% accuracy for known patterns
Example:
- Query: "show me gift cards"
- Bypass: Matches the `kinkekaart` pattern
- Result: `productType: "Kinkekaart"`, `confidence: 1.0`
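The bypass step can be sketched as a keyword table checked before any LLM call. This is a minimal illustration; the pattern list, type names, and the `detectBypass` helper are assumptions, not the production code.

```typescript
// Hedged sketch of the deterministic bypass: the pattern table and helper
// name are illustrative, not the production implementation.
interface BypassResult {
  productType: string;
  confidence: number;
}

const BYPASS_PATTERNS: Array<{ pattern: RegExp; productType: string }> = [
  { pattern: /kinkekaart|gift\s*card/i, productType: "Kinkekaart" },
  { pattern: /raamat|book/i, productType: "Raamat" },
  { pattern: /mäng|game/i, productType: "Mängud" },
];

function detectBypass(query: string): BypassResult | null {
  for (const { pattern, productType } of BYPASS_PATTERNS) {
    // A keyword hit is treated as certain: no LLM call, ~0ms, confidence 1.0.
    if (pattern.test(query)) return { productType, confidence: 1.0 };
  }
  return null; // no match: fall through to the fast classifier / main extractor
}
```

Because the table is checked first, every query it catches never reaches a model, which is where the zero-cost, zero-latency claim comes from.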
Stage 2: Fast Classifier
When: No bypass match, no author/pronoun patterns
Model: meta-llama/llama-4-scout-17b-16e-instruct (Groq)
Timeout: 4 seconds
Fast Path Intents:
- `show_more_products`
- `greeting`
- `question`
- `cheaper_alternatives`
- `budget_alternatives`
- `purchase_confirmation`
- Occasion-specific intents (Valentine's, Mother's Day, etc.)
Confidence Threshold: ≥ 0.1
Flow:
Stage 3: Main Extractor
When: Fast path fails, or author/pronoun detected
Model: meta-llama/llama-4-scout-17b-16e-instruct (Groq)
Timeout: 25 seconds
Features:
- Enhanced Semantic Prompt: Richer examples, pronoun resolution
- Conversation State Injection: Resolves "tema" → actual author name
- Full Context: Complete intent taxonomy, constraints, age detection
- Hierarchical Refinement: Optional category refinement when low confidence
Output: Complete GiftContext with all fields populated
4. GiftContext Structure
Purpose: Structured representation of user intent
interface GiftContext {
// Core Intent
intent: string; // "product_search", "author_search", etc.
confidence: number; // 0.0 - 1.0
// Gift Context
occasion?: string; // "sünnipäev", "valentinipäev", etc.
recipient?: string; // "ema", "sõber", "kolleeg", etc.
recipientGender?: 'male' | 'female' | 'unisex' | 'unknown';
ageGroup?: 'child' | 'teen' | 'adult' | 'elderly' | 'unknown';
ageBracket?: AgeBracket; // Fine-grained age ranges
recipientAge?: number;
// Product Signals
productType?: string; // "Raamat", "Mängud", "Kinkekaart", etc.
category?: string; // Specific category
productTypeHints?: string[]; // Multiple type suggestions
categoryHints?: string[]; // Multiple category suggestions
// Budget
budget?: {
min?: number;
max?: number;
hint?: string; // "affordable", "luxury", etc.
};
// Constraints
constraints?: string[]; // ["MITTE raamat", "eco-friendly", etc.]
// Author/Book Context
authorName?: string;
bookLanguage?: 'et' | 'en';
// Metadata
language: 'et' | 'en' | 'mixed';
isPopularQuery?: boolean;
timestamp?: number;
meta?: GiftContextMeta; // Telemetry
}
5. Handler Router
Purpose: Route requests to appropriate handler based on intent and context
Routing Logic:
Handlers:
| Handler | Intent Types | Purpose |
|---|---|---|
| ProductSearchHandler | product_search, author_search, occasion intents | Execute product search |
| ConversationalHandler | greeting, question, thank_you | Non-product responses |
| ClarifyingQuestionHandler | Low confidence queries | Ask clarifying questions |
| ProductInquiryHandler | product_inquiry | Answer questions about specific products |
| ShowMoreHandler | show_more_products | Pagination, more results |
Location: app/api/chat/handlers/handler-router.ts
6. Search Orchestrator
Purpose: Execute multi-strategy product search with quality filters
Search Pipeline:
Query Rewriting
Purpose: Generate multiple search queries to maximize recall
Strategies:
- Specific queries: Focus on detected product type/category
- Exploratory queries: Broaden to related categories
- Occasion-specific: Add occasion context (e.g., "birthday gift")
- Recipient-specific: Add recipient context (e.g., "for mother")
- Fallback queries: Generic gift searches as backup
Example:
Input: "Gift for mother's birthday under 50 euro"
Variations:
- "sünnipäevakingitus emale" (specific, Estonian)
- "kingitus emale" (broader)
- "birthday gift for mother" (English)
- "emadepäeva kingitus" (occasion alternative)
- "naistele kingitus" (gender-based fallback)
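The rewriting strategies above can be sketched as a function that emits query strings from specific to generic. The field names and the simple string templates here are assumptions for illustration; the real rewriter handles Estonian inflection and English variants.

```typescript
// Minimal sketch of query rewriting under assumed field names: build several
// search strings from the extracted context, most specific first.
interface RewriteContext {
  occasion?: string;  // e.g. "sünnipäev" (birthday)
  recipient?: string; // e.g. "emale" (for mother)
}

function rewriteQueries(ctx: RewriteContext): string[] {
  const queries: string[] = [];
  if (ctx.occasion && ctx.recipient) {
    queries.push(`${ctx.occasion} kingitus ${ctx.recipient}`); // specific: occasion + recipient
  }
  if (ctx.recipient) {
    queries.push(`kingitus ${ctx.recipient}`); // broader: recipient only
  }
  queries.push("kingitus"); // generic fallback so search never comes back empty
  return Array.from(new Set(queries)); // dedupe, preserving specific-first order
}
```

Keeping the most specific variation first matters later: when results are merged, specific hits should outrank generic fallback hits.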
Multi-Query Search
Purpose: Execute multiple search strategies in parallel
Search Types:
Vector Search (Convex):
- Uses embeddings for semantic similarity
- Good for: "romantic gift", "practical gift", concepts
- Model: OpenAI `text-embedding-3-small`
Text Search (Convex):
- Keyword matching on title, category, product type
- Good for: Specific products, brands, exact matches
Hybrid Search:
- Combines vector + text with weighted scoring
- Vector weight: 0.7, Text weight: 0.3
- Best of both worlds
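The weighted merge can be sketched as follows, using the 0.7/0.3 weights quoted above. The hit shape and the assumption that both search types return scores normalized to 0..1 are illustrative.

```typescript
// Sketch of hybrid score merging with the weights quoted above (vector 0.7,
// text 0.3); hit shapes and score normalization are assumptions.
interface ScoredHit {
  id: string;
  score: number; // assumed normalized to 0..1 per search type
}

function mergeHybrid(
  vectorHits: ScoredHit[],
  textHits: ScoredHit[],
  vectorWeight = 0.7,
  textWeight = 0.3,
): ScoredHit[] {
  const combined = new Map<string, number>();
  for (const { id, score } of vectorHits) {
    combined.set(id, (combined.get(id) ?? 0) + vectorWeight * score);
  }
  for (const { id, score } of textHits) {
    combined.set(id, (combined.get(id) ?? 0) + textWeight * score);
  }
  return Array.from(combined, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A product found by both searches accumulates both weighted contributions, which is why hybrid results tend to rank "agreed-upon" products first.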
Filtering & Diversity
Filters Applied:
- Exclusion Filter: Remove previously shown products
- Budget Filter: `price >= budget.min && price <= budget.max`
- Language Filter: For books, filter by `bookLanguage`
- Constraint Filter: Apply negative constraints (`MITTE raamat`)
- Author Filter: For author searches, match author name
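The first two filters can be sketched on a minimal product shape. This is a simplification under assumed names; the real chain also covers language, constraints, and author matching.

```typescript
// Sketch of the exclusion + budget filters on an assumed minimal product
// shape; the production chain applies more filters than shown here.
interface Product {
  id: string;
  price: number;
}

function applyFilters(
  products: Product[],
  excludedIds: Set<string>,
  budget: { min?: number; max?: number } = {},
): Product[] {
  const min = budget.min ?? 0;
  const max = budget.max ?? Number.POSITIVE_INFINITY;
  return products.filter(
    (p) => !excludedIds.has(p.id) && p.price >= min && p.price <= max,
  );
}
```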
Diversity Enhancements:
- Category Diversity: Mix of product types (not all books)
- Price Diversity: Spread across budget range (low/mid/high)
- Gender Boost: Prioritize gender-appropriate products
- Freshness: Include some newer products
Example:
Query: "Valentine gifts under 100 euro for girlfriend"
Without Diversity:
- 5× Romantic novels (all books, 15-20€)
With Diversity:
- 1× Romantic novel (18€)
- 1× Scented candle set (32€)
- 1× Jewelry (45€)
- 1× Spa gift set (28€)
- 1× Chocolate box (15€)
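The category-diversity behavior in the example above can be approximated with a round-robin over category buckets. This is a deliberate simplification; the real layer also balances price bands, gender fit, and freshness.

```typescript
// Simplified sketch of category diversity: round-robin one pick per category
// so the final slate is not dominated by a single product type.
interface Candidate {
  id: string;
  category: string;
}

function diversify(candidates: Candidate[], limit: number): Candidate[] {
  const buckets = new Map<string, Candidate[]>();
  for (const c of candidates) {
    const bucket = buckets.get(c.category);
    if (bucket) bucket.push(c);
    else buckets.set(c.category, [c]);
  }
  const picked: Candidate[] = [];
  while (picked.length < limit) {
    let tookAny = false;
    for (const bucket of buckets.values()) {
      const next = bucket.shift();
      if (next) {
        picked.push(next);
        tookAny = true;
        if (picked.length === limit) return picked;
      }
    }
    if (!tookAny) break; // every bucket exhausted
  }
  return picked;
}
```

With three books and two other categories in the pool, the first round already yields one item per category, matching the "not all books" behavior described above.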
Semantic Reranking
Purpose: Final quality ranking based on semantic relevance
Provider: Cohere Rerank API (fallback: OpenAI embeddings)
How It Works:
- Take top 20-30 candidates from search
- Send query + product titles to reranking API
- Get relevance scores (0-1)
- Sort by score, return top 5-7
Benefits:
- Better relevance than pure keyword matching
- Understands nuanced queries ("romantic but practical")
- Cross-lingual matching (Estonian query → English products)
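The "sort by score, return top N" step can be sketched as below. The score objects stand in for a reranking provider's response shape (index into the candidate list plus a 0..1 relevance); the actual API payload is an assumption here.

```typescript
// Sketch of the final rerank step: given relevance scores keyed by candidate
// index (a stand-in for the reranking API's response), sort and keep the top
// slice.
interface RerankScore {
  index: number;     // position in the candidate array sent to the API
  relevance: number; // 0..1, higher is better
}

function topAfterRerank<T>(candidates: T[], scores: RerankScore[], topN: number): T[] {
  return [...scores]
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN)
    .map((s) => candidates[s.index]);
}
```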
Location: app/api/chat/services/search/
7. Response Orchestrator
Purpose: Generate streaming chat responses with product card injection
Response Pipeline:
System Prompt Generation
Purpose: Dynamic prompt based on context and products
Components:
- Base Persona: Gift recommendation expert, friendly, Estonian/English
- Product Context: Injected product details (title, price, category)
- User Context: Occasion, recipient, budget from `GiftContext`
- Conversation History: Recent messages for continuity
- Constraints: Apply user preferences (e.g., "no books")
- Constraints: Apply user preferences (e.g., "no books")
Example:
// Generated for "Gift for mother's birthday under 50 euro"
{
system: `You are a friendly Estonian gift recommendation expert.
User Context:
- Occasion: Birthday (sünnipäev)
- Recipient: Mother (ema)
- Budget: Under 50 euro
Available Products:
1. "Kaunid lillevaas" - 32.99€ (Home & Garden)
2. "Lõhnaküünal lavendel" - 18.50€ (Candles)
3. "Kinkeraamat 'Südamega kokk'" - 24.99€ (Books)
...
Provide personalized recommendations explaining why each gift suits the recipient.`,
temperature: 0.8,
model: "gpt-4o"
}
Product Card Injection
Purpose: Insert structured product cards into streaming response
Format:
{
"type": "product",
"id": "product_123",
"title": "Kaunid lillevaas",
"price": 32.99,
"category": "Kodu ja aed",
"image": "https://...",
"url": "https://...",
"reasoning": "Perfect for mother who loves flowers..."
}
Injection Point: After initial explanation text, before closing remarks
Benefits:
- Structured data for frontend rendering
- Seamless integration with streaming text
- Progressive loading (products appear as stream progresses)
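The streamed chunks shown above can be modeled as a discriminated union, which lets a frontend narrow on `type` safely while rendering the stream. Field names follow this document's examples; treat the exact shape as illustrative.

```typescript
// Discriminated union over the stream chunk types used in this document's
// examples; the exact field set is illustrative.
type StreamEvent =
  | { type: "text"; content: string }
  | { type: "product"; id: string; title: string; price: number; reasoning?: string }
  | { type: "suggestions"; items: string[] };

function renderEvent(event: StreamEvent): string {
  switch (event.type) {
    case "text":
      return event.content;
    case "product":
      return `[${event.title} - ${event.price.toFixed(2)}€]`;
    case "suggestions":
      return event.items.map((s) => `(${s})`).join(" ");
  }
}
```

The exhaustive `switch` means adding a new chunk type later becomes a compile error at every render site, which is the main payoff of the union over a loose `any` payload.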
Smart Suggestions
Purpose: Follow-up question suggestions to continue conversation
Types:
- Show More: "Näita veel tooteid" (show more products)
- Refine Budget: "Näita odavamaid variante" (cheaper alternatives)
- Category Explore: "Näita raamatuid" (explore specific category)
- Related Occasions: "Mis sobib emadepäevaks?" (related occasions)
Generation: Dynamic based on context and search results
Example:
{
"suggestions": [
{ "text": "Näita veel sünnipäevakingitusi", "intent": "show_more" },
{ "text": "Mis sobib alla 30 euro?", "intent": "budget_alternatives" },
{ "text": "Näita kinkeraamatuid", "intent": "category_explore" }
]
}
Location: app/api/chat/services/response/
8. State Persistence (Convex)
Purpose: Store conversation state for follow-ups and pronoun resolution
Stored Data:
interface ConversationState {
conversationId: string;
userId?: string;
// Author Context (for pronoun resolution)
primaryAuthor?: string;
authors?: string[];
// Taxonomy Persistence
lastProductType?: string;
lastCategory?: string;
// Exclusions (for "show more")
excludedProductIds?: string[];
// Budget Context
lastBudget?: { min?: number; max?: number };
// Metadata
lastUpdated: number;
messageCount: number;
}
Use Cases:
- Pronoun Resolution: "tema teosed" → resolves to last mentioned author
- Show More: Excludes previously shown products
- Budget Persistence: Remembers budget across turns
- Taxonomy Continuity: "näita raamatuid" → uses last recipient/occasion
Flow:
Location: convex/actions/setConversationContext.ts, convex/schema.ts
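The "show more" exclusion use case amounts to folding newly shown product IDs into the stored list on every turn. A sketch under assumed names (the state shape mirrors `ConversationState` above; the helper is hypothetical):

```typescript
// Sketch of updating stored state after a turn: fold newly shown products
// into the exclusion list and bump the metadata. Helper name is assumed.
interface StoredState {
  excludedProductIds?: string[];
  messageCount: number;
  lastUpdated: number;
}

function recordShownProducts(state: StoredState, shownIds: string[], now: number): StoredState {
  return {
    ...state,
    // Dedupe so repeated "show more" turns do not grow the list with duplicates.
    excludedProductIds: Array.from(new Set([...(state.excludedProductIds ?? []), ...shownIds])),
    messageCount: state.messageCount + 1,
    lastUpdated: now,
  };
}
```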
🚦 Routing & Safeguards
1. Deterministic Bypass
Purpose: Skip LLM calls for obvious category keywords
Patterns:
- `kinkekaart`, `gift card` → `productType: "Kinkekaart"`
- `raamat`, `book` → `productType: "Raamat"`
- `mäng`, `game` → `productType: "Mängud"`
Benefits:
- Zero latency
- Zero cost
- 100% accuracy
2. Fast-Path Allowlist
Purpose: Only safe intents can use fast classifier short-circuit
Allowed Intents:
- `show_more_products` - simple pagination
- `greeting` - no search needed
- `question` - general questions
- `cheaper_alternatives` - budget refinement
- `budget_alternatives` - budget refinement
- `purchase_confirmation` - confirmation
- Occasion intents - clear intent, no ambiguity
Blocked Intents:
- `product_search` - needs full context
- `author_search` - needs pronoun resolution
- `product_inquiry` - needs product details
3. Author/Pronoun Guard
Purpose: Prevent fast classifier from hijacking pronoun queries
Detection:
// Pattern 1: Explicit author names
const hasAuthorPattern = /\b[A-ZÕÄÖÜõäöü][a-zõäöü]+...(?:lt|i\s+teosed)/i;
// Pattern 2: Author pronouns
const hasAuthorPronoun = /\b(tema|teda|temalt|selle\s+autori|that\s+author)/i;
Action: Skip fast classifier, force enhanced LLM extraction
Example:
- Query: "näita veel tema teoseid"
- Without guard: Fast classifier → `intent: show_more_products`
- With guard: Enhanced LLM → `intent: author_search`, pronoun resolved → `authorName: "Tolkien"`
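The pronoun half of the guard can be sketched directly from the pattern quoted above (the author-name pattern is omitted here since it is shown truncated). The helper name is an assumption.

```typescript
// Sketch of the pronoun guard using the pattern quoted above; a hit forces
// the enhanced extractor instead of the fast classifier.
const AUTHOR_PRONOUN = /\b(tema|teda|temalt|selle\s+autori|that\s+author)/i;

function shouldSkipFastClassifier(query: string): boolean {
  return AUTHOR_PRONOUN.test(query);
}
```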
4. Confidence-Aware Routing
Thresholds:
- High (≥ 0.7): Execute search directly
- Medium (0.5-0.7): Search if has product signals, otherwise clarify
- Low (< 0.5): Ask clarifying question
Signal Detection: See Context Signals Documentation
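The thresholds above reduce to a small routing function; the names here are assumed for illustration.

```typescript
// Sketch of confidence-aware routing per the thresholds above.
type RoutingAction = "search" | "clarify";

function routeByConfidence(confidence: number, hasProductSignals: boolean): RoutingAction {
  if (confidence >= 0.7) return "search";                                 // high: search directly
  if (confidence >= 0.5) return hasProductSignals ? "search" : "clarify"; // medium: depends on signals
  return "clarify";                                                       // low: ask a clarifying question
}
```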
Configuration & Feature Flags
Environment Variables
# === Context Extraction ===
# Enable parallel race between classifier and main extractor
PARALLEL_CONTEXT_EXTRACTION_ENABLED=true
# Head start for classifier (ms) before main extractor starts
PARALLEL_CLASSIFIER_HEADSTART_MS=200
# Force-skip fast classifier globally
CONTEXT_CLASSIFIER_DISABLED=false
# Enable enhanced semantic prompt with conversation state
ENHANCED_SEMANTIC_PROMPT=true
# Enable hierarchical category refinement
HIERARCHICAL_CATEGORY_ENABLED=true
# === Orchestration ===
# Enable parallel orchestrator for optimized TTFC
PARALLEL_EXECUTION_ENABLE=true
# Enable clarifying questions for vague queries
CLARIFYING_QUESTIONS_ENABLED=true
# === Models ===
# Context extraction model (Groq)
CONTEXT_EXTRACTION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
# Fast classifier model (Groq)
FAST_CLASSIFIER_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
# Response generation model (OpenAI)
OPENAI_MODEL=gpt-4o
# === Search ===
# Enable search result randomization for diversity
ENABLE_SEARCH_RANDOMIZE=true
# Enable semantic reranking
ENABLE_SEMANTIC_RERANK=true
# === Debugging ===
# Enable verbose debug logging
CHAT_DEBUG_LOGS=true
# Enable search debug logs
SEARCH_DEBUG_LOGS=true
Data Flow Summary
Input Data
// HTTP Request
POST /api/chat
Content-Type: application/json
{
"messages": [
{ "role": "user", "content": "Kingitus emale sünnipäevaks alla 50 euro" }
],
"conversationId": "conv_123"
}
Intermediate Data
// GiftContext (after extraction)
{
intent: "birthday_gift",
occasion: "sünnipäev",
recipient: "ema",
productType: "Kingitused",
productTypeHints: ["Raamat", "Kodu ja aed", "Ilu ja stiil"],
budget: { max: 50 },
language: "et",
confidence: 0.8,
meta: {
classifierUsed: true,
extractionDurationMs: 245
}
}
Output Data
// Streaming Response
{
"type": "text",
"content": "Siin on mõned sobivad sünnipäevakingitused teie emale:\n\n"
}
{
"type": "product",
"id": "prod_456",
"title": "Kaunid lillevaas",
"price": 32.99,
"category": "Kodu ja aed",
"image": "...",
"reasoning": "Ilus ja praktiline, sobib kodu kaunistamiseks..."
}
{
"type": "text",
"content": "\n\nKõik need kingitused mahuvad teie eelarve..."
}
{
"type": "suggestions",
"items": [
"Näita veel sünnipäevakingitusi",
"Mis sobib alla 30 euro?"
]
}
Observability & Debugging
Debug Logging
Enable:
export CHAT_DEBUG_LOGS=true
export SEARCH_DEBUG_LOGS=true
Output Locations:
- Context Extraction:
FAST CLASSIFIER CALLED: { query: "kingitus emale...", timestamp: "..." }
FAST CLASSIFIER RESULT: { intent: "birthday_gift", confidence: 0.8, durationMs: 245 }
- Routing Decision:
ROUTING DECISION: {
intent: "product_search",
confidence: 0.8,
hasProductSignals: true,
handler: "ProductSearchHandler"
}
- Search Execution:
MULTI-QUERY SEARCH: 3 variations generated
SEARCH RESULTS: { vector: 12, text: 8, hybrid: 15, merged: 20 }
DIVERSITY APPLIED: { before: 20, after: 7, categoryMix: true }
- Response Generation:
RESPONSE STARTED: streaming enabled
PRODUCT INJECTION: 5 products injected at position 245
RESPONSE COMPLETE: { totalTokens: 456, durationMs: 1234 }
Telemetry Fields
GiftContext Meta:
{
classifierUsed: boolean,
classifierConfidence: number,
classifierDurationMs: number,
fallbackTriggered: boolean,
extractionDurationMs: number,
parallelMode: boolean,
hierarchicalUsed: boolean,
// ... more fields
}
Use Cases:
- Performance monitoring
- A/B testing (classifier vs main extractor)
- Confidence calibration
- Error tracking
Performance Characteristics
Latency Breakdown (Typical)
| Stage | Sequential | Parallel | Improvement |
|---|---|---|---|
| TTFC (Time to First Chunk) | 800-1200ms | <100ms | 8-12x faster |
| Context Extraction | 200-400ms | 200-400ms | Same |
| Product Search | 300-500ms | 300-500ms | Same |
| Response Generation | 200-400ms | 200-400ms | Same |
| Total | ~1200-1500ms | ~1200-1500ms | Same |
Key Insight: Parallel mode doesn't reduce total time, but dramatically improves perceived performance by showing immediate feedback.
Cost Optimization
LLM Call Hierarchy (cheapest → most expensive):
- Deterministic Bypass: $0 (no LLM)
- Fast Classifier: ~$0.0001 per query (Groq Llama, 120 tokens)
- Main Extractor: ~$0.0005 per query (Groq Llama, 500 tokens)
- Response Generation: ~$0.02 per query (OpenAI GPT-4o, 1000 tokens)
Cost Savings:
- Deterministic bypass: ~10% of queries (100% cost saving)
- Fast classifier fast-path: ~30% of queries (40% cost saving on extraction)
- Total extraction savings: ~15-20% vs always using main extractor
Related Documentation
Core Systems
- Context Understanding - Context extraction deep dive
- Fast Classifier - Fast classifier implementation
- Query Specificity Detection - Specific vs vague queries
- Context Signals - Signal detection system
Handlers & Orchestration
- Orchestration System - Orchestrator patterns
- Handler Routing - Handler selection logic
Search & Ranking
- Product Search - Search implementation
- Semantic Reranking - Reranking strategy
Response Generation
- Response Streaming - Streaming implementation
- System Prompts - Prompt engineering
System Evolution
Current Version
- Parallel orchestration for sub-100ms TTFC
- Three-stage context extraction (bypass → classifier → main)
- Multi-strategy search with semantic reranking
- Streaming responses with mid-stream product injection
Recent Improvements
- Added fast classifier for low-latency intent detection
- Implemented parallel extraction race mode
- Added author/pronoun skip guards
- Enhanced diversity layer in search
- Improved confidence scoring with signal detection
Future Roadmap
- 🔜 Personalization based on user history
- 🔜 Multi-turn conversation memory
- 🔜 Image-based product recommendations
- 🔜 Voice input support
- 🔜 Real-time inventory integration
Quick Reference
Key Files
| Component | File Path | Lines |
|---|---|---|
| HTTP Entry | app/api/chat/route.ts | 1-522 |
| Parallel Orchestrator | orchestrators/parallel-orchestrator.ts | 56-899 |
| Context Understanding | services/context-understanding/index.ts | 75-995 |
| Fast Classifier | services/context-understanding/fast-classifier.ts | 27-178 |
| Handler Router | handlers/handler-router.ts | - |
| Search Orchestrator | services/search/ | - |
| Response Orchestrator | services/response/ | - |
Key Concepts
- TTFC: Time to First Chunk (target: <100ms)
- GiftContext: Structured intent representation
- Fast Path: Fast classifier short-circuit
- Signal Detection: Meaningful vs fallback signals
- Multi-Query Search: Parallel search strategies
- Semantic Reranking: Final relevance scoring
Last Updated: 2025-01-17
Version: 2.0
Status: Production Ready