Orchestration System Architecture
AI-Powered Gift Recommendation System
Multi-layered orchestration architecture for intelligent product search and recommendation
Table of Contents
- System Overview
- Architecture Diagram
- Core Components
- Execution Flows
- Performance Optimizations
- Configuration & Toggles
- Data Flow Examples
System Overview
The orchestration system is a sophisticated multi-layered architecture that coordinates the entire gift recommendation pipeline, from user query to final AI response. It consists of four primary orchestrators, multiple handlers, and specialized services that work together to deliver sub-second response times while maintaining high-quality recommendations.
Key Features
- Parallel Execution: Optimized flow with <800ms TTFC (Time To First Chunk)
- Context-Aware Search: Multi-stage filtering with semantic reranking
- Intelligent Routing: Intent-based handler selection
- Streaming Responses: Real-time product card injection
- Graceful Degradation: Fallbacks at every critical stage of the pipeline
Performance Metrics
| Metric | Parallel Mode | Sequential Mode |
|---|---|---|
| TTFC | <800ms | >1.9s (blocked by context extraction) |
| Context Extraction | ~900ms (non-blocking) | ~1.9s (blocking) |
| Search Pipeline | same optimized pipeline | same optimized pipeline |
| AI First Chunk | <100ms skeleton, AI text at ~1s | after context extraction completes |
Architecture Diagram
High-Level System Architecture
Detailed Orchestration Flow
Handler Routing Decision Tree
Core Components
1. ParallelOrchestrator
Purpose: Optimized execution flow that runs context extraction in parallel with immediate user feedback
Key Features:
- Non-blocking context extraction
- Immediate skeleton response (<100ms)
- Dynamic product injection during streaming
- Query validation before processing
- Intent-based routing with fallbacks
Performance Impact:
- TTFC reduced from >1.9s (blocking) to <800ms
- User sees response in <100ms (skeleton)
- AI starts streaming at ~1s
- Products appear at ~1.4s
Main Responsibilities:
1. Query Validation (nonsense detection)
2. Context Extraction (parallel, non-blocking)
3. Vague Intent Detection (multi-factor)
4. Search Orchestration (when needed)
5. Response Streaming (with delayed cards)
6. Context Persistence
Critical Logic:
- Vague Query Detection: Combines confidence, signals, and explicit mentions
- Book Fallback: Automatic category broadening for gift queries
- Memory Resolution: Check stored products before search
- Repetition Detection: Stop streaming on AI loops
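The vague-query check combines the factors above rather than trusting any single one. A minimal sketch, assuming illustrative field names and a hypothetical 0.6 confidence threshold (neither is specified in this document):

```typescript
// Hypothetical shape of the extracted context; field names and the
// 0.6 threshold are illustrative assumptions.
interface ExtractedContext {
  confidence: number;          // LLM confidence in the extracted intent (0-1)
  signals: string[];           // refinement signals (budget, recipient, ...)
  explicitMentions: string[];  // concrete products/categories the user named
}

function isVagueQuery(ctx: ExtractedContext): boolean {
  const lowConfidence = ctx.confidence < 0.6;
  const noSignals = ctx.signals.length === 0;
  const nothingExplicit = ctx.explicitMentions.length === 0;
  // All three factors must agree before routing to clarification,
  // so one weak signal alone never blocks a search.
  return lowConfidence && noSignals && nothingExplicit;
}
```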
2. ContextOrchestrator
Purpose: Extract, enrich, and manage conversation context
Sub-Components:
- extract-context.ts - LLM-based intent extraction
- fetch-stored-context.ts - Retrieve conversation history
- refinement-signals.ts - Apply user feedback signals
- author-workflow.ts - Author detection and clarification
- exclude-reset.ts - Smart exclude list management
- product-inquiry.ts - Follow-up question routing
- persist-context.ts - Save context to database
Context Preservation: Preserves taxonomy for follow-up intents:
- show_more_products
- cheaper_alternatives
- budget_alternatives
Key Features:
- Multi-source context merging (LLM + DB + Client)
- Exclude list pruning (max 30 items)
- Category hints prioritization (frontend → DB)
- Budget constraint preservation
- Author clarification workflow
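A sketch of how the three context sources might be merged under the rules above. The GiftContext shape and function signature are assumptions; only the priority order (frontend → DB), budget preservation, and the 30-item exclude cap come from this document.

```typescript
// Hypothetical context shape; real types live in the context orchestrator.
interface GiftContext {
  categoryHints?: string[];
  budget?: { min?: number; max?: number };
  excludeIds: string[];
}

function mergeContext(
  llm: GiftContext,     // fresh LLM extraction for this turn
  stored: GiftContext,  // conversation history from the DB
  client: GiftContext,  // hints sent by the frontend
): GiftContext {
  return {
    // Category hints: frontend wins, then DB, then fresh extraction
    categoryHints: client.categoryHints ?? stored.categoryHints ?? llm.categoryHints,
    // Budget constraints are preserved from earlier turns if this turn sets none
    budget: llm.budget ?? stored.budget,
    // Union the exclude lists, then keep only the last 30 entries (FIFO)
    excludeIds: [...new Set([...stored.excludeIds, ...llm.excludeIds])].slice(-30),
  };
}
```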
3. SearchOrchestrator
Purpose: Coordinate the complete search pipeline from query to final products
Search Pipeline (6 Phases):
Phase 1: Query Rewriting
- Generate query variations (primary + fallbacks)
- Apply focus strategies (semantic, category, type)
- Handle show_more special case
Phase 2: Multi-Stage Funnel
- Stage A: Initial filtering (max 100 candidates)
- Stage B: Budget & constraint filtering (max 50)
- Stage C: Category distribution (max 20 finalists)
Phase 3: LLM Semantic Reranking
- Cohere rerank-v3.5 scoring
- User intent alignment
- Quality-based filtering (0.5 threshold, fallback 0.3)
Phase 4: Diversity Selection
- Category diversity
- Price range distribution
- Product type balancing
- Final 3 selection
Phase 4.5: Gender Affinity Boost
- Category-gender affinity scoring
- Boost multiplier: 0.5x to 1.8x
- Re-sort after boosting (see the sketch below, after Phase 6)
Note: Phase 5 (context persistence) runs outside the search pipeline; see PHASE5_CONTEXT_ENABLE under Configuration & Toggles.
Phase 6: Estonian Product Prioritization
- Language-based boosting
- Cultural relevance scoring
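The Phase 4.5 boost can be sketched as follows. This is a minimal illustration: the ScoredProduct shape and the affinity lookup are assumptions; only the 0.5x-1.8x clamp and the re-sort come from this document.

```typescript
// Hypothetical Phase 4.5 sketch; affinityFor is an assumed lookup function.
interface ScoredProduct { id: string; category: string; score: number }

function applyGenderAffinityBoost(
  products: ScoredProduct[],
  affinityFor: (category: string) => number, // raw category-gender affinity
): ScoredProduct[] {
  return products
    .map((p) => {
      // Clamp the multiplier into the documented 0.5x-1.8x range
      const boost = Math.min(1.8, Math.max(0.5, affinityFor(p.category)));
      return { ...p, score: p.score * boost };
    })
    .sort((a, b) => b.score - a.score); // re-sort after boosting
}
```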
Fallback Mechanisms:
- Book-only results: Auto-retry with gift categories
- Language fallback: Retry without language filter
- Gift card exclusion: EXCLUDE_GIFT_CARDS constraint
- Quality safety net: Minimum threshold 0.3
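The book-only fallback from the list above might look like this sketch. runFunnel, GIFT_CATEGORIES, and the query/product shapes are hypothetical stand-ins for the real pipeline types.

```typescript
// Hypothetical sketch of the book-only fallback; all names are assumptions.
interface SearchQuery { text: string; categories?: string[]; isGiftQuery: boolean }
interface Product { id: string; category: string }

declare function runFunnel(query: SearchQuery): Promise<Product[]>;
const GIFT_CATEGORIES = ["toys", "home", "wellness"]; // placeholder list

async function searchWithBookFallback(query: SearchQuery): Promise<Product[]> {
  let results = await runFunnel(query);
  const booksOnly = results.length > 0 && results.every((p) => p.category === "books");
  if (booksOnly && query.isGiftQuery) {
    // Gift query answered only with books: broaden the categories and retry once
    results = await runFunnel({ ...query, categories: GIFT_CATEGORIES });
  }
  return results;
}
```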
4. ResponseOrchestrator
Purpose: Generate and stream AI responses with dynamic product injection
Key Features:
- GPT-5.1 chat model (gpt-5.1-chat-latest)
- Delayed card injection (@180 chars)
- Token usage monitoring (2500 token limit)
- Repetition detection (consecutive & frequent)
- Fallback responses on failures
Response Modes:
- Product Response (generateWithDelayedCards):
  - Stream AI text first
  - Inject product cards after 180 chars
  - Include safety prefaces
  - Add smart suggestions
  - Track performance metrics
- Conversational Response (generateConversationalResponse):
  - No products, no skeleton
  - Greeting/clarification handling
  - Smart suggestion buttons
  - Prompt compliance validation
- Product Inquiry Response (generateProductInquiryResponse):
  - Answer follow-up questions
  - Use stored product data
  - No new search
Quality Controls:
- Product mention detection (validation)
- Repetition detection (3+ consecutive words)
- Token limit warnings (>90% utilization)
- Cut-off handling (graceful ellipsis)
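The delayed-card injection behind generateWithDelayedCards can be sketched as an async generator. The chunk and card types are assumptions; only the 180-character trigger comes from this document.

```typescript
// Hypothetical streaming sketch; ProductCard is an assumed shape.
interface ProductCard { id: string; title: string }

async function* withDelayedCards(
  textChunks: AsyncIterable<string>,
  cards: ProductCard[],
): AsyncGenerator<string | ProductCard[]> {
  let streamed = 0;
  let injected = false;
  for await (const chunk of textChunks) {
    yield chunk;                      // text flows to the user immediately
    streamed += chunk.length;
    if (!injected && streamed >= 180) {
      yield cards;                    // inject product cards mid-stream
      injected = true;
    }
  }
  if (!injected) yield cards;         // short responses still get their cards
}
```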
Execution Flows
Parallel Flow (Optimized)
Sequential Flow (Legacy)
Context Orchestration Detail
Search Orchestration Pipeline
Performance Optimizations
1. Parallel Execution Mode
Problem: Sequential context extraction blocked user feedback for ~1.9s
Solution: Parallel orchestration with immediate skeleton response
Benefits:
- TTFC: >1.9s → <800ms
- User perception: Instant feedback
- Context extraction: Non-blocking
Implementation:
// Old (Sequential): each stage blocks the next
const context = await ContextOrchestrator.orchestrate() // ~1.9s BLOCKING
const search = await SearchOrchestrator.orchestrate()
const response = await ResponseOrchestrator.generate()

// New (Parallel): send the skeleton first, extract context in the background
sendSkeleton() // ~50ms
const [ctx, searchPrep] = await Promise.all([
  contextPromise,   // ~900ms, non-blocking for the user
  searchPrepPromise // ~100ms
])
streamResponseImmediately() // <800ms TTFC
injectProductsDynamically()
2. Context Warmup
- OpenAI connection pre-warming
- LLM model caching
- Database connection pooling
3. Search Pipeline Optimizations
Stage Limits (Configured via SearchOrchestratorConfig):
MAX_CANDIDATES_STAGE_A = 100 // Down from 200
MAX_CANDIDATES_STAGE_B = 50 // Down from 100
MAX_FINALISTS = 20 // Down from 30
RERANK_MIN_FINALISTS = 3 // Skip rerank if < 3
Savings: ~200-300ms per request
4. Exclude List Pruning
Problem: Long conversations exhaust product pool
Solution: Keep only last 30 excludes (FIFO)
if (excludeIds.length > 30) {
excludeIds = excludeIds.slice(-30)
}
5. Smart Quality Fallbacks
Preferred Threshold: 0.5 (high quality)
Minimum Threshold: 0.3 (fallback)
if (highQualityProducts.length < 3) {
return mediumQualityProducts // Fallback
}
6. Repetition Detection
Stops streaming if AI loops:
- Consecutive: 3+ same words in a row
- Frequent: 3+ occurrences in 20-word window
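A minimal sketch of both rules, assuming plain whitespace tokenization (a production detector would likely normalize punctuation and skip stop words):

```typescript
// Hypothetical detector for the two repetition rules above.
function isLooping(recentText: string): boolean {
  const words = recentText.toLowerCase().split(/\s+/).filter(Boolean);
  // Rule 1: the same word 3+ times in a row
  for (let i = 2; i < words.length; i++) {
    if (words[i] === words[i - 1] && words[i] === words[i - 2]) return true;
  }
  // Rule 2: the same word 3+ times within the trailing 20-word window
  const counts = new Map<string, number>();
  for (const w of words.slice(-20)) {
    const n = (counts.get(w) ?? 0) + 1;
    if (n >= 3) return true;
    counts.set(w, n);
  }
  return false;
}
```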
7. Token Limit Monitoring
- Max tokens: 2500
- Warning at 90% utilization
- Graceful cut-off handling
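A sketch of the token-budget check, using the documented 2500-token limit and 90% warning threshold (the function shape is assumed):

```typescript
// Token-budget check; limit and warning level come from this doc.
const MAX_TOKENS = 2500;

function checkTokenBudget(tokensUsed: number): "ok" | "warn" | "cutoff" {
  if (tokensUsed >= MAX_TOKENS) return "cutoff";     // end stream with graceful ellipsis
  if (tokensUsed >= MAX_TOKENS * 0.9) return "warn"; // log a utilization warning
  return "ok";
}
```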
Configuration & Toggles
Environment Variables
# Execution Mode
PARALLEL_EXECUTION_ENABLE=true # false = sequential (legacy)
# Context Management
PHASE5_CONTEXT_ENABLE=true # Enable context persistence
# Debug & Logging
CHAT_DEBUG_LOGS=true # Verbose logging
NODE_ENV=production # Production/development
# AI Models
OPENAI_API_KEY=sk-... # GPT-5.1 API key
# Database
NEXT_PUBLIC_CONVEX_URL=https://... # Convex backend
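A sketch of how the execution-mode flag might select the flow; sequentialOrchestrator is an assumed name for the legacy path, which this document only references as "sequential":

```typescript
// Hypothetical mode selection; both orchestrator bindings are assumed names.
declare const parallelOrchestrator: { run: () => Promise<void> };
declare const sequentialOrchestrator: { run: () => Promise<void> };

const parallelEnabled = process.env.PARALLEL_EXECUTION_ENABLE !== "false"; // default on (assumed)
const orchestrator = parallelEnabled ? parallelOrchestrator : sequentialOrchestrator;
```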
Search Orchestrator Config
File: orchestrators/search-orchestrator.config.ts
export class SearchOrchestratorConfig {
// Phase Toggles
static PHASE2_ENABLED = true; // Multi-stage funnel
static PHASE3_ENABLED = true; // LLM reranking
static PHASE4_ENABLED = true; // Diversity selection
static PHASE6_ENABLED = true; // Estonian boost
// Stage Limits (Performance Tuning)
static MAX_CANDIDATES_STAGE_A = 100;
static MAX_CANDIDATES_STAGE_B = 50;
static MAX_FINALISTS = 20;
static MAX_PER_CATEGORY = 5;
// Quality Thresholds
static PREFERRED_QUALITY_THRESHOLD = 0.5;
static MINIMUM_QUALITY_THRESHOLD = 0.3;
static RERANK_MIN_FINALISTS = 3;
// Diagnostics
static DIAGNOSTICS_ENABLED = false;
static AUTHOR_SPLIT_REGEX = /[,;]/;
static SHOW_MORE_REGEX = /\b(näita\s+rohkem|show\s+more|veel|more)\b/i;
}
Response Configuration
File: app/chat/config.ts
export const chatConfig = {
productDescriptions: {
maxWords: 250, // Max words per response
sentencesPerProduct: 3, // Sentences per product description
}
}
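A caller might enforce the word cap like this; the import alias and helper are illustrative, not part of the config file:

```typescript
import { chatConfig } from "@/app/chat/config"; // path alias assumed

// Trim a generated product description to the configured word budget.
function capWords(text: string): string {
  const limit = chatConfig.productDescriptions.maxWords; // 250 by default
  return text.split(/\s+/).slice(0, limit).join(" ");
}
```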
Data Flow Examples
Example 1: Show More Products
Example 2: Vague Query with Clarification
Example 3: Author Clarification Workflow
Appendix: Key Files
Orchestrators
- app/api/chat/orchestrators/parallel-orchestrator.ts - Optimized flow
- app/api/chat/orchestrators/context-orchestrator/orchestrate.ts - Context extraction
- app/api/chat/orchestrators/search-orchestrator.ts - Search pipeline
- app/api/chat/orchestrators/response-orchestrator.ts - AI response generation
Handlers
- app/api/chat/handlers/handler-router.ts - Intent-based routing
- app/api/chat/handlers/product-search-handler.ts - Product search flow
- app/api/chat/handlers/clarifying-question-handler.ts - Clarification flow
- app/api/chat/handlers/conversational-handler.ts - Conversational flow
Services
- app/api/chat/services/query-rewriting/ - Query generation
- app/api/chat/services/product-search.ts - Multi-search execution
- app/api/chat/services/funnel.ts - Multi-stage filtering
- app/api/chat/services/rerank.ts - Semantic reranking
- app/api/chat/services/diversity.ts - Final selection
- app/api/chat/services/language.ts - Estonian boost
Configuration
- app/api/chat/orchestrators/search-orchestrator.config.ts - Search config
- app/chat/config.ts - Response config
Glossary
| Term | Definition |
|---|---|
| TTFC | Time To First Chunk - Time until user sees first AI response |
| Context Orchestration | Extract and manage conversation state |
| Search Orchestration | Multi-phase product search pipeline |
| Response Orchestration | AI response generation and streaming |
| Parallel Execution | Non-blocking context extraction with immediate feedback |
| Sequential Execution | Blocking context extraction before streaming |
| Funnel | Multi-stage candidate filtering (Stage A → B → C) |
| Reranking | LLM-based semantic scoring for relevance |
| Diversity Selection | Category and price distribution balancing |
| Skeleton | Empty product card placeholders for instant feedback |
| Delayed Cards | Product injection after AI text starts streaming |
| Context Preservation | Taxonomy inheritance for follow-up queries |
| Exclude List | Previously shown product IDs to avoid duplicates |
| Smart Suggestions | Category buttons for quick navigation |
Last Updated: 2025-11-16
Version: 1.0
Maintainer: AI Orchestration Team