Orchestration System Architecture

AI-Powered Gift Recommendation System
Multi-layered orchestration architecture for intelligent product search and recommendation


Table of Contents

  1. System Overview
  2. Architecture Diagram
  3. Core Components
  4. Execution Flows
  5. Performance Optimizations
  6. Configuration & Toggles
  7. Data Flow Examples

System Overview

The orchestration system is a sophisticated multi-layered architecture that coordinates the entire gift recommendation pipeline, from user query to final AI response. It consists of four primary orchestrators, multiple handlers, and specialized services that work together to deliver sub-second response times while maintaining high-quality recommendations.

Key Features

  • Parallel Execution: Optimized flow with <800ms TTFC (Time To First Chunk)
  • Context-Aware Search: Multi-stage filtering with semantic reranking
  • Intelligent Routing: Intent-based handler selection
  • Streaming Responses: Real-time product card injection
  • Graceful Degradation: Fallbacks at every critical junction

Performance Metrics

Metric             | Parallel Mode                    | Sequential Mode
TTFC               | <800ms                           | >1.9s
Context Extraction | non-blocking (parallel)          | ~1.9s (blocking)
Search Pipeline    | optimized                        | optimized
AI First Chunk     | <100ms skeleton; AI text at ~1s  | after blocking context extraction

Architecture Diagram

High-Level System Architecture

Detailed Orchestration Flow

Handler Routing Decision Tree


Core Components

1. ParallelOrchestrator

Purpose: Optimized execution flow that runs context extraction in parallel with immediate user feedback

Key Features:

  • Non-blocking context extraction
  • Immediate skeleton response (<100ms)
  • Dynamic product injection during streaming
  • Query validation before processing
  • Intent-based routing with fallbacks

Performance Impact:

  • TTFC cut from >1.9s to <800ms (sequential mode blocked ~1.9s on context extraction alone)
  • User sees response in <100ms (skeleton)
  • AI starts streaming at ~1s
  • Products appear at ~1.4s

Main Responsibilities:

1. Query Validation (nonsense detection)
2. Context Extraction (parallel, non-blocking)
3. Vague Intent Detection (multi-factor)
4. Search Orchestration (when needed)
5. Response Streaming (with delayed cards)
6. Context Persistence

Critical Logic:

  • Vague Query Detection: Combines confidence, signals, and explicit mentions (see the sketch after this list)
  • Book Fallback: Automatic category broadening for gift queries
  • Memory Resolution: Check stored products before search
  • Repetition Detection: Stop streaming on AI loops
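
A minimal sketch of the multi-factor vague-query check referenced above. Field names and the 0.5 confidence threshold are illustrative assumptions; only the combination rule (confidence + signals + explicit mentions) comes from this document:

interface ExtractedContext {
  confidence: number           // LLM confidence in the extracted intent (0-1)
  refinementSignals: string[]  // user feedback signals, e.g. "cheaper"
  explicitMentions: string[]   // concrete products, authors, or categories
}

// A query is treated as vague only when low confidence coincides with
// an absence of refinement signals and explicit mentions
function isVagueQuery(ctx: ExtractedContext): boolean {
  return (
    ctx.confidence < 0.5 &&
    ctx.refinementSignals.length === 0 &&
    ctx.explicitMentions.length === 0
  )
}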

2. ContextOrchestrator

Purpose: Extract, enrich, and manage conversation context

Sub-Components:

  • extract-context.ts - LLM-based intent extraction
  • fetch-stored-context.ts - Retrieve conversation history
  • refinement-signals.ts - Apply user feedback signals
  • author-workflow.ts - Author detection and clarification
  • exclude-reset.ts - Smart exclude list management
  • product-inquiry.ts - Follow-up question routing
  • persist-context.ts - Save context to database

Context Preservation: Taxonomy is carried over for these follow-up intents:

  • show_more_products
  • cheaper_alternatives
  • budget_alternatives

Key Features:

  • Multi-source context merging (LLM + DB + Client; sketched after this list)
  • Exclude list pruning (max 30 items)
  • Category hints prioritization (frontend → DB)
  • Budget constraint preservation
  • Author clarification workflow
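
A minimal sketch of multi-source merging, following the precedence rules above (frontend category hints before stored DB hints, 30-item exclude cap). The GiftContext type and field names are hypothetical:

interface GiftContext {
  categoryHints: string[]
  excludeIds: string[]
  budget?: { min?: number; max?: number }
}

function mergeContext(
  llm: Partial<GiftContext>,
  db: Partial<GiftContext>,
  client: Partial<GiftContext>,
): GiftContext {
  return {
    // Category hints: frontend (client) first, then stored DB context
    categoryHints: client.categoryHints?.length
      ? client.categoryHints
      : db.categoryHints ?? llm.categoryHints ?? [],
    // Merge excludes and keep only the most recent 30 (FIFO pruning)
    excludeIds: [...(db.excludeIds ?? []), ...(llm.excludeIds ?? [])].slice(-30),
    // Budget constraints are preserved across turns
    budget: llm.budget ?? db.budget,
  }
}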

3. SearchOrchestrator

Purpose: Coordinate the complete search pipeline from query to final products

Search Pipeline (6 Phases; a condensed sketch follows the phase list):

Phase 1: Query Rewriting

  • Generate query variations (primary + fallbacks)
  • Apply focus strategies (semantic, category, type)
  • Handle show_more special case

Phase 2: Multi-Stage Funnel

  • Stage A: Initial filtering (max 100 candidates)
  • Stage B: Budget & constraint filtering (max 50)
  • Stage C: Category distribution (max 20 finalists)

Phase 3: LLM Semantic Reranking

  • Cohere rerank-v3.5 scoring
  • User intent alignment
  • Quality-based filtering (0.5 threshold, fallback 0.3)

Phase 4: Diversity Selection

  • Category diversity
  • Price range distribution
  • Product type balancing
  • Final 3 selection

Phase 4.5: Gender Affinity Boost

  • Category-gender affinity scoring
  • Boost multiplier: 0.5x - 1.8x
  • Re-sort after boosting

Phase 6: Estonian Product Prioritization

  • Language-based boosting
  • Cultural relevance scoring
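
A condensed sketch of the phase sequencing above, using the toggles from SearchOrchestratorConfig (documented under Configuration & Toggles) and the GiftContext type from the earlier sketch. The declared service functions are hypothetical stand-ins for the services listed in the appendix:

interface Product { id: string; title: string; category: string }

declare function rewriteQuery(q: string, ctx: GiftContext): Promise<string[]>
declare function multiSearch(queries: string[], ctx: GiftContext): Promise<Product[]>
declare function runFunnel(c: Product[], ctx: GiftContext): Promise<Product[]>
declare function rerank(q: string, c: Product[]): Promise<Product[]>
declare function selectDiverse(c: Product[]): Product[]
declare function applyGenderBoost(c: Product[], ctx: GiftContext): Product[]
declare function boostEstonian(c: Product[]): Product[]

async function runSearchPipeline(query: string, ctx: GiftContext): Promise<Product[]> {
  const variations = await rewriteQuery(query, ctx)          // Phase 1
  let candidates = await multiSearch(variations, ctx)

  if (SearchOrchestratorConfig.PHASE2_ENABLED) {
    candidates = await runFunnel(candidates, ctx)            // Phase 2: Stage A → B → C
  }
  if (
    SearchOrchestratorConfig.PHASE3_ENABLED &&
    candidates.length >= SearchOrchestratorConfig.RERANK_MIN_FINALISTS
  ) {
    candidates = await rerank(query, candidates)             // Phase 3: Cohere rerank-v3.5
  }
  if (SearchOrchestratorConfig.PHASE4_ENABLED) {
    candidates = selectDiverse(candidates)                   // Phase 4: diversity
    candidates = applyGenderBoost(candidates, ctx)           // Phase 4.5: 0.5x-1.8x multiplier
  }
  if (SearchOrchestratorConfig.PHASE6_ENABLED) {
    candidates = boostEstonian(candidates)                   // Phase 6: language boost
  }
  return candidates.slice(0, 3)                              // final 3 selection
}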

Fallback Mechanisms:

  • Book-only results: Auto-retry with gift categories
  • Language fallback: Retry without language filter
  • Gift card exclusion: EXCLUDE_GIFT_CARDS constraint
  • Quality safety net: Minimum threshold 0.3
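
The book-only fallback above might look like this sketch (the category value and helper names are assumptions; Product is reused from the pipeline sketch):

declare const GIFT_CATEGORIES: string[]
declare function searchWithCategories(categories: string[]): Promise<Product[]>

// Auto-retry with broadened gift categories when every hit is a book
async function applyBookFallback(results: Product[]): Promise<Product[]> {
  if (results.length > 0 && results.every(p => p.category === 'books')) {
    return searchWithCategories(GIFT_CATEGORIES)
  }
  return results
}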

4. ResponseOrchestrator

Purpose: Generate and stream AI responses with dynamic product injection

Key Features:

  • GPT-5.1 chat model (gpt-5.1-chat-latest)
  • Delayed card injection (@180 chars)
  • Token usage monitoring (2500 token limit)
  • Repetition detection (consecutive & frequent)
  • Fallback responses on failures

Response Modes:

  1. Product Response (generateWithDelayedCards):

    • Stream AI text first
    • Inject product cards after 180 chars (sketched after this list)
    • Include safety prefaces
    • Add smart suggestions
    • Track performance metrics
  2. Conversational Response (generateConversationalResponse):

    • No products, no skeleton
    • Greeting/clarification handling
    • Smart suggestion buttons
    • Prompt compliance validation
  3. Product Inquiry Response (generateProductInquiryResponse):

    • Answer follow-up questions
    • Use stored product data
    • No new search
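
A hedged sketch of the delayed-card mechanism in generateWithDelayedCards (mode 1 above). Only the ~180-character trigger is documented; the stream event shape and names are illustrative:

interface ProductCard { id: string; title: string }
type StreamEvent =
  | { type: 'text'; chunk: string }
  | { type: 'cards'; cards: ProductCard[] }

async function* withDelayedCards(
  textStream: AsyncIterable<string>,
  cards: ProductCard[],
): AsyncGenerator<StreamEvent> {
  let emitted = 0
  let cardsSent = false
  for await (const chunk of textStream) {
    yield { type: 'text', chunk }
    emitted += chunk.length
    // Inject the product cards once ~180 characters of AI text have streamed
    if (!cardsSent && emitted >= 180) {
      yield { type: 'cards', cards }
      cardsSent = true
    }
  }
  // Very short responses still get their cards at the end
  if (!cardsSent) yield { type: 'cards', cards }
}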

Quality Controls:

  • Product mention detection (validation)
  • Repetition detection (3+ consecutive words)
  • Token limit warnings (>90% utilization)
  • Cut-off handling (graceful ellipsis)

Execution Flows

Parallel Flow (Optimized)

Sequential Flow (Legacy)

Context Orchestration Detail

Search Orchestration Pipeline


Performance Optimizations

1. Parallel Execution Mode

Problem: Sequential context extraction blocked user feedback for ~1.9s

Solution: Parallel orchestration with immediate skeleton response

Benefits:

  • TTFC: >1.9s → <800ms
  • User perception: Instant feedback
  • Context extraction: Non-blocking

Implementation:

// Old (Sequential)
const context = await ContextOrchestrator.orchestrate()  // ~1.9s, BLOCKING
const search = await SearchOrchestrator.orchestrate()
const response = await ResponseOrchestrator.generate()

// New (Parallel)
sendSkeleton()                 // ~50ms
void Promise.all([
  contextPromise,              // ~900ms, non-blocking
  searchPrepPromise,           // ~100ms
])
streamResponseImmediately()    // <800ms TTFC
injectProductsDynamically()

2. Context Warmup

  • OpenAI connection pre-warming
  • LLM model caching
  • Database connection pooling
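
A sketch of what the warmup step can look like. openai.models.list() is a real OpenAI Node SDK call; the Convex ping via fetch is a hypothetical stand-in:

import OpenAI from 'openai'

const openai = new OpenAI()

// Fire cheap requests at cold start so the first real query skips
// connection and TLS setup; failures here are deliberately ignored
export async function warmupConnections(): Promise<void> {
  await Promise.allSettled([
    openai.models.list(),                        // pre-warm the OpenAI connection pool
    fetch(process.env.NEXT_PUBLIC_CONVEX_URL!),  // touch the Convex backend (hypothetical ping)
  ])
}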

3. Search Pipeline Optimizations

Stage Limits (Configured via SearchOrchestratorConfig):

MAX_CANDIDATES_STAGE_A = 100  // Down from 200
MAX_CANDIDATES_STAGE_B = 50   // Down from 100
MAX_FINALISTS = 20            // Down from 30
RERANK_MIN_FINALISTS = 3      // Skip rerank if < 3

Savings: ~200-300ms per request

4. Exclude List Pruning

Problem: Long conversations exhaust product pool

Solution: Keep only last 30 excludes (FIFO)

if (excludeIds.length > 30) {
  excludeIds = excludeIds.slice(-30)
}

5. Smart Quality Fallbacks

Preferred Threshold: 0.5 (high quality)
Minimum Threshold: 0.3 (fallback)

if (highQualityProducts.length < 3) {
  return mediumQualityProducts // Fallback
}

6. Repetition Detection

Stops streaming if AI loops:

  • Consecutive: 3+ same words in a row
  • Frequent: 3+ occurrences in 20-word window
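
A sketch implementing both rules (tokenization and case handling are assumptions; a production version would likely ignore stopwords):

function isRepeating(text: string): boolean {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean)

  // Consecutive: 3+ identical words in a row
  for (let i = 2; i < words.length; i++) {
    if (words[i] === words[i - 1] && words[i - 1] === words[i - 2]) return true
  }

  // Frequent: any word occurring 3+ times within the trailing 20-word window
  const counts = new Map<string, number>()
  for (const w of words.slice(-20)) {
    const n = (counts.get(w) ?? 0) + 1
    if (n >= 3) return true
    counts.set(w, n)
  }
  return false
}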

7. Token Limit Monitoring

  • Max tokens: 2500
  • Warning at 90% utilization
  • Graceful cut-off handling
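
A minimal sketch of the budget check; only the 2500-token cap and the 90% warning level come from this document:

const MAX_COMPLETION_TOKENS = 2500

function checkTokenBudget(tokensUsed: number): void {
  const utilization = tokensUsed / MAX_COMPLETION_TOKENS
  // Past ~90% the response risks being cut off, so warn early
  if (utilization > 0.9) {
    console.warn(`Token utilization at ${Math.round(utilization * 100)}%`)
  }
}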

Configuration & Toggles

Environment Variables

# Execution Mode
PARALLEL_EXECUTION_ENABLE=true # false = sequential (legacy)

# Context Management
PHASE5_CONTEXT_ENABLE=true # Enable context persistence

# Debug & Logging
CHAT_DEBUG_LOGS=true # Verbose logging
NODE_ENV=production # Production/development

# AI Models
OPENAI_API_KEY=sk-... # GPT-5.1 API key

# Database
NEXT_PUBLIC_CONVEX_URL=https://... # Convex backend

Search Orchestrator Config

File: orchestrators/search-orchestrator.config.ts

export class SearchOrchestratorConfig {
  // Phase Toggles
  static PHASE2_ENABLED = true;  // Multi-stage funnel
  static PHASE3_ENABLED = true;  // LLM reranking
  static PHASE4_ENABLED = true;  // Diversity selection
  static PHASE6_ENABLED = true;  // Estonian boost

  // Stage Limits (Performance Tuning)
  static MAX_CANDIDATES_STAGE_A = 100;
  static MAX_CANDIDATES_STAGE_B = 50;
  static MAX_FINALISTS = 20;
  static MAX_PER_CATEGORY = 5;

  // Quality Thresholds
  static PREFERRED_QUALITY_THRESHOLD = 0.5;
  static MINIMUM_QUALITY_THRESHOLD = 0.3;
  static RERANK_MIN_FINALISTS = 3;

  // Diagnostics
  static DIAGNOSTICS_ENABLED = false;
  static AUTHOR_SPLIT_REGEX = /[,;]/;
  static SHOW_MORE_REGEX = /\b(näita\s+rohkem|show\s+more|veel|more)\b/i;
}

Response Configuration

File: app/chat/config.ts

export const chatConfig = {
  productDescriptions: {
    maxWords: 250,           // Max words per response
    sentencesPerProduct: 3,  // Sentences per product description
  },
}

Data Flow Examples

Example 1: Show More Products
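
A hedged walkthrough of a "show more" turn in code, combining behaviors documented above (SHOW_MORE_REGEX, taxonomy preservation for show_more_products, exclude pruning) and reusing GiftContext and runSearchPipeline from earlier sketches; the helpers are hypothetical:

declare function fetchStoredContext(conversationId: string): Promise<GiftContext>
declare const lastShownProductIds: string[]

async function handleShowMore(conversationId: string, userMessage: string) {
  // "näita rohkem" / "show more" matches SHOW_MORE_REGEX
  if (!SearchOrchestratorConfig.SHOW_MORE_REGEX.test(userMessage)) return null

  const stored = await fetchStoredContext(conversationId)
  const ctx: GiftContext = {
    ...stored, // taxonomy is preserved for the show_more_products intent
    excludeIds: [...stored.excludeIds, ...lastShownProductIds].slice(-30),
  }
  // Re-run the pipeline; previously shown products are excluded
  return runSearchPipeline(userMessage, ctx)
}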

Example 2: Vague Query with Clarification

Example 3: Author Clarification Workflow


Appendix: Key Files

Orchestrators

  • app/api/chat/orchestrators/parallel-orchestrator.ts - Optimized flow
  • app/api/chat/orchestrators/context-orchestrator/orchestrate.ts - Context extraction
  • app/api/chat/orchestrators/search-orchestrator.ts - Search pipeline
  • app/api/chat/orchestrators/response-orchestrator.ts - AI response generation

Handlers

  • app/api/chat/handlers/handler-router.ts - Intent-based routing
  • app/api/chat/handlers/product-search-handler.ts - Product search flow
  • app/api/chat/handlers/clarifying-question-handler.ts - Clarification flow
  • app/api/chat/handlers/conversational-handler.ts - Conversational flow

Services

  • app/api/chat/services/query-rewriting/ - Query generation
  • app/api/chat/services/product-search.ts - Multi-search execution
  • app/api/chat/services/funnel.ts - Multi-stage filtering
  • app/api/chat/services/rerank.ts - Semantic reranking
  • app/api/chat/services/diversity.ts - Final selection
  • app/api/chat/services/language.ts - Estonian boost

Configuration

  • app/api/chat/orchestrators/search-orchestrator.config.ts - Search config
  • app/chat/config.ts - Response config

Glossary

Term                   | Definition
TTFC                   | Time To First Chunk - time until the user sees the first AI response
Context Orchestration  | Extract and manage conversation state
Search Orchestration   | Multi-phase product search pipeline
Response Orchestration | AI response generation and streaming
Parallel Execution     | Non-blocking context extraction with immediate feedback
Sequential Execution   | Blocking context extraction before streaming
Funnel                 | Multi-stage candidate filtering (Stage A → B → C)
Reranking              | LLM-based semantic scoring for relevance
Diversity Selection    | Category and price distribution balancing
Skeleton               | Empty product card placeholders for instant feedback
Delayed Cards          | Product injection after AI text starts streaming
Context Preservation   | Taxonomy inheritance for follow-up queries
Exclude List           | Previously shown product IDs to avoid duplicates
Smart Suggestions      | Category buttons for quick navigation

Last Updated: 2025-11-16
Version: 1.0
Maintainer: AI Orchestration Team