Models & AI Configuration

This document describes which AI models are used at each stage of the pipeline and why they were chosen.

Frontend Pipeline Models

Context Fast-Path & Main Extraction

Model: llama-4-scout-17b-16e-instruct via Groq

Location:

  • context-understanding/config.ts
  • context-understanding/index.ts:255-330, 620-700

Purpose:

  • Extract intent from user queries
  • Parse taxonomy (product type, category)
  • Extract budget constraints
  • Detect conversation context

Selection Rationale:

  • Speed: <1s TTFT (Time to First Token) for responsive UX
  • Cost: Lower cost than GPT-4 class models
  • Reliability: Consistently outputs valid JSON with structured extraction
  • Testing: Supports seeded deterministic mode for reproducible tests

Configuration:

{
  model: 'llama-4-scout-17b-16e-instruct',
  temperature: 0.1,
  response_format: { type: 'json_object' },
  seed: process.env.NODE_ENV === 'test' ? 42 : undefined
}
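
A minimal sketch of how this configuration might be used with the groq-sdk client (the prompt text and the ExtractedContext shape are illustrative, not the actual contents of prompts.ts or context-understanding/):

import Groq from 'groq-sdk';

// Illustrative output shape; the real extraction schema lives in context-understanding/.
export interface ExtractedContext {
  intent: string;
  productType?: string;
  category?: string;
  budgetMax?: number;
}

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

export async function extractContext(userQuery: string): Promise<ExtractedContext> {
  const completion = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    temperature: 0.1,
    response_format: { type: 'json_object' },
    // Seeded in tests so extraction output is reproducible.
    seed: process.env.NODE_ENV === 'test' ? 42 : undefined,
    messages: [
      { role: 'system', content: 'Extract intent, product taxonomy and budget as JSON.' },
      { role: 'user', content: userQuery },
    ],
  });

  // The json_object response format guarantees parseable output.
  return JSON.parse(completion.choices[0].message.content ?? '{}');
}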

Response Generation

Model: gpt-5.1-chat-latest

Location:

  • app/api/chat/orchestrators/response-orchestrator.ts
  • app/api/chat/services/ai-response.ts
  • handlers/query-refinement-handler.ts

Purpose:

  • Generate fluent multilingual responses
  • Format numbered product recommendations
  • Handle delayed product card insertion
  • Maintain conversation tone

Selection Rationale:

  • Quality: Superior language generation and formatting
  • Multilingual: Native Estonian and English support
  • Reasoning: Better at following complex formatting rules
  • Priority: Uses "high" priority tier to minimize latency

Configuration:

{
  model: 'gpt-5.1-chat-latest',
  temperature: 0.7,
  max_tokens: 2500,
  priority: 'high',
  stream: true
}
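
A minimal streaming sketch with the openai SDK using this configuration. The system prompt is illustrative, and the "high" priority tier is assumed to be applied inside ai-response.ts rather than shown here:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function streamResponse(
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
  onChunk: (text: string) => void,
) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-5.1-chat-latest',
    temperature: 0.7,
    max_tokens: 2500,
    stream: true,
    messages,
  });

  // Forward each token delta as it arrives so the UI can render incrementally.
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) onChunk(delta);
  }
}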

Pipeline Metrics Label

Label: openai/gpt-oss-20b

Location: pipeline.ts:934-944

Purpose:

  • Tag frontend-observed latencies in metrics
  • Distinguish frontend vs backend timings
  • Historical tracking and analysis

Note: This is a reporting label only and does not drive actual model selection. Real generation uses GPT-5.1 as described above.
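
A hypothetical illustration of how the label might be attached to a timing record (the helper and field names are not the actual pipeline.ts code):

// Hypothetical metrics record; pipeline.ts emits something analogous around lines 934-944.
interface PipelineTiming {
  stage: string;
  modelLabel: string; // reporting label only, not the model actually invoked
  durationMs: number;
}

function recordFrontendTiming(stage: string, durationMs: number): PipelineTiming {
  return { stage, modelLabel: 'openai/gpt-oss-20b', durationMs };
}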

Backend Pipeline Models

Context Detection (Phase 0)

Model: llama-4-scout-17b-16e-instruct via Groq

Location:

  • context-understanding/config.ts
  • context-understanding/index.ts:255-330, 620-700

Purpose:

  • Full context extraction with conversation history
  • Intent classification
  • Taxonomy and constraint parsing
  • Budget detection

Selection Rationale:

  • Consistency: Same model as fast-path for predictable behavior
  • Fast classifier integration: Shares caching layer
  • Cost-effective: Handles high volume at reasonable cost
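
A hypothetical sketch of the shared caching layer, keying extraction results on conversation history plus the latest query (the cache shape and key derivation are assumptions; extractContext is the Groq-backed call sketched in the frontend section):

import { createHash } from 'crypto';

// Assumed in-memory cache shared by the fast-path and the full Phase 0 extraction.
const contextCache = new Map<string, ExtractedContext>();

export async function getContext(history: string[], query: string): Promise<ExtractedContext> {
  const key = createHash('sha256').update(JSON.stringify([history, query])).digest('hex');

  const cached = contextCache.get(key);
  if (cached) return cached;

  const context = await extractContext([...history, query].join('\n'));
  contextCache.set(key, context);
  return context;
}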

Query Rewriting & Multi-Search (Phase 1-2)

Approach: Deterministic/Rule-based

Location:

  • services/query-rewriting
  • services/product-search

Purpose:

  • Generate query variations (see the sketch below)
  • Execute parallel Convex searches
  • Apply filters and constraints

Selection Rationale:

  • Predictability: No LLM variability in search queries
  • Latency: Faster than LLM-based rewriting
  • Reliability: Avoids query drift and hallucination
  • Cost: Zero LLM costs for this phase
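
A minimal sketch of the deterministic approach referenced above: variants are produced by simple rules over the extracted context rather than by an LLM (the rule set shown is illustrative, not the actual services/query-rewriting logic):

// Illustrative rules only; the real rules live in services/query-rewriting.
export function buildQueryVariants(ctx: {
  query: string;
  productType?: string;
  category?: string;
  budgetMax?: number;
}): string[] {
  const variants = new Set<string>([ctx.query]);

  if (ctx.productType) variants.add(ctx.productType);
  if (ctx.category && ctx.productType) variants.add(`${ctx.category} ${ctx.productType}`);
  if (ctx.budgetMax) variants.add(`${ctx.query} under ${ctx.budgetMax}`);

  // Each variant is searched in parallel against Convex in Phase 2.
  return [...variants];
}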

Semantic Rerank (Phase 3)

Model: llama-4-scout-17b-16e-instruct via Groq

Location: services/rerank.ts:27-214, 219-266

Purpose:

  • Score funnel finalists for gift-fit (see the sketch below)
  • Semantic relevance ranking
  • Context-aware filtering

Selection Rationale:

  • Fast reasoning: Quick scoring of 10-20 candidates
  • Cost-efficient: Cheaper than Cohere Rerank API at scale
  • Context-aware: Can consider full gift context in scoring
  • Streaming: Can process results as they arrive

Alternative Considered:

  • Cohere Rerank v3.5: Better quality but higher cost and latency
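
A minimal sketch of LLM-based scoring along these lines (the prompt, score scale, and candidate shape are assumptions; the real logic is in services/rerank.ts):

import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

interface Candidate {
  id: string;
  title: string;
}

// Ask the same Groq model to score each finalist for gift-fit, then sort by score.
export async function rerank(giftContext: string, candidates: Candidate[]): Promise<Candidate[]> {
  const completion = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    temperature: 0.1,
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: 'Score each product 0-10 for gift-fit. Reply as JSON: {"scores": {"<id>": <number>}}',
      },
      { role: 'user', content: `Context: ${giftContext}\nCandidates: ${JSON.stringify(candidates)}` },
    ],
  });

  const { scores } = JSON.parse(completion.choices[0].message.content ?? '{"scores":{}}');
  return [...candidates].sort((a, b) => (scores[b.id] ?? 0) - (scores[a.id] ?? 0));
}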

Diversity Selection (Phase 4)

Approach: Heuristic (no model)

Location: services/diversity.ts

Purpose:

  • Enforce category/price distribution (see the sketch below)
  • Apply strict gift-card handling rules
  • Balance product variety

Selection Rationale:

  • Deterministic: Consistent results every time
  • Fast: No model inference overhead
  • Controlled: Avoids LLM variability in final selection
  • Rule-based: Easy to debug and adjust
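
A minimal heuristic sketch, as referenced above: take items in ranked order while capping repeats per category, with a strict cap on gift cards (the limits shown are illustrative):

interface RankedProduct {
  id: string;
  category: string;
  isGiftCard: boolean;
}

// Pick top-ranked items while limiting per-category repeats and gift cards.
export function selectDiverse(ranked: RankedProduct[], limit = 5, perCategory = 2): RankedProduct[] {
  const picked: RankedProduct[] = [];
  const byCategory = new Map<string, number>();

  for (const product of ranked) {
    if (picked.length >= limit) break;

    const count = byCategory.get(product.category) ?? 0;
    if (count >= perCategory) continue;

    // Illustrative strict rule: at most one gift card in the final selection.
    if (product.isGiftCard && picked.some((p) => p.isGiftCard)) continue;

    picked.push(product);
    byCategory.set(product.category, count + 1);
  }

  return picked;
}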

Response Generation (Phase 5)

Model: gpt-5.1-chat-latest (see Frontend section)

Purpose:

  • Narrate search results
  • Follow language-specific rules
  • Format product descriptions

Model Comparison Table

Phase               Model              Provider  Latency    Cost    Rationale
Context Extraction  LLaMA 4 Scout 17B  Groq      fast       Low     Fast, structured JSON
Query Rewriting     Rule-based         -         <50ms      Zero    Predictable, no drift
Search              Convex             -         ~200ms     Low     Native DB search
Rerank              LLaMA 4 Scout 17B  Groq      ~150ms     Low     Fast scoring
Diversity           Heuristic          -         <10ms      Zero    Deterministic rules
Generation          GPT-5.1            OpenAI    very fast  Medium  High quality output

Configuration Management

All model configurations are centralized in:

app/api/chat/
├── config/
│   ├── models.ts                 # Model selection logic
│   └── prompts.ts                # Prompt templates
├── services/
│   ├── context-understanding/
│   │   └── config.ts             # LLaMA config
│   └── ai-response.ts            # GPT config

Environment Variables

# API Keys
GROQ_API_KEY=gsk_... # LLaMA 4 Scout access
OPENAI_API_KEY=sk-... # GPT-5.1 access

# Model Overrides (optional)
CONTEXT_MODEL=llama-4-scout-17b-16e-instruct
GENERATION_MODEL=gpt-5.1-chat-latest

# Performance Tuning
CONTEXT_TIMEOUT_MS=2000
GENERATION_MAX_TOKENS=2500
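
A minimal sketch of how config/models.ts might read these overrides, falling back to the documented defaults (the actual selection logic may differ):

// Optional overrides; defaults match the values documented above.
export const CONTEXT_MODEL =
  process.env.CONTEXT_MODEL ?? 'llama-4-scout-17b-16e-instruct';

export const GENERATION_MODEL =
  process.env.GENERATION_MODEL ?? 'gpt-5.1-chat-latest';

export const CONTEXT_TIMEOUT_MS = Number(process.env.CONTEXT_TIMEOUT_MS ?? 2000);
export const GENERATION_MAX_TOKENS = Number(process.env.GENERATION_MAX_TOKENS ?? 2500);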

Cost Optimization

Estimated costs per 1000 requests:

Component           Model          Cost
Context extraction  LLaMA 4 Scout  $0.50
Reranking           LLaMA 4 Scout  $0.30
Generation          GPT-5.1        $2.00
Total                              $2.80

Future Considerations

Model Upgrade Path

  1. GPT-5 Turbo - When available, for even faster generation
  2. Custom Fine-tuned Model - For gift-specific context extraction
  3. Cohere Rerank - If cost becomes acceptable at scale

A/B Testing

Track these metrics when evaluating new models:

  • TTFC (Time to First Chunk)
  • Context extraction accuracy
  • User satisfaction scores
  • Cost per request
  • Error rates