Models & AI Configuration

This document describes which AI models are used at each stage of the pipeline and why they were chosen.

Frontend Pipeline Models

Context Fast-Path & Main Extraction

Model: llama-4-scout-17b-16e-instruct via Groq

Location:

  • context-understanding/config.ts
  • context-understanding/index.ts:255-330, 620-700

Purpose:

  • Extract intent from user queries
  • Parse taxonomy (product type, category)
  • Extract budget constraints
  • Detect conversation context

Selection Rationale:

  • Speed: <1s TTFT (Time to First Token) for responsive UX
  • Cost: Lower cost than GPT-4 class models
  • Reliability: Consistently outputs valid JSON with structured extraction
  • Testing: Supports seeded deterministic mode for reproducible tests

Configuration:

{
  model: 'llama-4-scout-17b-16e-instruct',
  temperature: 0.1,
  response_format: { type: 'json_object' },
  seed: process.env.NODE_ENV === 'test' ? 42 : undefined
}
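
A minimal sketch of how this configuration might be used with the groq-sdk client (the prompt text and the ExtractedContext shape are illustrative, not the actual contents of prompts.ts or context-understanding/):

import Groq from 'groq-sdk';

// Illustrative output shape; the real extraction schema lives in context-understanding/.
export interface ExtractedContext {
  intent: string;
  productType?: string;
  category?: string;
  budgetMax?: number;
}

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

export async function extractContext(userQuery: string): Promise<ExtractedContext> {
  const completion = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    temperature: 0.1,
    response_format: { type: 'json_object' },
    // Seeded in tests so extraction output is reproducible.
    seed: process.env.NODE_ENV === 'test' ? 42 : undefined,
    messages: [
      { role: 'system', content: 'Extract intent, product taxonomy and budget as JSON.' },
      { role: 'user', content: userQuery },
    ],
  });

  // The json_object response format guarantees parseable output.
  return JSON.parse(completion.choices[0].message.content ?? '{}');
}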

Response Generation

Model: gpt-5.1-chat-latest

Location:

  • app/api/chat/orchestrators/response-orchestrator.ts
  • app/api/chat/services/ai-response.ts
  • handlers/query-refinement-handler.ts

Purpose:

  • Generate fluent multilingual responses
  • Format numbered product recommendations
  • Handle delayed product card insertion
  • Maintain conversation tone

Selection Rationale:

  • Quality: Superior language generation and formatting
  • Multilingual: Native Estonian and English support
  • Reasoning: Better at following complex formatting rules
  • Priority: Uses "high" priority tier to minimize latency

Configuration:

{
  model: 'gpt-5.1-chat-latest',
  temperature: 0.7,
  max_tokens: 2500,
  priority: 'high',
  stream: true
}
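
A minimal streaming sketch with the openai SDK using this configuration. The system prompt is illustrative, and the "high" priority tier is assumed to be applied inside ai-response.ts rather than shown here:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function streamResponse(
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
  onChunk: (text: string) => void,
) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-5.1-chat-latest',
    temperature: 0.7,
    max_tokens: 2500,
    stream: true,
    messages,
  });

  // Forward each token delta as it arrives so the UI can render incrementally.
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) onChunk(delta);
  }
}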

Pipeline Metrics Label

Label: openai/gpt-oss-20b

Location: pipeline.ts:934-944

Purpose:

  • Tag frontend-observed latencies in metrics
  • Distinguish frontend vs backend timings
  • Historical tracking and analysis

Note: This is a reporting label only and does not drive actual model selection. Real generation uses GPT-5.1 as described above.
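
A hypothetical illustration of how the label might be attached to a timing record (the helper and field names are not the actual pipeline.ts code):

// Hypothetical metrics record; pipeline.ts emits something analogous around lines 934-944.
interface PipelineTiming {
  stage: string;
  modelLabel: string; // reporting label only, not the model actually invoked
  durationMs: number;
}

function recordFrontendTiming(stage: string, durationMs: number): PipelineTiming {
  return { stage, modelLabel: 'openai/gpt-oss-20b', durationMs };
}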

Backend Pipeline Models

Context Detection (Phase 0)

Model: llama-4-scout-17b-16e-instruct via Groq

Location:

  • context-understanding/config.ts
  • context-understanding/index.ts:255-330, 620-700

Purpose:

  • Full context extraction with conversation history
  • Intent classification
  • Taxonomy and constraint parsing
  • Budget detection

Selection Rationale:

  • Consistency: Same model as fast-path for predictable behavior
  • Fast classifier integration: Shares caching layer
  • Cost-effective: Handles high volume at reasonable cost
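
A hypothetical sketch of the shared caching layer, keying extraction results on conversation history plus the latest query (the cache shape and key derivation are assumptions; extractContext is the Groq-backed call sketched in the frontend section):

import { createHash } from 'crypto';

// Assumed in-memory cache shared by the fast-path and the full Phase 0 extraction.
const contextCache = new Map<string, ExtractedContext>();

export async function getContext(history: string[], query: string): Promise<ExtractedContext> {
  const key = createHash('sha256').update(JSON.stringify([history, query])).digest('hex');

  const cached = contextCache.get(key);
  if (cached) return cached;

  const context = await extractContext([...history, query].join('\n'));
  contextCache.set(key, context);
  return context;
}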

Query Rewriting & Multi-Search (Phase 1-2)

Approach: Deterministic/Rule-based

Location:

  • services/query-rewriting
  • services/product-search

Purpose:

  • Generate query variations (see the sketch below)
  • Execute parallel Convex searches
  • Apply filters and constraints

Selection Rationale:

  • Predictability: No LLM variability in search queries
  • Latency: Faster than LLM-based rewriting
  • Reliability: Avoids query drift and hallucination
  • Cost: Zero LLM costs for this phase
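
A minimal sketch of the deterministic approach referenced above: variants are produced by simple rules over the extracted context rather than by an LLM (the rule set shown is illustrative, not the actual services/query-rewriting logic):

// Illustrative rules only; the real rules live in services/query-rewriting.
export function buildQueryVariants(ctx: {
  query: string;
  productType?: string;
  category?: string;
  budgetMax?: number;
}): string[] {
  const variants = new Set<string>([ctx.query]);

  if (ctx.productType) variants.add(ctx.productType);
  if (ctx.category && ctx.productType) variants.add(`${ctx.category} ${ctx.productType}`);
  if (ctx.budgetMax) variants.add(`${ctx.query} under ${ctx.budgetMax}`);

  // Each variant is searched in parallel against Convex in Phase 2.
  return [...variants];
}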

Semantic Rerank (Phase 3)

Model: llama-4-scout-17b-16e-instruct via Groq

Location: services/rerank.ts:27-214, 219-266

Purpose:

  • Score funnel finalists for gift-fit (see the sketch below)
  • Semantic relevance ranking
  • Context-aware filtering

Selection Rationale:

  • Fast reasoning: Quick scoring of 10-20 candidates
  • Cost-efficient: Cheaper than Cohere Rerank API at scale
  • Context-aware: Can consider full gift context in scoring
  • Streaming: Can process results as they arrive

Alternative Considered:

  • Cohere Rerank v3.5: Better quality but higher cost and latency
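
A minimal sketch of LLM-based scoring along these lines (the prompt, score scale, and candidate shape are assumptions; the real logic is in services/rerank.ts):

import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

interface Candidate {
  id: string;
  title: string;
}

// Ask the same Groq model to score each finalist for gift-fit, then sort by score.
export async function rerank(giftContext: string, candidates: Candidate[]): Promise<Candidate[]> {
  const completion = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    temperature: 0.1,
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: 'Score each product 0-10 for gift-fit. Reply as JSON: {"scores": {"<id>": <number>}}',
      },
      { role: 'user', content: `Context: ${giftContext}\nCandidates: ${JSON.stringify(candidates)}` },
    ],
  });

  const { scores } = JSON.parse(completion.choices[0].message.content ?? '{"scores":{}}');
  return [...candidates].sort((a, b) => (scores[b.id] ?? 0) - (scores[a.id] ?? 0));
}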

Diversity Selection (Phase 4)

Approach: Heuristic (no model)

Location: services/diversity.ts

Purpose:

  • Enforce category/price distribution (see the sketch below)
  • Apply strict gift-card handling rules
  • Balance product variety

Selection Rationale:

  • Deterministic: Consistent results every time
  • Fast: No model inference overhead
  • Controlled: Avoids LLM variability in final selection
  • Rule-based: Easy to debug and adjust
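
A minimal heuristic sketch, as referenced above: take items in ranked order while capping repeats per category, with a strict cap on gift cards (the limits shown are illustrative):

interface RankedProduct {
  id: string;
  category: string;
  isGiftCard: boolean;
}

// Pick top-ranked items while limiting per-category repeats and gift cards.
export function selectDiverse(ranked: RankedProduct[], limit = 5, perCategory = 2): RankedProduct[] {
  const picked: RankedProduct[] = [];
  const byCategory = new Map<string, number>();

  for (const product of ranked) {
    if (picked.length >= limit) break;

    const count = byCategory.get(product.category) ?? 0;
    if (count >= perCategory) continue;

    // Illustrative strict rule: at most one gift card in the final selection.
    if (product.isGiftCard && picked.some((p) => p.isGiftCard)) continue;

    picked.push(product);
    byCategory.set(product.category, count + 1);
  }

  return picked;
}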

Response Generation (Phase 5)

Model: gpt-5.1-chat-latest (see Frontend section)

Purpose:

  • Narrate search results
  • Follow language-specific rules
  • Format product descriptions

Model Comparison Table

Phase               Model              Provider  Latency    Cost    Rationale
Context Extraction  LLaMA 4 Scout 17B  Groq      fast       Low     Fast, structured JSON
Query Rewriting     Rule-based         -         <50ms      Zero    Predictable, no drift
Search              Convex             -         ~200ms     Low     Native DB search
Rerank              LLaMA 4 Scout 17B  Groq      ~150ms     Low     Fast scoring
Diversity           Heuristic          -         <10ms      Zero    Deterministic rules
Generation          GPT-5.1            OpenAI    very fast  Medium  High quality output

Configuration Management

All model configurations are centralized in:

app/api/chat/
├── config/
│   ├── models.ts                 # Model selection logic
│   └── prompts.ts                # Prompt templates
├── services/
│   ├── context-understanding/
│   │   └── config.ts             # LLaMA config
│   └── ai-response.ts            # GPT config

Environment Variables

# API Keys
GROQ_API_KEY=gsk_... # LLaMA 4 Scout access
OPENAI_API_KEY=sk-... # GPT-5.1 access

# Model Overrides (optional)
CONTEXT_MODEL=llama-4-scout-17b-16e-instruct
GENERATION_MODEL=gpt-5.1-chat-latest

# Performance Tuning
CONTEXT_TIMEOUT_MS=2000
GENERATION_MAX_TOKENS=2500
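
A minimal sketch of how config/models.ts might read these overrides, falling back to the documented defaults (the actual selection logic may differ):

// Optional overrides; defaults match the values documented above.
export const CONTEXT_MODEL =
  process.env.CONTEXT_MODEL ?? 'llama-4-scout-17b-16e-instruct';

export const GENERATION_MODEL =
  process.env.GENERATION_MODEL ?? 'gpt-5.1-chat-latest';

export const CONTEXT_TIMEOUT_MS = Number(process.env.CONTEXT_TIMEOUT_MS ?? 2000);
export const GENERATION_MAX_TOKENS = Number(process.env.GENERATION_MAX_TOKENS ?? 2500);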

Cost Optimization

Estimated costs per 1000 requests:

Component           Model          Cost
Context extraction  LLaMA 4 Scout  $0.50
Reranking           LLaMA 4 Scout  $0.30
Generation          GPT-5.1        $2.00
Total                              $2.80

Future Considerations

Model Upgrade Path

  1. GPT-5 Turbo - When available, for even faster generation
  2. Custom Fine-tuned Model - For gift-specific context extraction
  3. Cohere Rerank - If cost becomes acceptable at scale

A/B Testing

Track these metrics when evaluating new models:

  • TTFC (Time to First Chunk)
  • Context extraction accuracy
  • User satisfaction scores
  • Cost per request
  • Error rates