Models & AI Configuration
This document describes which AI models are used at each stage of the pipeline and why they were chosen.
Frontend Pipeline Models
Context Fast-Path & Main Extraction
Model: llama-4-scout-17b-16e-instruct via Groq
Location:
- context-understanding/config.ts
- context-understanding/index.ts:255-330, 620-700
Purpose:
- Extract intent from user queries
- Parse taxonomy (product type, category)
- Extract budget constraints
- Detect conversation context
Selection Rationale:
- Speed: <1s TTFT (Time to First Token) for responsive UX
- Cost: Lower cost than GPT-4 class models
- Reliability: Consistently outputs valid JSON with structured extraction
- Testing: Supports seeded deterministic mode for reproducible tests
Configuration:
{
model: 'llama-4-scout-17b-16e-instruct',
temperature: 0.1,
response_format: { type: 'json_object' },
seed: process.env.NODE_ENV === 'test' ? 42 : undefined
}
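For illustration, a minimal sketch of what this fast-path extraction call might look like with the groq-sdk client. The prompt text, function name, and ExtractedContext shape are assumptions for the example, not code from the repository.

```typescript
// Minimal sketch of the fast-path extraction call, assuming the groq-sdk client.
// EXTRACTION_SYSTEM_PROMPT and ExtractedContext are illustrative names only.
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

interface ExtractedContext {
  intent: string;
  productType?: string;
  category?: string;
  budget?: { min?: number; max?: number };
}

const EXTRACTION_SYSTEM_PROMPT =
  'Extract intent, taxonomy (product type, category) and budget from the user query. Respond with JSON only.';

export async function extractContext(query: string): Promise<ExtractedContext> {
  const completion = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    temperature: 0.1,
    response_format: { type: 'json_object' },
    seed: process.env.NODE_ENV === 'test' ? 42 : undefined, // deterministic in tests
    messages: [
      { role: 'system', content: EXTRACTION_SYSTEM_PROMPT },
      { role: 'user', content: query },
    ],
  });

  // The model is instructed to return a single JSON object.
  return JSON.parse(completion.choices[0]?.message?.content ?? '{}') as ExtractedContext;
}
```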
Response Generation
Model: gpt-5.1-chat-latest
Location:
- app/api/chat/orchestrators/response-orchestrator.ts
- app/api/chat/services/ai-response.ts
- handlers/query-refinement-handler.ts
Purpose:
- Generate fluent multilingual responses
- Format numbered product recommendations
- Handle delayed product card insertion
- Maintain conversation tone
Selection Rationale:
- Quality: Superior language generation and formatting
- Multilingual: Native Estonian and English support
- Reasoning: Better at following complex formatting rules
- Priority: Uses "high" priority tier to minimize latency
Configuration:
{
model: 'gpt-5.1-chat-latest',
temperature: 0.7,
max_tokens: 2500,
priority: 'high',
stream: true
}
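As a rough sketch, streamed generation with the official openai Node SDK could look like the following. The callback and message typing are illustrative; the `priority: 'high'` setting from the config above is omitted because the exact API field for the priority tier is project-specific, and newer OpenAI models may expect `max_completion_tokens` instead of `max_tokens`.

```typescript
// Minimal sketch of streamed response generation with the openai Node SDK.
// The onChunk callback is a stand-in for the real SSE/event writer.
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function streamResponse(
  messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
  onChunk: (text: string) => void,
): Promise<void> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-5.1-chat-latest',
    temperature: 0.7,
    max_tokens: 2500, // per the config above; some newer models require max_completion_tokens
    stream: true,
    messages,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) onChunk(delta); // forward each streamed chunk to the client
  }
}
```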
Pipeline Metrics Label
Label: openai/gpt-oss-20b
Location: pipeline.ts:934-944
Purpose:
- Tag frontend-observed latencies in metrics
- Distinguish frontend vs backend timings
- Historical tracking and analysis
Note: This is a reporting label only and does not drive actual model selection. Real generation uses GPT-5.1 as described above.
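Purely for illustration, attaching this label to a frontend-observed latency metric might look like the sketch below; the recordLatency helper and metric name are hypothetical, not the actual pipeline.ts code.

```typescript
// Illustrative only: tagging a frontend latency metric with the reporting label.
const FRONTEND_METRICS_LABEL = 'openai/gpt-oss-20b';

function recordLatency(metric: string, ms: number, tags: Record<string, string>): void {
  console.log(JSON.stringify({ metric, ms, tags })); // stand-in for the real metrics sink
}

const startedAt = Date.now();
// ... frontend-observed generation happens here ...
recordLatency('chat.generation.latency_ms', Date.now() - startedAt, {
  model: FRONTEND_METRICS_LABEL, // reporting label only; real generation uses GPT-5.1
  origin: 'frontend',
});
```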
Backend Pipeline Models
Context Detection (Phase 0)
Model: llama-4-scout-17b-16e-instruct via Groq
Location:
- context-understanding/config.ts
- context-understanding/index.ts:255-330, 620-700
Purpose:
- Full context extraction with conversation history
- Intent classification
- Taxonomy and constraint parsing
- Budget detection
Selection Rationale:
- Consistency: Same model as fast-path for predictable behavior
- Fast classifier integration: Shares caching layer
- Cost-effective: Handles high volume at reasonable cost
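A small sketch of how Phase 0 might fold conversation history into the same extraction call; the Turn type, history window, and prompt text are assumptions rather than the repository's implementation.

```typescript
// Illustrative: building the Phase 0 extraction prompt with recent conversation turns.
type Turn = { role: 'user' | 'assistant'; content: string };

function buildExtractionMessages(history: Turn[], query: string) {
  return [
    { role: 'system' as const, content: 'Extract intent, taxonomy, constraints and budget as JSON.' },
    ...history.slice(-6), // keep only the most recent turns to bound prompt size (assumed window)
    { role: 'user' as const, content: query },
  ];
}
```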
Query Rewriting & Multi-Search (Phase 1-2)
Approach: Deterministic/Rule-based
Location:
- services/query-rewriting
- services/product-search
Purpose:
- Generate query variations
- Execute parallel Convex searches
- Apply filters and constraints
Selection Rationale:
- Predictability: No LLM variability in search queries
- Latency: Faster than LLM-based rewriting
- Reliability: Avoids query drift and hallucination
- Cost: Zero LLM costs for this phase
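A minimal sketch of the rule-based approach, assuming a searchProducts stand-in for the Convex call; the variation rules shown are illustrative, not the exact ones in services/query-rewriting.

```typescript
// Illustrative rule-based query variation plus parallel search fan-out.
interface SearchContext {
  productType?: string;
  category?: string;
  recipient?: string;
}

function buildQueryVariations(ctx: SearchContext, query: string): string[] {
  const variations = new Set<string>([query]);
  if (ctx.productType) variations.add(ctx.productType);
  if (ctx.productType && ctx.category) variations.add(`${ctx.category} ${ctx.productType}`);
  if (ctx.productType && ctx.recipient) variations.add(`${ctx.productType} for ${ctx.recipient}`);
  return [...variations];
}

async function searchAll(
  ctx: SearchContext,
  query: string,
  searchProducts: (q: string) => Promise<unknown[]>, // stand-in for the Convex search call
): Promise<unknown[]> {
  const results = await Promise.all(buildQueryVariations(ctx, query).map(searchProducts));
  return results.flat(); // downstream phases dedupe, filter and rerank
}
```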
Semantic Rerank (Phase 3)
Model: llama-4-scout-17b-16e-instruct via Groq
Location: services/rerank.ts:27-214, 219-266
Purpose:
- Score funnel finalists for gift-fit
- Semantic relevance ranking
- Context-aware filtering
Selection Rationale:
- Fast reasoning: Quick scoring of 10-20 candidates
- Cost-efficient: Cheaper than Cohere Rerank API at scale
- Context-aware: Can consider full gift context in scoring
- Streaming: Can process results as they arrive
Alternative Considered:
- Cohere Rerank v3.5: Better quality but higher cost and latency
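A hedged sketch of how LLM-based rerank scoring could be wired with the same Groq model; the prompt, Candidate shape, and score format are assumptions rather than the contents of services/rerank.ts.

```typescript
// Illustrative LLM rerank: ask the model for per-candidate gift-fit scores, then sort.
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

interface Candidate {
  id: string;
  title: string;
  price: number;
}

export async function rerank(giftContext: string, candidates: Candidate[]): Promise<Candidate[]> {
  const completion = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    temperature: 0.1,
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content:
          'Score each candidate 0-10 for gift-fit given the context. Respond as JSON: {"scores": {"<id>": <number>}}.',
      },
      { role: 'user', content: JSON.stringify({ giftContext, candidates }) },
    ],
  });

  const { scores = {} } = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
  return [...candidates].sort((a, b) => (scores[b.id] ?? 0) - (scores[a.id] ?? 0));
}
```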
Diversity Selection (Phase 4)
Approach: Heuristic (no model)
Location: services/diversity.ts
Purpose:
- Enforce category/price distribution
- Apply strict gift-card handling rules
- Balance product variety
Selection Rationale:
- Deterministic: Consistent results every time
- Fast: No model inference overhead
- Controlled: Avoids LLM variability in final selection
- Rule-based: Easy to debug and adjust
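A minimal sketch of a greedy diversity pass under assumed limits (at most two items per category, at most one gift card and never in the first slot); the actual rules in services/diversity.ts may differ.

```typescript
// Illustrative heuristic diversity pass over the reranked list.
interface Product {
  id: string;
  category: string;
  price: number;
  isGiftCard: boolean;
}

export function selectDiverse(ranked: Product[], limit = 6): Product[] {
  const perCategory = new Map<string, number>();
  const selected: Product[] = [];

  for (const product of ranked) {
    if (selected.length >= limit) break;
    // Assumed gift-card rule: at most one gift card, never in the first slot.
    if (product.isGiftCard && (selected.length === 0 || selected.some((p) => p.isGiftCard))) continue;
    // Assumed category cap keeps the list varied instead of clustering on one category.
    const count = perCategory.get(product.category) ?? 0;
    if (count >= 2) continue;
    perCategory.set(product.category, count + 1);
    selected.push(product);
  }
  return selected;
}
```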
Response Generation (Phase 5)
Model: gpt-5.1-chat-latest (see Frontend section)
Purpose:
- Narrate search results
- Follow language-specific rules
- Format product descriptions
Model Comparison Table
| Phase | Model | Provider | Latency | Cost | Rationale |
|---|---|---|---|---|---|
| Context Extraction | LLaMA 4 Scout 17B | Groq | <1s TTFT | Low | Fast, structured JSON |
| Query Rewriting | Rule-based | - | <50ms | Zero | Predictable, no drift |
| Search | Convex | - | ~200ms | Low | Native DB search |
| Rerank | LLaMA 4 Scout 17B | Groq | ~150ms | Low | Fast scoring |
| Diversity | Heuristic | - | <10ms | Zero | Deterministic rules |
| Generation | GPT-5.1 | OpenAI | very fast | Medium | High quality output |
Configuration Management
All model configurations are centralized in:
app/api/chat/
├── config/
│ ├── models.ts # Model selection logic
│ └── prompts.ts # Prompt templates
├── services/
│ ├── context-understanding/
│ │ └── config.ts # LLaMA config
│ └── ai-response.ts # GPT config
Environment Variables
# API Keys
GROQ_API_KEY=gsk_... # LLaMA 4 Scout access
OPENAI_API_KEY=sk-... # GPT-5.1 access
# Model Overrides (optional)
CONTEXT_MODEL=llama-4-scout-17b-16e-instruct
GENERATION_MODEL=gpt-5.1-chat-latest
# Performance Tuning
CONTEXT_TIMEOUT_MS=2000
GENERATION_MAX_TOKENS=2500
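A short sketch of how these optional overrides and tuning values might be read with defaults; the actual parsing in app/api/chat/config/ may differ.

```typescript
// Illustrative env-var handling: overrides fall back to the documented defaults.
export const CONTEXT_MODEL =
  process.env.CONTEXT_MODEL ?? 'llama-4-scout-17b-16e-instruct';
export const GENERATION_MODEL =
  process.env.GENERATION_MODEL ?? 'gpt-5.1-chat-latest';

export const CONTEXT_TIMEOUT_MS = Number(process.env.CONTEXT_TIMEOUT_MS ?? 2000);
export const GENERATION_MAX_TOKENS = Number(process.env.GENERATION_MAX_TOKENS ?? 2500);
```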
Cost Optimization
Estimated costs per 1000 requests:
| Component | Model | Cost |
|---|---|---|
| Context extraction | LLaMA 4 Scout | $0.50 |
| Reranking | LLaMA 4 Scout | $0.30 |
| Generation | GPT-5.1 | $2.00 |
| Total | | $2.80 |
Future Considerations
Model Upgrade Path
- GPT-5 Turbo - When available, for even faster generation
- Custom Fine-tuned Model - For gift-specific context extraction
- Cohere Rerank - If cost becomes acceptable at scale
A/B Testing
Track these metrics when evaluating new models:
- TTFC (Time to First Chunk)
- Context extraction accuracy
- User satisfaction scores
- Cost per request
- Error rates
Related Documentation
- Lifecycle & Flow - When each model is invoked
- Event Handling - How model outputs are streamed