Phase 5: Response Generation
Model Selection
Model: gpt-5.1-chat-latest
Provider: OpenAI
Temperature: 0.7
Max Tokens: 2500
Priority: High
Streaming: Enabled
Purpose
Generate natural, engaging responses that:
- Narrate search results: Explain why products are good fits
- Follow language rules: Native Estonian and English
- Format recommendations: Numbered lists, delayed product cards
- Maintain conversation tone: Friendly, helpful, professional
- Handle edge cases: Pool exhaustion, clarifications, errors
Why GPT-5.1?
Quality
- Language Generation: Superior fluency in both ET and EN
- Complex Instructions: Follows detailed formatting rules
- Reasoning: Understands gift-giving context nuances
- Creativity: Natural, non-robotic responses
Multilingual
- Native Estonian: Excellent grammar and idioms
- Native English: Professional tone
- Code-switching: Handles mixed language gracefully
- Cultural Context: Understands Estonian customs
Reasoning Power
- Gift Appropriateness: Explains why product fits
- Comparative Analysis: Highlights differences between products
- Tone Matching: Adapts to user's communication style
- Edge Case Handling: Generates helpful clarifications
Priority Tier
- High Priority: Minimizes provider-side queueing
- Low Latency: Faster TTFB, with the first chunk typically arriving in under 150ms
- Streaming: Immediate user feedback while the rest of the response generates
Configuration
Location: app/api/chat/orchestrators/response-orchestrator.ts
export const GENERATION_CONFIG = {
  model: 'gpt-5.1-chat-latest',
  temperature: 0.7,
  max_tokens: 2500,
  stream: true,
  priority: 'high',

  // Formatting rules
  delayedCardInjection: true,
  cardInjectionThreshold: 180, // chars emitted before injecting product cards

  // Safety
  repetitionDetection: true,
  tokenLimitWarning: 0.9 // warn at 90% of max_tokens
};
Implementation
Location: app/api/chat/services/ai-response.ts
// Chunks are either streamed text or a one-time card-injection event.
type ResponseChunk =
  | { type: 'text'; content: string }
  | { type: 'inject-cards'; products: Product[] };

async function* generateResponse(
  products: Product[],
  giftContext: GiftContext,
  conversationHistory: Message[]
): AsyncGenerator<ResponseChunk> {
  const systemPrompt = buildSystemPrompt(giftContext);
  const userPrompt = buildUserPrompt(products, giftContext);

  const stream = await openai.chat.completions.create({
    model: GENERATION_CONFIG.model,
    messages: [
      { role: 'system', content: systemPrompt },
      ...conversationHistory,
      { role: 'user', content: userPrompt }
    ],
    temperature: GENERATION_CONFIG.temperature,
    max_tokens: GENERATION_CONFIG.max_tokens,
    stream: true,
    service_tier: 'priority' // high-priority processing tier
  });

  let accumulated = '';
  let productCardsInjected = false;

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    accumulated += content;

    // Delayed card injection: once enough intro text has streamed,
    // tell the client to render the product cards
    if (!productCardsInjected &&
        accumulated.length >= GENERATION_CONFIG.cardInjectionThreshold) {
      yield { type: 'inject-cards', products };
      productCardsInjected = true;
    }

    yield { type: 'text', content };
  }
}
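For context, a minimal sketch of how a route can forward these chunks to the browser; the streamToClient and encodeEvent names and the SSE framing are assumptions for illustration, not the project's actual route code.

// Hypothetical consumer: forwards generator chunks as server-sent events.
function encodeEvent(chunk: ResponseChunk): Uint8Array {
  return new TextEncoder().encode(`data: ${JSON.stringify(chunk)}\n\n`);
}

export function streamToClient(
  products: Product[],
  context: GiftContext,
  history: Message[]
): Response {
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of generateResponse(products, context, history)) {
        controller.enqueue(encodeEvent(chunk));
      }
      controller.close();
    }
  });
  return new Response(body, {
    headers: { 'Content-Type': 'text/event-stream' }
  });
}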
Response Patterns
Product Recommendation
const PRODUCT_RECOMMENDATION_TEMPLATE = `
Based on your request for {occasion} gifts for {recipient},
I recommend these {count} products:
1. **{product1}** - {reason1}
2. **{product2}** - {reason2}
3. **{product3}** - {reason3}
{additional_context}
`;
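The templates use {placeholder} tokens. A minimal interpolation helper could look like the sketch below; the renderTemplate name and signature are assumptions for illustration.

// Hypothetical helper: replaces {key} tokens with values, leaving
// unknown placeholders intact so missing data stays visible in QA.
function renderTemplate(
  template: string,
  values: Record<string, string>
): string {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    key in values ? values[key] : match
  );
}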
Example Output (Estonian):
Sinu sünnipäevakingituse otsingu põhjal soovitan järgmisi
raamatuid 10-aastasele tüdrukule:
1. **Harry Potter ja tarkade kivi** - Põnev fantaasiaraamat,
mis sobib suurepäraselt selle vanuse lastele.
2. **Matilda** - Roald Dahli klassika, mis õpetab lugemist
armastama.
3. **Anne Shirley** - Seikluslik lugu tugevast tüdrukust,
inspireeriv lugemine.
Kõik need mahuvad sinu 30-eurosesse eelarvesse!
Clarifying Question
const CLARIFICATION_TEMPLATE = `
{polite_acknowledgment}
{clarification_question}
{helpful_suggestions}
`;
Example Output:
Tere! Meeleldi aitan kingituse leidmisel.
Kellele kingitust otsid?
- Perele
- Sõbrale
- Kolleegile
- Lapsele
Või kirjelda täpsemalt, mida otsid!
Pool Exhaustion Acknowledgment
const EXHAUSTION_TEMPLATE = `
{transparent_explanation}
{what_was_shown}
{alternative_suggestion}
`;
Example Output:
Oleme näidanud kõik eesti luule raamatud meie valikust!
Näitasin sulle {count} erinevat teost.
Kas soovid näha:
- Muude kategooriate raamatuid?
- Ingliskeelseid luulekogusid?
- Teisi eesti kirjanduse žanreid?
Delayed Card Injection
Why 180 characters?
- User reads intro first
- Products appear as context loads
- Smooth UX transition
- Avoids flash of content
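On the client, handling the inject-cards event can be as simple as the sketch below. The event shapes mirror ResponseChunk above; the ui rendering callbacks are assumptions.

// Hypothetical client-side handler: append text as it streams and
// render product cards once, when the inject-cards event arrives.
function handleChunk(chunk: ResponseChunk, ui: {
  appendText(text: string): void;
  showCards(products: Product[]): void;
}): void {
  if (chunk.type === 'text') {
    ui.appendText(chunk.content);
  } else {
    ui.showCards(chunk.products); // fires once per response
  }
}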
Quality Controls
1. Repetition Detection
function detectRepetition(text: string): boolean {
  // Consecutive repetition: the same word 3+ times in a row
  const consecutivePattern = /(\b\w+\b)(\s+\1){2,}/gi;
  if (consecutivePattern.test(text)) return true;

  // Frequent repetition: any word appearing 3+ times within the last
  // 20 words, lowercased to catch case variants. Note that common
  // function words (e.g. 'ja', 'the') can trigger false positives.
  const words = text.toLowerCase().split(/\s+/).slice(-20);
  const wordCounts = new Map<string, number>();
  for (const word of words) {
    wordCounts.set(word, (wordCounts.get(word) || 0) + 1);
  }
  return Array.from(wordCounts.values()).some(count => count >= 3);
}
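The section above doesn't show where this check runs; one plausible wiring, assuming the streaming loop from generateResponse, is to test the accumulated text on each chunk and stop generating on detection.

// Assumed integration: inside the streaming loop of generateResponse.
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  accumulated += content;

  if (GENERATION_CONFIG.repetitionDetection && detectRepetition(accumulated)) {
    console.warn('Repetition detected, stopping stream early');
    break; // early exit ends consumption; the OpenAI SDK aborts the request
  }

  yield { type: 'text', content };
}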
2. Token Limit Monitoring
const TOKEN_LIMIT = 2500;
const WARNING_THRESHOLD = 0.9;

if (tokenCount >= TOKEN_LIMIT * WARNING_THRESHOLD) {
  console.warn('Approaching token limit:', {
    used: tokenCount,
    limit: TOKEN_LIMIT,
    remaining: TOKEN_LIMIT - tokenCount
  });
}
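The tokenCount here (and estimateTokens in the tests below) can come from a simple heuristic; the sketch assumes the common rule of thumb of roughly 4 characters per token rather than a real tokenizer.

// Rough heuristic (assumption): ~4 chars per token for mixed ET/EN
// text. A real tokenizer (e.g. tiktoken) gives exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}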
3. Product Mention Validation
function validateProductMentions(
  response: string,
  products: Product[]
): boolean {
  // Check that the response mentions at least one recommended product
  const mentioned = products.filter(p =>
    response.toLowerCase().includes(p.name.toLowerCase())
  );

  if (mentioned.length === 0) {
    console.warn("Response doesn't mention any products");
    return false;
  }
  return true;
}
Performance Metrics
Typical Execution:
Timing:
├─ Build system prompt: ~5ms
├─ Build user prompt: ~3ms
├─ OpenAI API TTFB: ~80ms
├─ First chunk arrives: <150ms ✓
├─ Streaming (200 tokens): ~1500ms
├─ Card injection: ~1ms
├─ Stream complete: ~2000ms
└─ Validation: ~5ms
───────────────────────────
Total: ~2100ms
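Capturing these numbers in production can be as simple as timestamping around the streaming loop, as in the sketch below; where the values are reported afterwards is left out.

// Hypothetical instrumentation: first-chunk latency and total time.
const start = performance.now();
let firstChunkMs = 0;

for await (const chunk of generateResponse(products, context, history)) {
  if (firstChunkMs === 0) {
    firstChunkMs = performance.now() - start; // target: <150ms
  }
  // ...forward chunk to the client...
}

const generationTimeMs = performance.now() - start; // target: ~2000ms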
Error Handling
Fallback Response
const FALLBACK_RESPONSES = {
  et: "Vabandust, ei suutnud vastust genereerida. Palun proovi uuesti.",
  en: "Sorry, I couldn't generate a response. Please try again."
};

try {
  for await (const chunk of generateResponse(products, context, history)) {
    yield chunk;
  }
} catch (error) {
  const lang = context.language === 'en' ? 'en' : 'et';
  yield { type: 'text', content: FALLBACK_RESPONSES[lang] };
}
Cut-off Handling
if (finishReason === 'length') {
  // Token limit reached mid-generation: close with a graceful
  // ellipsis instead of ending on an abrupt cut
  return response + '...';
}
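Where finishReason comes from: in the OpenAI streaming API it arrives on the final chunk's choice, so it has to be captured inside the streaming loop.

// finish_reason is null on intermediate chunks and set ('stop',
// 'length', ...) on the final one.
let finishReason: string | null = null;
for await (const chunk of stream) {
  finishReason = chunk.choices[0]?.finish_reason ?? finishReason;
  // ...existing accumulation and yielding...
}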
Testing
// Test helper: drain the stream and join the text chunks
async function collectText(gen: AsyncGenerator<ResponseChunk>): Promise<string> {
  let out = '';
  for await (const chunk of gen) {
    if (chunk.type === 'text') out += chunk.content;
  }
  return out;
}

describe('Response Generation', () => {
  const history: Message[] = [];

  it('generates Estonian response', async () => {
    const products = mockProducts(3);
    const context = { language: 'et', recipient: 'ema' } as GiftContext;
    const response = await collectText(generateResponse(products, context, history));
    expect(response).toContain('soovitan');
    expect(response.length).toBeGreaterThan(100);
  });

  it('mentions all products', async () => {
    const products = mockProducts(3);
    const context = { language: 'et' } as GiftContext;
    const response = await collectText(generateResponse(products, context, history));
    products.forEach(p => {
      expect(response.toLowerCase()).toContain(p.name.toLowerCase());
    });
  });

  it('respects token limit', async () => {
    const products = mockProducts(3);
    const context = { language: 'et' } as GiftContext;
    const response = await collectText(generateResponse(products, context, history));
    expect(estimateTokens(response)).toBeLessThan(2500);
  });
});
Monitoring
{
  generationTimeMs: number,    // Should be ~2000ms
  tokenCount: number,          // Should be <2500
  firstChunkMs: number,        // Should be <150ms
  repetitionDetected: boolean, // True in <1% of requests
  cutoffOccurred: boolean,     // True in <5% of requests
  productsMentioned: number,   // Should equal products.length

  // Quality
  responseLength: number,
  languageMatched: boolean,    // Response in the requested language
  formatValid: boolean         // Proper structure (intro, list, cards)
}
Cost Optimization
Strategies:
- Smart Prompts: Concise system prompts save tokens
- History Pruning: Only include the last 3 messages (see the sketch after this list)
- Early Stopping: Stop at 200 tokens if sufficient
- Caching: System prompt caching (when available)
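A minimal sketch of the history-pruning step, assuming the Message shape used above; the pruneHistory name is an assumption.

// Keep only the most recent exchanges to cap input tokens (~300 here).
function pruneHistory(history: Message[], keep = 3): Message[] {
  return history.slice(-keep);
}

// Usage in generateResponse (hypothetical):
// messages: [systemMsg, ...pruneHistory(conversationHistory), userMsg]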
Token Breakdown:
System Prompt: ~400 tokens
Conversation History: ~300 tokens
User Prompt: ~150 tokens
Product Context: ~200 tokens
─────────────────────────
Input: ~1050 tokens
Generated Response: ~200 tokens
─────────────────────────
Total: ~1250 tokens @ $0.002 per 1K tokens ≈ $0.0025 per request
Related Documentation
- Phase 4: Diversity Selection - Previous phase
- Phase 0: Context Detection - First phase
- Pipeline Models - Complete overview