Phase 3: Semantic Rerank

Model Selection

Model: llama-4-scout-17b-16e-instruct
Provider: Groq
Temperature: 0.3
Input: 10-20 finalist products
Output: Relevance scores (0-1)

Purpose

Score product candidates for semantic relevance to the user's gift intent:

Gift-fit scoring: How well does the product match the occasion and recipient?
Semantic relevance: Looks beyond keyword matching
Context awareness: Consider full conversation context
Quality filtering: Remove low-scoring results

Why LLaMA 4 Scout 17B?

Fast Reasoning

Scoring time: ~150ms for 10-20 products
Batch processing: Can score multiple items simultaneously
Quick response: Doesn't bottleneck the pipeline

Cost-Efficient

Rate: ~$0.30 per 1000 requests
Comparison: 70% cheaper than Cohere Rerank v3.5
Volume: Sustainable at high traffic

Context-Aware

Full gift context: Uses recipient, occasion, age in scoring
Conversation history: Considers previous interactions
Nuanced: Understands Estonian cultural context

Streaming Support 📡

Incremental results: Scores can be processed as they arrive
Early exit: Can stop once the top 3 are found
Non-blocking: Doesn't block other operations

Alternative: Cohere Rerank v3.5

Considered but not selected:

Better quality (marginally)
Higher cost (~$1.00 per 1000)
Higher latency (~300ms)
No conversation context support

Decision: LLaMA provides roughly 90% of the quality at 30% of the cost

Implementation

Location: services/rerank.ts:27-214, 219-266

async function rerankProducts(
  products: Product[],
  giftContext: GiftContext,
  userQuery: string
): Promise<ScoredProduct[]> {
  const prompt = buildRerankPrompt(products, giftContext, userQuery);
  
  const response = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    messages: [
      { role: 'system', content: RERANK_SYSTEM_PROMPT },
      { role: 'user', content: prompt }
    ],
    temperature: 0.3,
    response_format: { type: 'json_object' }
  });
  
  const scores = parseScores(response.choices[0].message.content);
  
  return products.map((product, i) => ({
    ...product,
    relevanceScore: scores[i],
    rerankSource: 'llama-4-scout-17b'
  }));
}
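
The function above relies on parseScores, which is not shown in this section. The following is a minimal sketch of what such a helper might look like, assuming the model returns the `{ "scores": [...] }` shape requested by the scoring prompt. The second parameter (`expectedLength`) and all error messages are assumptions for illustration, not the contents of services/rerank.ts; `null` is accepted defensively in case the response carries no content.

```typescript
// Hypothetical sketch of parseScores: extract and check the "scores"
// array from the model's JSON reply.
function parseScores(content: string | null, expectedLength: number): number[] {
  if (content === null) {
    throw new Error('Empty rerank response');
  }

  // JSON.parse throws a SyntaxError on malformed input; let it propagate
  // so the caller's fallback path can handle it.
  const parsed: unknown = JSON.parse(content);

  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Rerank response is not a JSON object');
  }
  const scores = (parsed as { scores?: unknown }).scores;
  if (!Array.isArray(scores)) {
    throw new Error('Rerank response has no "scores" array');
  }
  if (scores.length !== expectedLength) {
    throw new Error(
      'Expected ' + expectedLength + ' scores, got ' + scores.length
    );
  }

  // Reject anything that is not a finite number in [0, 1].
  return scores.map((s: unknown, i: number) => {
    if (typeof s !== 'number' || !isFinite(s) || s < 0 || s > 1) {
      throw new Error('Score at index ' + i + ' is not a number in [0, 1]');
    }
    return s;
  });
}
```

Throwing on a malformed reply, rather than returning partial scores, keeps a bad response from silently assigning `undefined` to `relevanceScore` when scores are matched to products by index.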

Scoring Prompt

const RERANK_SYSTEM_PROMPT = `
You are a gift recommendation expert. Score each product (0-1) for 
how well it fits the user's gift intent.

Consider:
- Recipient (age, gender, relationship)
- Occasion (birthday, holiday, thank you)
- Budget constraints
- Product category appropriateness
- Estonian cultural context (if applicable)

Return JSON: { "scores": [0.85, 0.72, ...] }
`;

Scoring Example

Input:

{
  "userQuery": "sünnipäevakingitus 10-aastasele tüdrukule",
  "recipient": "tüdruk",
  "ageGroup": "child",
  "occasion": "sünnipäev",
  "budget": { "min": null, "max": 30 },
  "products": [
    { "name": "Harry Potter raamat", "category": "Ilukirjandus", "price": 25 },
    { "name": "Kokaraamat", "category": "Teaduskirjandus", "price": 18 },
    { "name": "Lego komplekt", "category": "Mänguasjad", "price": 28 }
  ]
}

Output:

{
  "scores": [
    0.92,  // Harry Potter - perfect for 10yo girl
    0.45,  // Cookbook - not age appropriate
    0.88   // Lego - age appropriate, good gift
  ]
}
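
The input payload in this example could be assembled by a helper such as buildRerankPrompt, which the implementation calls but this section does not show. The sketch below is one possible version, assuming the `Product` and `GiftContext` fields match the example input; the real types and the instruction line prepended to the JSON are assumptions.

```typescript
// Field names mirror the example input above; the real Product and
// GiftContext types may differ.
interface Product {
  name: string;
  category: string;
  price: number;
}

interface GiftContext {
  recipient: string;
  ageGroup: string;
  occasion: string;
  budget: { min: number | null; max: number | null };
}

// Hypothetical sketch: serialize the query, context, and candidates
// into the user message. Only the fields the model needs are sent,
// and product order is preserved because scores are matched by index.
function buildRerankPrompt(
  products: Product[],
  giftContext: GiftContext,
  userQuery: string
): string {
  const payload = {
    userQuery: userQuery,
    recipient: giftContext.recipient,
    ageGroup: giftContext.ageGroup,
    occasion: giftContext.occasion,
    budget: giftContext.budget,
    products: products.map(p => ({
      name: p.name,
      category: p.category,
      price: p.price
    }))
  };
  return (
    'Score these ' + products.length + ' products in the order given.\n' +
    JSON.stringify(payload, null, 2)
  );
}
```

Stating the product count in the prompt gives the model an explicit target for the length of the `scores` array, which the caller can then check.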

Quality Thresholds

const QUALITY_THRESHOLDS = {
  PREFERRED: 0.5,    // High quality results
  MINIMUM: 0.3,      // Acceptable fallback
  REJECT: 0.2        // Too low, exclude
};

// Preferred approach
const highQuality = scored.filter(
  p => p.relevanceScore >= QUALITY_THRESHOLDS.PREFERRED
);

// Fallback if <3 results
if (highQuality.length < 3) {
  const mediumQuality = scored.filter(
    p => p.relevanceScore >= QUALITY_THRESHOLDS.MINIMUM
  );
  return mediumQuality.slice(0, 3);
}

return highQuality.slice(0, 20); // Pass to Phase 4

Performance Metrics

Typical Execution:

Input: 20 finalist products

Timing:
├─ Build prompt: ~20ms
├─ Groq API call: ~120ms
├─ Parse scores: ~5ms
├─ Apply scores: ~5ms
└─ Filter by threshold: ~5ms
────────────────────────
Total: ~155ms ✓

Output: 12 products (score ≥ 0.5)

Optimization Strategies

1. Batch Scoring

// Score up to 30 products in one call
const BATCH_SIZE = 30;

if (products.length > BATCH_SIZE) {
  // Split and score in parallel batches
  const batches = chunk(products, BATCH_SIZE);
  const results = await Promise.all(
    batches.map(batch => rerankProducts(batch, giftContext, userQuery))
  );
  return results.flat(); // Promise.all preserves batch order
}
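
The snippet above calls chunk without defining it. It may come from a utility library (lodash provides `_.chunk(array, size)`); if not, a small local helper is enough. The version below is a sketch under that assumption, not the project's own helper.

```typescript
// Split an array into consecutive slices of at most `size` items.
// The last slice may be shorter; an empty input yields no slices.
function chunk<T>(items: T[], size: number): T[][] {
  if (!(size >= 1) || Math.floor(size) !== size) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

One design consideration: scores produced by separate model calls are not guaranteed to be calibrated against each other, so a 0.8 from one batch and a 0.8 from another may not mean the same thing. Keeping the finalist count at or below the batch size avoids the issue.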

2. Skip if Few Finalists

const MIN_FINALISTS_FOR_RERANK = 3;

if (finalists.length < MIN_FINALISTS_FOR_RERANK) {
  // Don't waste time reranking 1-2 products
  return finalists;
}

3. Caching

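// Cache scores for identical products, context, and query
// (userQuery is part of the prompt, so it must be part of the key)
const cacheKey = hash({ products, giftContext, userQuery });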

if (rerankCache.has(cacheKey)) {
  return rerankCache.get(cacheKey)!;
}
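
The hash function used for the cache key is not specified here. Whatever it is, the key needs to be deterministic: `JSON.stringify` emits object keys in insertion order, so two equal contexts built in a different order would produce different keys. The sketch below shows one way to get a canonical string, assuming the inputs are plain JSON-like values; `stableStringify` and `rerankCacheKey` are hypothetical names, and the result is used directly as the key rather than being hashed.

```typescript
// Serialize a JSON-like value with object keys sorted, so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
// Array order is kept, since product order determines score order.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    // Primitives; undefined is written as null.
    return value === undefined ? 'null' : JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return '[' + value.map(v => stableStringify(v)).join(',') + ']';
  }
  const record = value as { [key: string]: unknown };
  const parts = Object.keys(record)
    .sort()
    .filter(key => record[key] !== undefined) // skip unset fields
    .map(key => JSON.stringify(key) + ':' + stableStringify(record[key]));
  return '{' + parts.join(',') + '}';
}

// Hypothetical key builder covering everything that feeds the prompt.
function rerankCacheKey(
  products: unknown[],
  giftContext: unknown,
  userQuery: string
): string {
  return stableStringify({
    products: products,
    giftContext: giftContext,
    userQuery: userQuery
  });
}
```

If key length becomes a concern, the canonical string can be passed through any digest function before being used as the map key.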

Error Handling

Fallback Strategy

try {
  const scored = await rerankProducts(products, context, userQuery);
  return scored;
} catch (error) {
  console.error('Rerank failed:', error);
  
  // Fallback: use search scores
  return products.map(p => ({
    ...p,
    // Use original score; ?? keeps a legitimate 0 instead of replacing it
    relevanceScore: p.searchScore ?? 0.5,
    rerankSource: 'fallback'
  }));
}

Validation

function validateScores(scores: number[], expectedLength: number): boolean {
  // Length matches products
  if (scores.length !== expectedLength) return false;

  // All scores are finite numbers between 0 and 1 (NaN fails this check)
  return scores.every(s => Number.isFinite(s) && s >= 0 && s <= 1);
}

Testing

describe('Semantic Rerank', () => {
  // Shared fixtures; `context` is a GiftContext defined elsewhere in the
  // test file, and the query string is a placeholder
  const userQuery = 'birthday gift';
  const products = [
    { name: 'Suitable Gift', category: 'Perfect' },
    { name: 'Poor Fit', category: 'Wrong' }
  ];

  it('scores products by relevance', async () => {
    const scored = await rerankProducts(products, context, userQuery);
    
    expect(scored[0].relevanceScore).toBeGreaterThan(0.7);
    expect(scored[1].relevanceScore).toBeLessThan(0.4);
  });
  
  it('handles API failures gracefully', async () => {
    mockGroqError();
    
    // Assumes the fallback strategy is applied inside rerankProducts
    const scored = await rerankProducts(products, context, userQuery);
    
    expect(scored[0].rerankSource).toBe('fallback');
    expect(scored.length).toBe(products.length);
  });
});

Monitoring

{
  rerankTimeMs: number,          // Should be <200ms
  candidatesIn: number,          // Finalists scored
  candidatesOut: number,         // After threshold filter
  averageScore: number,          // Quality indicator
  scoreDistribution: {
    high: number,    // score >= 0.7
    medium: number,  // 0.5 <= score < 0.7
    low: number      // score < 0.5
  },
  fallbackTriggered: boolean
}
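
The `averageScore` and `scoreDistribution` fields can be derived directly from the score array. The following is a sketch of that calculation using the bucket boundaries listed above; `summarizeScores` and `ScoreDistribution` are hypothetical names, and reporting an average of 0 for an empty input is an assumption.

```typescript
interface ScoreDistribution {
  high: number;   // score >= 0.7
  medium: number; // 0.5 <= score < 0.7
  low: number;    // score < 0.5
}

// Count scores per bucket and compute the mean in a single pass.
function summarizeScores(
  scores: number[]
): { averageScore: number; scoreDistribution: ScoreDistribution } {
  const scoreDistribution: ScoreDistribution = { high: 0, medium: 0, low: 0 };
  let total = 0;
  for (const s of scores) {
    total += s;
    if (s >= 0.7) scoreDistribution.high += 1;
    else if (s >= 0.5) scoreDistribution.medium += 1;
    else scoreDistribution.low += 1;
  }
  // Avoid dividing by zero when nothing was scored.
  const averageScore = scores.length > 0 ? total / scores.length : 0;
  return { averageScore, scoreDistribution };
}
```

For the three scores in the scoring example (0.92, 0.45, 0.88), this counts two in the high bucket and one in the low bucket.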

Comparison: LLaMA vs Cohere

Metric	LLaMA 4 Scout 17B	Cohere Rerank v3.5
Latency	~150ms	~300ms
Cost (per 1,000 requests)	~$0.30	~$1.00
Quality Score	4.2/5	4.5/5
Context Support	Full	Limited
Estonian Support	Native	Fair
Decision	Selected	Not selected (too expensive)

Related Documentation

Phase 1-2: Query & Search - Previous phase
Phase 4: Diversity Selection - Next phase
Pipeline Models - Complete overview
