Phase 3: Semantic Rerank

Model Selection

Model: llama-4-scout-17b-16e-instruct
Provider: Groq
Temperature: 0.3
Input: 10-20 finalist products
Output: Relevance scores (0-1)

Purpose

Score product candidates for semantic relevance to the user's gift intent:

  • Gift-fit scoring: How well does the product match the occasion and recipient?
  • Semantic relevance: Beyond keyword matching
  • Context awareness: Consider full conversation context
  • Quality filtering: Remove low-scoring results

Why LLaMA 4 Scout 17B?

Fast Reasoning

  • Scoring time: ~150ms for 10-20 products
  • Batch processing: Can score multiple items simultaneously
  • Quick response: Doesn't bottleneck the pipeline

Cost-Efficient

  • Rate: ~$0.30 per 1000 requests
  • Comparison: 70% cheaper than Cohere Rerank v3.5
  • Volume: Sustainable at high traffic

Context-Aware

  • Full gift context: Uses recipient, occasion, age in scoring
  • Conversation history: Considers previous interactions
  • Nuanced: Understands Estonian cultural context

Streaming Support 📡

  • Incremental results: Scores can be processed as they arrive
  • Early exit: Stop when top 3 found
  • Non-blocking: Doesn't hold other operations

Alternative: Cohere Rerank v3.5

Considered but not selected:

  • Better quality (marginally)
  • Higher cost (~$1.00 per 1000)
  • Higher latency (~300ms)
  • No conversation context support

Decision: LLaMA provides roughly 90% of Cohere's quality at 30% of the cost
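
As a quick sanity check, the decision can be reproduced from the cost and quality figures in the comparison table at the end of this page. The snippet is only illustrative arithmetic; the variable names are made up for this sketch.

```typescript
// Figures copied from the "Comparison: LLaMA vs Cohere" table on this page.
const llama = { costPer1kRequests: 0.30, qualityOutOf5: 4.2 };
const cohere = { costPer1kRequests: 1.00, qualityOutOf5: 4.5 };

// Relative cost and quality of LLaMA, expressed as a fraction of Cohere.
const costRatio = llama.costPer1kRequests / cohere.costPer1kRequests;
const qualityRatio = llama.qualityOutOf5 / cohere.qualityOutOf5;

console.log(`cost: ${(costRatio * 100).toFixed(0)}% of Cohere`);       // cost: 30% of Cohere
console.log(`quality: ${(qualityRatio * 100).toFixed(0)}% of Cohere`); // quality: 93% of Cohere
```

On these numbers the quality ratio is closer to 93%, so the "90%" in the decision line is a conservative rounding.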

Implementation

Location: services/rerank.ts:27-214, 219-266

async function rerankProducts(
  products: Product[],
  giftContext: GiftContext,
  userQuery: string
): Promise<ScoredProduct[]> {
  // Build one prompt that lists all finalists together with the gift context
  const prompt = buildRerankPrompt(products, giftContext, userQuery);

  const response = await groq.chat.completions.create({
    model: 'llama-4-scout-17b-16e-instruct',
    messages: [
      { role: 'system', content: RERANK_SYSTEM_PROMPT },
      { role: 'user', content: prompt }
    ],
    temperature: 0.3,
    response_format: { type: 'json_object' }
  });

  // Scores are expected in the same order as `products`
  const scores = parseScores(response.choices[0].message.content);

  return products.map((product, i) => ({
    ...product,
    relevanceScore: scores[i],
    rerankSource: 'llama-4-scout-17b'
  }));
}
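
`buildRerankPrompt` is not shown on this page. A minimal sketch of what it might look like, assuming the field names from the scoring example later in this document and a numbered product list so the model can return scores in a fixed order. The `Product` and `GiftContext` shapes here are assumptions, not the project's real types.

```typescript
// Hypothetical shapes, modelled on the scoring example in this document.
interface Product {
  name: string;
  category: string;
  price: number;
}

interface GiftContext {
  recipient?: string;
  ageGroup?: string;
  occasion?: string;
  budget?: { min: number | null; max: number | null };
}

// Builds the user message: query, context, then one numbered line per product.
function buildRerankPrompt(
  products: Product[],
  giftContext: GiftContext,
  userQuery: string
): string {
  const productLines = products.map(
    (p, i) => `${i + 1}. ${p.name} | category: ${p.category} | price: ${p.price}`
  );

  return [
    `User query: ${userQuery}`,
    `Gift context: ${JSON.stringify(giftContext)}`,
    `Products (${products.length}):`,
    ...productLines,
    // Restating the count is intended to reduce length mismatches in the reply.
    `Return exactly ${products.length} scores, in the same order as the list.`
  ].join('\n');
}

const examplePrompt = buildRerankPrompt(
  [
    { name: 'Harry Potter raamat', category: 'Ilukirjandus', price: 25 },
    { name: 'Lego komplekt', category: 'Mänguasjad', price: 28 }
  ],
  { recipient: 'tüdruk', ageGroup: 'child', occasion: 'sünnipäev', budget: { min: null, max: 30 } },
  'sünnipäevakingitus 10-aastasele tüdrukule'
);
console.log(examplePrompt);
```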

Scoring Prompt

const RERANK_SYSTEM_PROMPT = `
You are a gift recommendation expert. Score each product (0-1) for
how well it fits the user's gift intent.

Consider:
- Recipient (age, gender, relationship)
- Occasion (birthday, holiday, thank you)
- Budget constraints
- Product category appropriateness
- Estonian cultural context (if applicable)

Return JSON: { "scores": [0.85, 0.72, ...] }
`;
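
The `parseScores` helper used in the implementation is not shown. A sketch of a parser for the `{ "scores": [...] }` reply requested by this prompt, assuming it receives the raw message content. It only checks shape and types here and leaves range and length checks to the validation step described later.

```typescript
// Hypothetical parser for the JSON reply; the real helper may differ.
function parseScores(content: string | null | undefined): number[] {
  if (!content) {
    throw new Error('Rerank response was empty');
  }

  const parsed: unknown = JSON.parse(content); // throws SyntaxError on invalid JSON

  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Rerank response is not a JSON object');
  }

  const scores = (parsed as { scores?: unknown }).scores;
  if (!Array.isArray(scores)) {
    throw new Error('Rerank response has no "scores" array');
  }

  // Reject anything that is not a finite number (strings, null, NaN, ...).
  return scores.map((value, i) => {
    if (typeof value !== 'number' || !Number.isFinite(value)) {
      throw new Error(`Score at index ${i} is not a finite number`);
    }
    return value;
  });
}

const parsedScores = parseScores('{ "scores": [0.85, 0.72] }');
console.log(parsedScores); // [ 0.85, 0.72 ]
```

Throwing on malformed output lets the caller's error handling decide what to do, rather than silently attaching `undefined` scores to products.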

Scoring Example

Input (the query is Estonian for "birthday gift for a 10-year-old girl"):

{
  "userQuery": "sünnipäevakingitus 10-aastasele tüdrukule",
  "recipient": "tüdruk",
  "ageGroup": "child",
  "occasion": "sünnipäev",
  "budget": { "min": null, "max": 30 },
  "products": [
    { "name": "Harry Potter raamat", "category": "Ilukirjandus", "price": 25 },
    { "name": "Kokaraamat", "category": "Teaduskirjandus", "price": 18 },
    { "name": "Lego komplekt", "category": "Mänguasjad", "price": 28 }
  ]
}

Output:

{
  "scores": [
    0.92, // Harry Potter - perfect for 10yo girl
    0.45, // Cookbook - not age appropriate
    0.88  // Lego - age appropriate, good gift
  ]
}

Quality Thresholds

const QUALITY_THRESHOLDS = {
  PREFERRED: 0.5, // High quality results
  MINIMUM: 0.3,   // Acceptable fallback
  REJECT: 0.2     // Too low, exclude
};

// Preferred approach: keep only high-quality results
const highQuality = scored.filter(p => p.relevanceScore >= QUALITY_THRESHOLDS.PREFERRED);

// Fallback if fewer than 3 results pass the preferred threshold
if (highQuality.length < 3) {
  const mediumQuality = scored.filter(p => p.relevanceScore >= QUALITY_THRESHOLDS.MINIMUM);
  return mediumQuality.slice(0, 3);
}

return highQuality.slice(0, 20); // Pass to Phase 4
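
The fragment above can be wrapped into a small function and run against the scores from the scoring example. This is a sketch, assuming a `relevanceScore` field as in the implementation section; the names `filterByQuality`, `MIN_RESULTS`, and `MAX_RESULTS` are made up here.

```typescript
interface Scored {
  name: string;
  relevanceScore: number;
}

const PREFERRED = 0.5;  // same value as QUALITY_THRESHOLDS.PREFERRED
const MINIMUM = 0.3;    // same value as QUALITY_THRESHOLDS.MINIMUM
const MIN_RESULTS = 3;  // hypothetical name for the "fewer than 3" rule
const MAX_RESULTS = 20; // cap on what is passed to Phase 4

function filterByQuality<T extends Scored>(scored: T[]): T[] {
  const highQuality = scored.filter(p => p.relevanceScore >= PREFERRED);
  if (highQuality.length >= MIN_RESULTS) {
    return highQuality.slice(0, MAX_RESULTS);
  }
  // Too few strong matches: relax to the minimum threshold, return at most 3.
  return scored.filter(p => p.relevanceScore >= MINIMUM).slice(0, MIN_RESULTS);
}

// Only two items reach 0.5, so the fallback path applies and all three are kept.
const kept = filterByQuality([
  { name: 'Harry Potter raamat', relevanceScore: 0.92 },
  { name: 'Kokaraamat', relevanceScore: 0.45 },
  { name: 'Lego komplekt', relevanceScore: 0.88 }
]);
console.log(kept.map(p => p.name));
```

Neither the fragment nor this sketch sorts by score, so both rely on the input already being ordered; if it is not, `slice` may drop a stronger candidate in favour of a weaker one that happens to come first.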

Performance Metrics

Typical Execution:

Input: 20 finalist products

Timing:
├─ Build prompt: ~20ms
├─ Groq API call: ~120ms
├─ Parse scores: ~5ms
├─ Apply scores: ~5ms
└─ Filter by threshold: ~5ms
────────────────────────
Total: ~155ms ✓

Output: 12 products (score ≥ 0.5)

Optimization Strategies

1. Batch Scoring

// Score up to 30 products in one call
const BATCH_SIZE = 30;

if (products.length > BATCH_SIZE) {
  // Split and score in parallel batches; Promise.all preserves batch order
  const batches = chunk(products, BATCH_SIZE);
  const results = await Promise.all(
    batches.map(batch => rerankProducts(batch, giftContext, userQuery))
  );
  return results.flat();
}
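
`chunk` is not defined on this page (it may come from a utility library such as lodash). A dependency-free sketch follows, together with a stand-in scorer showing that `Promise.all` returns batch results in input order, so the flattened scores still line up with the original list. `fakeScoreBatch` and `scoreInBatches` are placeholders for illustration, not the real Groq call.

```typescript
// Splits an array into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Placeholder scorer: echoes each id back so ordering can be checked.
async function fakeScoreBatch(batch: number[]): Promise<number[]> {
  return batch.map(id => id);
}

async function scoreInBatches(ids: number[], batchSize: number): Promise<number[]> {
  const batches = chunk(ids, batchSize);
  // Promise.all resolves to results in the same order as the input promises.
  const results = await Promise.all(batches.map(batch => fakeScoreBatch(batch)));
  return results.flat();
}

const ids = Array.from({ length: 65 }, (_, i) => i);
console.log(chunk(ids, 30).map(b => b.length)); // [ 30, 30, 5 ]
```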

2. Skip if Few Finalists

const MIN_FINALISTS_FOR_RERANK = 3;

if (finalists.length < MIN_FINALISTS_FOR_RERANK) {
  // Don't waste time reranking 1-2 products
  return finalists;
}

3. Caching

// Cache scores for identical products, context, and query
const cacheKey = hash({ products, giftContext, userQuery });

if (rerankCache.has(cacheKey)) {
  return rerankCache.get(cacheKey)!;
}
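
`hash` and `rerankCache` are not defined on this page. One way to get a deterministic key is to serialize the inputs with sorted object keys, so that two contexts with the same fields in a different order map to the same entry. A sketch under that assumption: `stableStringify` and `getOrCompute` are hypothetical helpers, and the cache is a plain in-memory `Map` with no expiry or size limit.

```typescript
// Serializes a value with object keys sorted, so key order does not affect the result.
function stableStringify(value: unknown): string {
  if (value === undefined) return 'null';
  if (value === null || typeof value !== 'object') {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map(item => stableStringify(item)).join(',')}]`;
  }
  const obj = value as Record<string, unknown>;
  const body = Object.keys(obj)
    .sort()
    .map(key => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
    .join(',');
  return `{${body}}`;
}

const rerankCache = new Map<string, number[]>();

// Returns cached scores when the key is known, otherwise computes and stores them.
function getOrCompute(key: string, compute: () => number[]): number[] {
  const cached = rerankCache.get(key);
  if (cached !== undefined) return cached;
  const fresh = compute();
  rerankCache.set(key, fresh);
  return fresh;
}

const keyA = stableStringify({ userQuery: 'q', giftContext: { occasion: 'sünnipäev', ageGroup: 'child' } });
const keyB = stableStringify({ giftContext: { ageGroup: 'child', occasion: 'sünnipäev' }, userQuery: 'q' });
console.log(keyA === keyB); // true
```

The compute callback is synchronous here to keep the sketch short; a real version would cache the result of the async rerank call instead.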

Error Handling

Fallback Strategy

try {
  const scored = await rerankProducts(products, giftContext, userQuery);
  return scored;
} catch (error) {
  console.error('Rerank failed:', error);

  // Fallback: keep the original search score, or a neutral 0.5 if none exists
  return products.map(p => ({
    ...p,
    relevanceScore: p.searchScore ?? 0.5,
    rerankSource: 'fallback'
  }));
}

Validation

function validateScores(scores: number[], expectedLength: number): boolean {
  // Length matches the number of products that were scored
  if (scores.length !== expectedLength) return false;

  // All scores are finite numbers between 0 and 1
  if (scores.some(s => !Number.isFinite(s) || s < 0 || s > 1)) return false;

  return true;
}

Testing

describe('Semantic Rerank', () => {
  // `giftContext` and `userQuery` are fixtures assumed to be defined in the test setup
  const products = [
    { name: 'Suitable Gift', category: 'Perfect' },
    { name: 'Poor Fit', category: 'Wrong' }
  ];

  it('scores products by relevance', async () => {
    const scored = await rerankProducts(products, giftContext, userQuery);

    expect(scored[0].relevanceScore).toBeGreaterThan(0.7);
    expect(scored[1].relevanceScore).toBeLessThan(0.4);
  });

  it('handles API failures gracefully', async () => {
    mockGroqError();

    const scored = await rerankProducts(products, giftContext, userQuery);

    expect(scored[0].rerankSource).toBe('fallback');
    expect(scored.length).toBe(products.length);
  });
});

Monitoring

{
  rerankTimeMs: number,      // Should be <200ms
  candidatesIn: number,      // Finalists scored
  candidatesOut: number,     // After threshold filter
  averageScore: number,      // Quality indicator
  scoreDistribution: {
    high: number,            // score >= 0.7
    medium: number,          // 0.5 <= score < 0.7
    low: number              // score < 0.5
  },
  fallbackTriggered: boolean
}
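
A sketch of how this record could be filled from one rerank run, using the bucket boundaries from the comments above. The function name `buildRerankMetrics` and its parameters are assumptions made for illustration.

```typescript
interface RerankMetrics {
  rerankTimeMs: number;
  candidatesIn: number;
  candidatesOut: number;
  averageScore: number;
  scoreDistribution: { high: number; medium: number; low: number };
  fallbackTriggered: boolean;
}

function buildRerankMetrics(
  scores: number[],       // one score per finalist that was scored
  candidatesOut: number,  // how many survived the threshold filter
  rerankTimeMs: number,
  fallbackTriggered: boolean
): RerankMetrics {
  const distribution = { high: 0, medium: 0, low: 0 };
  let sum = 0;

  for (const score of scores) {
    sum += score;
    if (score >= 0.7) distribution.high += 1;
    else if (score >= 0.5) distribution.medium += 1;
    else distribution.low += 1;
  }

  return {
    rerankTimeMs,
    candidatesIn: scores.length,
    candidatesOut,
    // Avoid dividing by zero when nothing was scored.
    averageScore: scores.length > 0 ? sum / scores.length : 0,
    scoreDistribution: distribution,
    fallbackTriggered
  };
}

// Scores from the scoring example, with the ~155ms total from the timing breakdown.
const metrics = buildRerankMetrics([0.92, 0.45, 0.88], 3, 155, false);
console.log(metrics.scoreDistribution); // { high: 2, medium: 0, low: 1 }
```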

Comparison: LLaMA vs Cohere

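| Metric           | LLaMA 4 Scout 17B | Cohere Rerank v3.5 |
|------------------|-------------------|--------------------|
| Latency          | ~150ms            | ~300ms             |
| Cost (1k req)    | $0.30             | $1.00              |
| Quality Score    | 4.2/5             | 4.5/5              |
| Context Support  | Full              | Limited            |
| Estonian Support | Native            | ⚠️ Fair            |
| Choice           | Selected          | Too expensive      |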