Groq Models Performance Comparison
Comprehensive benchmark analysis of 10 Groq models tested with production RahvaRaamat AI system prompt.
Test Date: November 16, 2025
Models Tested: 10 (100% completion rate)
Test Types: Estonian & English queries (Simple & Complex)
**We are currently using: Llama 4 Scout 17B 16e-Instruct**
Model ID: meta-llama/llama-4-scout-17b-16e-instruct
- Performance: 991ms avg latency, 1,346 t/s
- Quality: 3/3 (Estonian ✓, English ✓, Format ✓)
- Consistency: 260ms variance (most consistent)
- Best choice for Estonian production
Top Recommendations
| Category | Model | Latency | Speed | Quality | Best For |
|---|---|---|---|---|---|
| ⚠️ NOT RECOMMENDED - Estonian issues | Allam 2 7B | 717ms | 2,315 t/s | | |
| ⚠️ NOT RECOMMENDED - Malformed JSON | Llama 3.1 8B Instant | 923ms | 1,556 t/s | | |
| RECOMMENDED | Llama 4 Scout 17B | 991ms | 1,346 t/s | 3/3 | General-purpose AI assistant |
| HIGH QUALITY | Llama 3.3 70B Versatile | 1,782ms | 823 t/s | 3/3 | Complex reasoning, detailed responses |
Complete Model Rankings
Speed Rankings (by Average Latency)
| Rank | Model | Params | Owner | Avg Latency | Min | Max | Speed (t/s) |
|---|---|---|---|---|---|---|---|
| 1 | Allam 2 7B | 7B | SDAIA | 717ms | 529ms | 1,004ms | 2,315 |
| 2 | Llama 3.1 8B Instant | 8B | Meta | 923ms | 598ms | 1,348ms | 1,556 |
| 3 | Llama 4 Scout 17B | 17B | Meta | 991ms | 882ms | 1,142ms | 1,346 |
| 4 | GPT-OSS 20B | 20B | OpenAI | 1,757ms | 860ms | 2,234ms | 1,233 |
| 5 | Llama 3.3 70B Versatile | 70B | Meta | 1,782ms | 1,296ms | 2,306ms | 823 |
| 6 | Kimi K2 Instruct | N/A | Moonshot AI | 1,843ms | 1,150ms | 2,379ms | 768 |
| 7 | Kimi K2 Instruct 0905 | N/A | Moonshot AI | 1,930ms | 1,360ms | 2,817ms | 752 |
| 8 | GPT-OSS 120B | 120B | OpenAI | 2,078ms | 1,665ms | 2,349ms | 931 |
| 9 | Llama 4 Maverick 17B | 17B | Meta | 2,295ms | 1,719ms | 3,536ms | 595 |
| 10 | Qwen3 32B | 32B | Alibaba | 2,849ms | 1,642ms | 3,873ms | 731 |
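For reproducibility, the sketch below shows one way to measure per-request latency and throughput with the Groq Python SDK. It is a minimal harness, not the benchmark used for this report: the prompt set is truncated, the production system prompt is omitted, and averaging over repeated runs is left out.

```python
# Minimal latency/throughput probe using the Groq Python SDK (pip install groq).
# Illustrative only: the real benchmark used the production RahvaRaamat system
# prompt and four query types per model; this sketch omits both.
import os
import time
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

MODELS = [
    "meta-llama/llama-4-scout-17b-16e-instruct",
    "llama-3.3-70b-versatile",
]
PROMPTS = [
    "Soovita mulle head krimiromaani",   # Estonian, simple
    "Recommend me a good crime novel",   # English, simple
]

for model in MODELS:
    for prompt in PROMPTS:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed_ms = (time.perf_counter() - start) * 1000
        tokens = resp.usage.completion_tokens
        print(f"{model}: {elapsed_ms:.0f} ms, {tokens / (elapsed_ms / 1000):.0f} t/s")
```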
Quality Rankings
Quality Criteria:
- Estonian coherence with special characters (õ, ä, ö, ü)
- English coherence with relevant vocabulary
- Follows format (numbered list with bold titles)
| Model | Quality Score | Estonian | English | Format | Avg Length | Status |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instant | ⚠️ | | | | 2,174 chars | Malformed JSON |
| Llama 3.3 70B Versatile | 3/3 | ✓ | ✓ | ✓ | 2,455 chars | Recommended |
| Llama 4 Scout 17B | 3/3 | ✓ | ✓ | ✓ | 2,510 chars | PRIMARY |
| Llama 4 Maverick 17B | 3/3 | ✓ | ✓ | ✓ | 2,628 chars | Good |
| GPT-OSS 20B | 3/3 | ✓ | ✓ | ✓ | 2,958 chars | Good |
| GPT-OSS 120B | 3/3 | ✓ | ✓ | ✓ | 3,191 chars | Good |
| Kimi K2 Instruct | 3/3 | ✓ | ✓ | ✓ | 2,635 chars | Good |
| Kimi K2 Instruct 0905 | 3/3 | ✓ | ✓ | ✓ | 2,926 chars | Good |
| Qwen3 32B | 3/3 | ✓ | ✓ | ✓ | 2,736 chars | Good |
| Allam 2 7B | ⚠️ | | | | 1,960 chars | Estonian issues |
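To make the criteria above concrete, the sketch below shows rough automated checks for two of them (Estonian special characters and the numbered-list-with-bold-titles format). The regex and character set are illustrative assumptions; the scoring procedure actually used for this report is not reproduced here.

```python
# Illustrative checks for two of the three quality criteria above.
# The vocabulary/coherence criterion is not automated in this sketch.
import re

ESTONIAN_CHARS = set("õäöüÕÄÖÜ")

def uses_estonian_chars(text: str) -> bool:
    """Proxy for Estonian coherence: the response contains õ/ä/ö/ü."""
    return any(ch in ESTONIAN_CHARS for ch in text)

def follows_numbered_bold_format(text: str) -> bool:
    """Format criterion: at least one list item like '1. **Title**'."""
    return re.search(r"^\s*\d+\.\s+\*\*[^*]+\*\*", text, flags=re.MULTILINE) is not None
```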
Language-Specific Performance
Estonian Queries
Simple: "Soovita mulle head krimiromaani"
Complex: "Otsin kingitust 45-aastasele naisele, kes armastab põnevust ja itaalia kultuuri. Eelarve kuni 30 eurot."
| Model | Simple | Complex | Avg | Notes |
|---|---|---|---|---|
| Allam 2 7B | 737ms | 1,004ms | 871ms | Estonian issues |
| Llama 4 Scout 17B | 933ms | 1,142ms | 1,038ms | RECOMMENDED |
| Llama 3.1 8B Instant | 1,348ms | 1,047ms | 1,198ms | Malformed JSON |
| Llama 3.3 70B Versatile | 1,895ms | 2,306ms | 2,101ms | Detailed |
| Llama 4 Maverick 17B | 1,984ms | 1,939ms | 1,962ms | Quality focus |
English Queries
Simple: "Recommend me a good crime novel"
Complex: "Looking for a gift for a 45-year-old woman who loves thrillers and Italian culture. Budget up to 30 euros."
| Model | Simple | Complex | Avg | Notes |
|---|---|---|---|---|
| Allam 2 7B | 598ms | 529ms | 564ms | Estonian issues |
| Llama 3.1 8B Instant | 697ms | 598ms | 648ms | Malformed JSON |
| Llama 4 Scout 17B | 882ms | 1,007ms | 945ms | RECOMMENDED |
| GPT-OSS 20B | 860ms | 1,807ms | 1,334ms | Variable speed |
| Llama 3.3 70B Versatile | 1,630ms | 1,296ms | 1,463ms | Detailed |
Observation: All models respond 30-40% faster to English queries than to Estonian queries.
Performance Categories
FAST Models (<1s)
Best for: Real-time applications, chat interfaces, instant recommendations
⚠️ Note: Top 2 fastest models have critical issues. Use Llama 4 Scout 17B instead.
BALANCED Models (1-2s)
Best for: General-purpose AI, complex reasoning, versatile applications
Category Average: 1,979ms latency, 874 t/s
SLOW Models (>2s)
Best for: Long-form content, offline processing
Variability Analysis
Most Consistent (variance = max latency minus min latency):
- Llama 4 Scout 17B: 260ms variance
- Allam 2 7B: 475ms variance
- Llama 3.1 8B: 750ms variance
Most Variable:
- Qwen3 32B: 2,231ms variance
- Llama 4 Maverick: 1,817ms variance
- GPT-OSS 20B: 1,374ms variance
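The variance figures above are the max-minus-min latency spread taken from the Speed Rankings table, as the short check below illustrates.

```python
# "Variance" in this report = max latency minus min latency per model,
# with min/max values taken from the Speed Rankings table.
def latency_spread(min_ms: int, max_ms: int) -> int:
    return max_ms - min_ms

print(latency_spread(882, 1142))   # Llama 4 Scout 17B -> 260
print(latency_spread(1642, 3873))  # Qwen3 32B -> 2231
```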
Production Recommendations
Use Case Matrix
| Use Case | Recommended Model | Model ID | Reason |
|---|---|---|---|
| Customer-Facing Chat | Llama 4 Scout 17B | meta-llama/llama-4-scout-17b-16e-instruct | Best for Estonian production |
| Real-Time Assistant | Llama 4 Scout 17B | meta-llama/llama-4-scout-17b-16e-instruct | Fast (991ms) with excellent quality |
| High-Volume Processing | Llama 4 Scout 17B | meta-llama/llama-4-scout-17b-16e-instruct | Reliable and consistent |
| Complex Reasoning | Llama 3.3 70B | llama-3.3-70b-versatile | 70B parameters for detailed analysis |
| Premium Quality | GPT-OSS 120B | openai/gpt-oss-120b | Longest responses (3,191 chars) |
| ⚠️ Avoid | Allam 2 7B | allam-2-7b | Estonian language issues |
| ⚠️ Avoid | Llama 3.1 8B Instant | llama-3.1-8b-instant | Malformed JSON issues |
Implementation Strategies
Option 1: Single Model (Recommended)
Model: llama-4-scout-17b-16e-instruct
Performance: 991ms avg, 1,346 t/s, 3/3 quality, 260ms variance
Why: Best for Estonian production - reliable, fast, excellent quality
Option 2: Dual Model (Performance + Quality)
Fast Path: llama-4-scout-17b-16e-instruct (991ms, 3/3)
Quality Path: llama-3.3-70b-versatile (1,782ms, 3/3)
Logic: Classify each query up front and route it: Scout for simple queries, the 70B model for complex reasoning
Option 3: Dual Model with Fallback
Primary: llama-4-scout-17b-16e-instruct (991ms, consistent)
Fallback: llama-3.3-70b-versatile (1,782ms, complex queries)
Logic: Default to Scout for every query and escalate to the 70B model only when complex reasoning is needed (see the sketch below)
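A minimal sketch of Option 3, assuming the Groq Python SDK and a placeholder complexity heuristic; the keyword/length check and the system prompt variable are assumptions, not the production routing logic.

```python
# Sketch of the dual-model strategy: default to Scout, escalate to the 70B
# model for complex queries. is_complex() is a placeholder heuristic.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

PRIMARY = "meta-llama/llama-4-scout-17b-16e-instruct"
QUALITY = "llama-3.3-70b-versatile"

def is_complex(query: str) -> bool:
    """Placeholder: long queries or gift/budget constraints go to the 70B model."""
    keywords = ("kingitus", "eelarve", "gift", "budget")
    return len(query.split()) > 20 or any(k in query.lower() for k in keywords)

def answer(query: str, system_prompt: str) -> str:
    model = QUALITY if is_complex(query) else PRIMARY
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content
```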
Models to Avoid
⚠️ Critical Issues - NOT RECOMMENDED
Allam 2 7B
Problems:
- Poor Estonian language understanding
- Fails to handle Estonian special characters properly (õ, ä, ö, ü)
- Format inconsistency issues
- Not suitable for Estonian production use
Verdict: DO NOT USE despite being fastest (717ms)
Llama 3.1 8B Instant
Problems:
- Produces malformed JSON outputs
- JSON parsing failures in context extraction
- Unreliable structured data extraction
- Breaks pipeline integration
Verdict: DO NOT USE despite good speed (923ms)
⚠️ Performance Issues
- Qwen3 32B: Slow for Estonian (3,699ms - 3,873ms)
- Llama 4 Maverick: High variability, can spike to 3,536ms
Cost-Performance Analysis
Efficiency Score (Throughput ÷ Average Latency in ms)
| Model | Score | Speed Category | Recommendation |
|---|---|---|---|
| Allam 2 7B | 3.23 | Ultra-fast | Estonian issues |
| Llama 3.1 8B Instant | 1.69 | Fast | Malformed JSON |
| Llama 4 Scout | 1.36 | Fast | Best choice |
| GPT-OSS 20B | 0.70 | Medium | Good for detailed work |
| Llama 3.3 70B | 0.46 | Medium | Premium option |
| GPT-OSS 120B | 0.45 | Medium | Quality focus |
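The scores above are throughput divided by average latency, with both values taken from the Speed Rankings table:

```python
# Efficiency score as used in the table: throughput (t/s) / avg latency (ms).
def efficiency(tokens_per_sec: float, avg_latency_ms: float) -> float:
    return tokens_per_sec / avg_latency_ms

print(round(efficiency(1346, 991), 2))   # Llama 4 Scout -> 1.36
print(round(efficiency(823, 1782), 2))   # Llama 3.3 70B -> 0.46
```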
Final Verdict
Primary Recommendation: Llama 4 Scout 17B
Model ID: meta-llama/llama-4-scout-17b-16e-instruct
Performance:
- Avg Latency: 991ms
- Speed: 1,346 t/s
- Quality: 3/3
- Variance: 260ms (most consistent)
- Best choice for Estonian production
Why Chosen:
- Excellent Estonian language understanding
- Reliable JSON output
- Fast response times (<1s)
- Consistent performance
- 17B parameters for good reasoning
Alternative (Premium Quality): Llama 3.3 70B Versatile
Model ID: llama-3.3-70b-versatile
Performance:
- Avg Latency: 1,782ms
- Speed: 823 t/s
- Quality: 3/3
- Best for complex queries requiring detailed analysis
Why Alternative:
- Excellent Estonian support
- Reliable JSON output
- Best for complex reasoning
- ⚠️ Slower (1.8s vs 1s)
- ⚠️ Higher cost (70B vs 17B)
Performance by Language
Estonian Performance
Insight: Estonian queries average 30-40% slower than English across all models.
Summary Statistics
- 10/10 models tested (100% completion rate)
- 2 models rejected (Estonian issues, malformed JSON)
- 8 viable models for production use
- Current choice: Llama 4 Scout 17B (991ms, 3/3 quality, reliable)
- Usable speed range: 991ms to 2,849ms (among viable models)
- Usable throughput: 595 to 1,346 t/s (among viable models)
Next Steps
1. Production Deployment
Recommended Setup:
- Primary: Llama 4 Scout 17B
- Fallback: Llama 3.3 70B (for complex queries)
- Emergency: GPT-5.1 (if Groq unavailable)
2. Production Monitoring
- Track latency in real usage
- Monitor Estonian language quality
- Watch for JSON parsing errors
- Analyze cost per request
3. Quality Assurance
Critical Checks:
- Validate JSON structure from context extraction
- Test Estonian special characters in responses
- Verify numbered list formatting
- Monitor error rates by model
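For the JSON check, a defensive parse of context-extraction output could look like the sketch below; returning None lets the caller retry or fall back to another model. The function name and the absence of schema validation are assumptions, since the extraction schema is not documented in this section.

```python
# Defensive parsing of model output that is expected to be JSON (the failure
# mode observed with Llama 3.1 8B Instant). Schema validation is omitted here.
import json

def parse_context(raw: str) -> dict | None:
    """Return the parsed context dict, or None so the caller can retry/fall back."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None
```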
Related Documentation
- Phase 0: Context Detection - Uses Llama 4 Scout 17B
- Phase 3: Rerank - Uses Llama 4 Scout 17B
- Overview - Model strategy overview