Resilience & Error Handling
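The pipeline layers several resilience mechanisms (timeouts, per-type error recovery, UX guards, and persistence safeguards) so that the user experience stays reliable when requests fail, streams stall, or the backend returns bad data.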
Timeout Strategies
Request Timeout
Value: 30 seconds
Location: config.ts and pipeline.ts:287-319
Purpose: Catch requests that never receive a response
Behavior:
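const requestTimeout = setTimeout(() => {
  controller.abort();
  setError({
    type: 'timeout',
    message: 'Request took too long to respond',
    canRetry: true
  });
}, 30000);
// Clear it once the response arrives so a late abort cannot fire:
// clearTimeout(requestTimeout);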
User Experience:
- Error message displayed: "Request timeout. Please try again."
- Retry button shown
- Loading state cleared
- Previous messages remain intact
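The request timeout can be packaged as a reusable helper. The sketch below is not the code in pipeline.ts; withTimeout and TimeoutError are hypothetical names. It races the work against a deadline, aborts the shared AbortController when the deadline wins, and always clears the timer so a fast response can never trigger a late abort.

```typescript
// Hypothetical helper names: withTimeout and TimeoutError are illustrative only.
class TimeoutError extends Error {
  readonly type = 'timeout';
  readonly canRetry = true;

  constructor(ms: number) {
    super(`Request took too long to respond (no response within ${ms} ms)`);
    this.name = 'TimeoutError';
  }
}

async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so the race settles with TimeoutError, then cancel the work.
      reject(new TimeoutError(ms));
      controller.abort();
    }, ms);
  });

  try {
    // Whichever settles first wins.
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    // Always clear the timer so a fast response never triggers a late abort.
    clearTimeout(timer);
  }
}
```

In the pipeline this might wrap the chat request, e.g. withTimeout((signal) => fetch('/api/chat', { method: 'POST', body: JSON.stringify(payload), signal }), 30000), with the caller mapping a TimeoutError to the retryable timeout state.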
Inactivity Timeout
Value: 45 seconds
Location: pipeline.ts:793-804
Purpose: Detect streams that stall mid-response
Behavior:
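let inactivityTimeout: ReturnType<typeof setTimeout> | undefined;
// Each call restarts the 45 s countdown
const resetInactivityTimer = () => {
  clearTimeout(inactivityTimeout);
  inactivityTimeout = setTimeout(() => {
    controller.abort();
    setError({
      type: 'stalled',
      message: 'Stream stalled. Connection lost.',
      canRetry: true
    });
  }, 45000);
};
resetInactivityTimer(); // start when the stream opens
// Reset on each event: resetInactivityTimer();
// Stop on completion: clearTimeout(inactivityTimeout);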
User Experience:
- Partial response kept if any text received
- Error appended: "Connection interrupted. Retry?"
- Retry preserves conversation context
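The inactivity check only works if the countdown restarts on every event. A minimal, framework-free sketch of that pattern, assuming a hypothetical InactivityWatchdog class that is not part of the pipeline: touch() pushes the deadline back, stop() cancels it when the stream finishes, and onStall runs at most once.

```typescript
// Hypothetical helper: restarts a countdown on every event and reports a stall
// when no event arrives within limitMs.
class InactivityWatchdog {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private stalled = false;
  private stopped = false;
  private readonly limitMs: number;
  private readonly onStall: () => void;

  constructor(limitMs: number, onStall: () => void) {
    this.limitMs = limitMs;
    this.onStall = onStall;
    this.arm();
  }

  /** Call on every received event to push the deadline back. */
  touch(): void {
    if (!this.stalled && !this.stopped) this.arm();
  }

  /** Call when the stream completes or is aborted for another reason. */
  stop(): void {
    this.stopped = true;
    clearTimeout(this.timer);
  }

  get hasStalled(): boolean {
    return this.stalled;
  }

  private arm(): void {
    clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.stalled = true;
      this.onStall();
    }, this.limitMs);
  }
}
```

Wired into the stream, this could look like new InactivityWatchdog(45000, () => controller.abort()), with touch() called from the event handler and stop() called on the done event.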
Error Types & Recovery
1. Network Errors
Causes:
- Loss of internet connection
- Server unreachable
- DNS failures
Detection:
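try {
  const response = await fetch('/api/chat', {
    method: 'POST',
    body: JSON.stringify(payload),
    signal: controller.signal // lets the timeouts above cancel the request
  });
  if (!response.ok) {
    // fetch() resolves on HTTP 4xx/5xx, so these are server errors, not network errors
  }
} catch (networkError) {
  // Rejects with a TypeError on network failure, or an AbortError after controller.abort()
  // Handle network failure
}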
Recovery:
- Show network error message
- Enable retry button
- Preserve user's input
- Don't persist failed message
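Because the timeouts also abort the request, a rejected fetch is not always a network error. fetch() rejects with a TypeError when the request itself fails and with an AbortError when its signal is aborted; HTTP error statuses resolve normally and must be checked through response.ok. A small classifier, assuming the pipeline records why it called controller.abort() (the abortReason parameter and the FailureKind names are hypothetical):

```typescript
// Hypothetical failure categories for a rejected fetch() call.
type FailureKind = 'timeout' | 'stalled' | 'cancelled' | 'network' | 'unknown';

// abortReason: why the pipeline aborted the request, or null if it did not.
function classifyFetchFailure(
  err: unknown,
  abortReason: 'timeout' | 'stalled' | null
): FailureKind {
  const errName =
    typeof err === 'object' && err !== null
      ? (err as { name?: unknown }).name
      : undefined;

  // An aborted fetch rejects with an AbortError (a DOMException in browsers).
  if (errName === 'AbortError') return abortReason ?? 'cancelled';

  // fetch() rejects with a TypeError when the request itself fails
  // (offline, DNS failure, server unreachable).
  if (err instanceof TypeError) return 'network';

  return 'unknown';
}
```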
2. Parse Errors
Causes:
- Malformed JSON in SSE events
- Invalid event format
- Corrupted data
Detection:
try {
const data = JSON.parse(event.data);
} catch (parseError) {
console.error('Event parse failed:', {
event: event.type,
raw: event.data,
error: parseError instanceof Error ? parseError.message : String(parseError)
});
// Continue processing other events
}
Recovery:
- Log error for debugging
- Skip malformed event
- Continue processing stream
- Don't crash UI
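To see the skip-and-continue behavior end to end, here is a simplified parser for the text/event-stream wire format that JSON-decodes each event's data and counts failures instead of throwing. It is a sketch, not the pipeline's parser: it assumes the input holds complete events (no buffering across network reads) and ignores the id and retry fields; parseSseChunk is a hypothetical name.

```typescript
interface ParsedEvent {
  type: string;
  data: unknown;
}

// Simplified text/event-stream parser: assumes `text` holds complete events,
// ignores id/retry fields and lines without a field name.
function parseSseChunk(text: string): { events: ParsedEvent[]; parseErrors: number } {
  const events: ParsedEvent[] = [];
  let parseErrors = 0;

  // Events are separated by a blank line.
  for (const block of text.split(/\r?\n\r?\n/)) {
    let type = 'message'; // default SSE event type
    const dataLines: string[] = [];

    for (const line of block.split(/\r?\n/)) {
      const colon = line.indexOf(':');
      if (colon <= 0) continue; // comment (":...") or no field name
      const field = line.slice(0, colon);
      const value = line.slice(colon + 1).replace(/^ /, ''); // drop one leading space
      if (field === 'event') type = value || 'message';
      else if (field === 'data') dataLines.push(value);
    }

    if (dataLines.length === 0) continue; // nothing to dispatch

    const raw = dataLines.join('\n');
    try {
      events.push({ type, data: JSON.parse(raw) });
    } catch (err) {
      // Log, skip this event, and keep processing the rest of the stream.
      parseErrors++;
      console.error('Event parse failed:', {
        event: type,
        raw,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }

  return { events, parseErrors };
}
```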
3. Validation Errors
Causes:
- Missing required fields
- Invalid data types
- Out-of-range values
Detection:
if (!isValidProduct(product)) {
console.warn('Invalid product:', product);
return; // Skip invalid item
}
Recovery:
- Filter out invalid items
- Show valid data
- Log validation failures
- Don't break entire response
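isValidProduct is referenced above but not shown. A sketch as a TypeScript type guard, assuming a hypothetical Product shape with id, name, and price; each check maps to one of the listed causes (missing field, wrong type, out-of-range value), and the filter keeps the valid items instead of failing the whole response.

```typescript
// Hypothetical product shape; the real fields are defined by the pipeline's types.
interface Product {
  id: string;
  name: string;
  price: number;
}

// Type guard: true only if `value` has every required field with a valid type and range.
function isValidProduct(value: unknown): value is Product {
  if (typeof value !== 'object' || value === null) return false;
  const { id, name, price } = value as Record<string, unknown>;
  return (
    typeof id === 'string' && id.length > 0 && // required, non-empty
    typeof name === 'string' && // correct type
    typeof price === 'number' && Number.isFinite(price) && price >= 0 // in range
  );
}

// Keep valid items, log the rest, and never throw on bad data.
function filterValidProducts(items: unknown[]): Product[] {
  const valid: Product[] = [];
  for (const item of items) {
    if (isValidProduct(item)) valid.push(item);
    else console.warn('Invalid product:', item);
  }
  return valid;
}
```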
4. Streaming Interruptions
Causes:
- Backend crashes mid-response
- Connection drops
- Browser tab suspended
Detection:
- Inactivity timeout triggers
- Stream ends without a done event
- Unexpected EOF
Recovery:
- Preserve partial response
- Mark as incomplete
- Show retry option
- Explain what happened
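One way to tell a finished stream from an interrupted one is to treat the done event as the only proof of completion. The sketch below assumes the stream has already been parsed into an async iterable of hypothetical PipelineEvent values; an EOF without done, or a thrown error, yields an incomplete result that still carries the partial text.

```typescript
// Hypothetical event shape after SSE parsing.
type PipelineEvent = { type: 'text'; data: string } | { type: 'done' };

interface CollectedResponse {
  text: string;
  status: 'complete' | 'incomplete';
  error?: string;
}

async function collectResponse(events: AsyncIterable<PipelineEvent>): Promise<CollectedResponse> {
  let text = '';
  try {
    for await (const ev of events) {
      // Only an explicit done event proves the response is complete.
      if (ev.type === 'done') return { text, status: 'complete' };
      text += ev.data;
    }
    // The stream closed without a done event: unexpected EOF.
    return { text, status: 'incomplete', error: 'Stream ended before the done event' };
  } catch (err) {
    // Connection dropped or stream aborted: keep the partial text.
    return {
      text,
      status: 'incomplete',
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```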
UX Guards
Skeleton Control
Location: pipeline.ts:778-788 and config.ts:3-16
Logic:
const shouldShowSkeleton =
toolsExpected &&
(accumulatedText.length >= 160 ||
Date.now() - firstEventTime >= 800);
if (shouldShowSkeleton && !showSkeleton) {
setShowSkeleton(true);
}
Purpose:
- Don't show skeleton too early (looks glitchy)
- Ensure skeleton appears if products are coming
- Hide once real products arrive
Configuration:
const MESSAGE_STREAMING_CONFIG = {
SKELETON_CHAR_THRESHOLD: 160, // characters
SKELETON_TIME_THRESHOLD_MS: 800 // milliseconds
}
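The skeleton rule reads naturally as a pure function of the stream state, which also makes it easy to unit test. This sketch uses the two thresholds from the configuration above; SkeletonInput and the injected now timestamp are assumptions for testability, not the pipeline's actual signature.

```typescript
// Only the two keys used here; the real config.ts may contain more.
interface SkeletonConfig {
  SKELETON_CHAR_THRESHOLD: number;
  SKELETON_TIME_THRESHOLD_MS: number;
}

const MESSAGE_STREAMING_CONFIG: SkeletonConfig = {
  SKELETON_CHAR_THRESHOLD: 160, // characters
  SKELETON_TIME_THRESHOLD_MS: 800, // milliseconds
};

// Hypothetical input shape; `now` is injected so the rule is deterministic in tests.
interface SkeletonInput {
  toolsExpected: boolean;
  accumulatedTextLength: number;
  firstEventTime: number; // ms timestamp of the first stream event
  now: number; // current ms timestamp
}

function shouldShowSkeleton(
  input: SkeletonInput,
  config: SkeletonConfig = MESSAGE_STREAMING_CONFIG
): boolean {
  if (!input.toolsExpected) return false; // no products coming: never show
  return (
    input.accumulatedTextLength >= config.SKELETON_CHAR_THRESHOLD ||
    input.now - input.firstEventTime >= config.SKELETON_TIME_THRESHOLD_MS
  );
}
```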
Context Preservation
Location: pipeline.ts:488-540, 752-759, 988-1002
Strategy:
// Set from metadata
if (metadata.contextData) {
setContextData(metadata.contextData);
}
// Preserve through streaming
onNewMessage({
...assistantMessage,
contextData: contextData // Keep context
});
// Persist with final message
finalMessage.contextData = contextData;
Purpose:
- Context pills always display correctly
- "Show more" functionality works
- Conversation history maintains context
- Debugging has full information
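The strategy above boils down to keeping the latest contextData in one place and stamping it onto every message. If contextData is React state, reading it in the same callback that calls setContextData returns the previous value, so a mutable holder (similar in spirit to a React ref) is one way to avoid attaching stale context. A framework-free sketch; createContextTracker is a hypothetical name:

```typescript
// Hypothetical metadata shape: only the field this sketch needs.
interface Metadata<C> {
  contextData?: C;
}

function createContextTracker<C>() {
  let latest: C | undefined; // read at call time, never a captured copy

  return {
    // Overwrite only when the metadata actually carries context.
    update(metadata: Metadata<C>): void {
      if (metadata.contextData !== undefined) latest = metadata.contextData;
    },
    // Return a copy of the message with the latest context attached.
    attach<M extends object>(message: M): M & { contextData?: C } {
      return { ...message, contextData: latest };
    },
  };
}
```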
Show-More Safety
Location: pipeline.ts:172-214, 574-640
Mechanism:
// Include in payload
payload.excludeProductIds = Array.from(shownProductIds);
payload.lastSearchParams = lastSearchParams;
// Update from metadata
if (metadata.searchParams) {
setLastSearchParams(metadata.searchParams);
}
// Propagate to next request
const newExcludes = [
...excludeProductIds,
...newProductIds
];
setShownProductIds(new Set(newExcludes));
Purpose:
- Never show duplicate products
- Maintain search continuity
- Preserve user's exploration path
- Handle pool exhaustion gracefully
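The exclusion bookkeeping above can be expressed as two pure functions over a hypothetical ShowMoreState: one builds the request fields, the other folds a response's product ids and search params back into state. Using a Set makes duplicate ids collapse automatically, and params from the previous search are kept when a response carries none. The type and function names are illustrative, not taken from pipeline.ts.

```typescript
// Hypothetical types for the show-more bookkeeping.
type SearchParams = Record<string, unknown>;

interface ShowMoreState {
  shownProductIds: ReadonlySet<string>;
  lastSearchParams: SearchParams | null;
}

// Fields sent with the next "show more" request.
function buildShowMoreFields(state: ShowMoreState) {
  return {
    excludeProductIds: Array.from(state.shownProductIds),
    lastSearchParams: state.lastSearchParams,
  };
}

// Fold a response back into state without mutating the previous state.
function applyResponse(
  state: ShowMoreState,
  newProductIds: string[],
  searchParams?: SearchParams
): ShowMoreState {
  const shown = new Set(state.shownProductIds); // copy, then add; duplicates collapse
  for (const id of newProductIds) shown.add(id);
  return {
    shownProductIds: shown,
    // Keep the previous params when the response carries none.
    lastSearchParams: searchParams ?? state.lastSearchParams,
  };
}
```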
Persistence Strategy
Location:
- User: pipeline.ts:226-266
- Assistant: index.ts:302-425 and pipeline.ts:902-1018
User Message Strategy:
// Fire-and-forget
persistUserMessage(userMessage).catch(error => {
console.error('User message save failed:', error);
// Don't block UI, log for retry
});
Assistant Message Strategy:
// Silent persistence
try {
await saveMessage(finalMessage, {
silent: true, // Don't overwrite UI
preserveContext: true
});
} catch (error) {
console.error('Assistant save failed:', error);
// Schedule retry
scheduleRetry(finalMessage);
}
Purpose:
- User message saved ASAP
- Assistant saved only when complete
- Don't lose context/metrics
- Retry silently on failure
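scheduleRetry is not shown in this section. A common way to retry a failed save is exponential backoff; the sketch below assumes a hypothetical retryWithBackoff helper with illustrative defaults (3 retries, 500 ms base delay), not values taken from the pipeline.

```typescript
// Hypothetical helper; attempt count and delays are illustrative defaults.
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  { retries = 3, baseDelayMs = 500 }: { retries?: number; baseDelayMs?: number } = {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= retries) throw error; // out of retries: surface the last error
      const delay = baseDelayMs * 2 ** attempt; // 500, 1000, 2000, ... ms
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

For the assistant message it might be called as retryWithBackoff(() => saveMessage(finalMessage, { silent: true, preserveContext: true })). Adding random jitter to the delay is a common refinement when many clients may retry at once.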
Error Messages
User-Facing Messages
const ERROR_MESSAGES = {
NETWORK: 'Connection lost. Check your internet and try again.',
TIMEOUT: 'Request took too long. The server might be busy.',
STALLED: 'Stream interrupted. Your partial response is saved.',
SERVER_ERROR: 'Something went wrong on our end. We\'re working on it.',
PARSE_ERROR: 'Received invalid data. Please retry your request.',
VALIDATION_ERROR: 'Some results couldn\'t be displayed. Try refining your search.'
};
Developer Messages
// Logged to console with full context
console.error('Pipeline error:', {
type: 'timeout',
stage: 'streaming',
duration: elapsed,
partialData: {
textLength: text.length,
productsReceived: products.length,
eventsProcessed: eventCount
},
recovery: 'user-retry-available'
});
Monitoring Integration
Metrics Captured
{
// Success metrics
ttfcMs: number,
totalDurationMs: number,
eventsReceived: number,
productsDelivered: number,
// Error metrics
timeouts: number,
parseErrors: number,
networkErrors: number,
retries: number,
// Recovery metrics
partialSuccesses: number,
retrySuccessRate: number
}
Alert Thresholds
const ALERT_THRESHOLDS = {
TIMEOUT_RATE: 0.05, // >5% timeouts → alert
ERROR_RATE: 0.10, // >10% errors → alert
RETRY_RATE: 0.15, // >15% retries → alert
TTFC_P95: 2000, // p95 TTFC >2s → alert
PARSE_ERROR_RATE: 0.01 // >1% parse errors → alert
}
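To make the thresholds concrete, the sketch below evaluates one aggregation window against them. The MetricsWindow shape is hypothetical, rates are assumed to be per request, and TTFC_P95 is compared against a nearest-rank 95th percentile of the window's TTFC samples; the real monitoring backend may aggregate differently.

```typescript
// Hypothetical aggregation window; field names are illustrative.
interface MetricsWindow {
  requests: number;
  timeouts: number;
  errors: number;
  retries: number;
  parseErrors: number;
  ttfcSamplesMs: number[];
}

const ALERT_THRESHOLDS = {
  TIMEOUT_RATE: 0.05,
  ERROR_RATE: 0.1,
  RETRY_RATE: 0.15,
  TTFC_P95: 2000,
  PARSE_ERROR_RATE: 0.01,
} as const;

// Nearest-rank percentile: smallest sample with at least p% of samples <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function evaluateAlerts(w: MetricsWindow): string[] {
  if (w.requests === 0) return [];
  const rate = (n: number) => n / w.requests;
  const triggered: string[] = [];
  if (rate(w.timeouts) > ALERT_THRESHOLDS.TIMEOUT_RATE) triggered.push('TIMEOUT_RATE');
  if (rate(w.errors) > ALERT_THRESHOLDS.ERROR_RATE) triggered.push('ERROR_RATE');
  if (rate(w.retries) > ALERT_THRESHOLDS.RETRY_RATE) triggered.push('RETRY_RATE');
  if (rate(w.parseErrors) > ALERT_THRESHOLDS.PARSE_ERROR_RATE) triggered.push('PARSE_ERROR_RATE');
  if (percentile(w.ttfcSamplesMs, 95) > ALERT_THRESHOLDS.TTFC_P95) triggered.push('TTFC_P95');
  return triggered;
}
```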
Testing Resilience
Timeout Testing
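// Simulate a backend that never responds; fake timers skip the real 30 s wait
test('handles request timeout gracefully', async () => {
  jest.useFakeTimers();
  // fetch never settles on its own and rejects only when the request is aborted
  global.fetch = jest.fn((_url, init) => new Promise((_resolve, reject) => {
    init.signal.addEventListener('abort', () => reject(new DOMException('Aborted', 'AbortError')));
  }));
  const pending = streamMessage(input);
  await jest.advanceTimersByTimeAsync(30000);
  const result = await pending;
  expect(result.error).toBe('timeout');
  expect(result.canRetry).toBe(true);
  jest.useRealTimers();
});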
Network Failure Testing
// Simulate connection loss
test('recovers from network interruption', async () => {
// fetch rejects with a TypeError when the network is unreachable
global.fetch = jest.fn().mockRejectedValue(new TypeError('Failed to fetch'));
const result = await streamMessage(input);
expect(result.error).toBe('network');
expect(result.partialData).toBeDefined();
});
Parse Error Testing
// Malformed JSON
test('continues despite parse errors', async () => {
const mockEvents = [
{ type: 'text', data: '"valid "' },
{ type: 'metadata', data: '{invalid json}' },
{ type: 'text', data: '"also valid"' }
];
const result = await streamMessage(input, mockEvents);
expect(result.text).toContain('valid also valid');
expect(result.parseErrors).toBe(1);
});
Best Practices
1. Always Provide Retry
Every error should offer a retry option unless it's a validation error.
2. Preserve Partial Success
If the user has already received useful content, keep it and let them continue from there.
3. Log Everything
Rich error context helps diagnose production issues.
4. Time Out Early
It is better to time out and retry than to hang indefinitely.
5. Fail Gracefully
The UI should never crash because of backend issues.
Related Documentation
- Lifecycle & Flow - Normal operation flow
- Event Handling - Event parsing and validation
- Extension Guide - Adding resilient features