or-llm-pipeline' => [
                'connection' => 'redis',
                'queue' => ['inference-batch', 'inference-realtime', 'inference-async'],
                'balance' => 'auto',
                'autoScalingStrategy' => 'time',
                'minProcesses' => 4,
                'maxProcesses' => 16,
                'balanceMaxShift' => 3,
                'balanceCooldown' => 8,
                'timeout' => 300,
                'sleep' => 5,
                'tries' => 5,
                'nice' => 0,
            ],
            'supervisor-standard' => [
                'connection' => 'redis',
                'queue' => ['default', 'emails', 'webhooks'],
                'balance' => 'simple',
                'minProcesses' => 2,
                'maxProcesses' => 8,
                'timeout' => 60,
                'sleep' => 3,
                'tries' => 3,
            ],
        ],
    ],
];
```

**Architecture Rationale:**
- `autoScalingStrategy: time` scales on the estimated time needed to clear each queue rather than on the number of waiting jobs. Queue length is misleading for AI workloads: three queued jobs at 90 seconds each represent 4.5 minutes of work for a single worker, while three queued emails clear in seconds. Time-based scaling provisions workers according to the wait users will actually experience.
- `balanceCooldown: 8` prevents thrashing. Inference workloads often arrive in bursts (e.g., batch document uploads). A 3-second cooldown causes the auto-balancer to over-provision, then rapidly scale down, wasting Redis connections and CPU cycles.
- `timeout: 300` establishes a hard ceiling. This is not a target execution time; it is a safety net. If jobs routinely approach 120 seconds, prompt optimization or context window reduction is required.
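
The gap between queue length and queue time can be made concrete with a little arithmetic. The sketch below is not Horizon's internal algorithm: `estimateDrainSeconds` is a hypothetical helper that assumes every job has the same average runtime and that each worker processes one job at a time.

```php
<?php

/**
 * Rough time to clear a queue when jobs run in "waves" of one job per worker.
 * Hypothetical helper for illustration only; not part of Horizon.
 */
function estimateDrainSeconds(int $pendingJobs, float $avgRuntimeSeconds, int $workers): float
{
    if ($workers < 1) {
        throw new InvalidArgumentException('At least one worker is required.');
    }

    // ceil() because a partially filled wave still takes a full job runtime.
    $waves = (int) ceil($pendingJobs / $workers);

    return $waves * $avgRuntimeSeconds;
}

// Same queue length, very different wait: three emails vs. three inference calls.
$emailDrain     = estimateDrainSeconds(3, 0.5, 1);  // 1.5 seconds on one worker
$inferenceDrain = estimateDrainSeconds(3, 90.0, 1); // 270 seconds (4.5 minutes) on one worker
$scaledDrain    = estimateDrainSeconds(3, 90.0, 3); // 90 seconds with three workers
```

A size-based strategy sees "3" in both cases; a time-based one sees 1.5 seconds versus 270 seconds, which is why it reacts to inference backlogs much earlier.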
### Step 2: Align the Process Manager Grace Period
Horizon runs as a daemon. During deployments, the process manager (Supervisord, systemd, or PM2) sends a termination signal. If the grace period is shorter than Horizon's timeout, in-flight inference calls are killed mid-stream.
```ini
; /etc/supervisor/conf.d/laravel-horizon.conf
[program:horizon-worker]
process_name=%(program_name)s
command=php /var/www/app/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/app/storage/logs/horizon-worker.log
stopwaitsecs=360
```

**Architecture Rationale:**
- `stopwaitsecs` must exceed the Horizon `timeout` by at least 60 seconds. This guarantees that a worker processing a 240-second inference call can complete the request, persist results, and gracefully exit before the OS forces termination. Rolling deployments will no longer truncate active API calls.

### Step 3: Design the Job Class for State Awareness and Rate Limit Resilience
The supervisor defines the outer boundary. The job class defines internal behavior. AI inference jobs require an explicit timeout declaration, escalating backoff, rate limit differentiation, and partial state preservation.
```php
<?php

namespace App\Jobs\Inference;

use App\Models\AnalysisTask;
use App\Services\InferenceClient;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;

class ExecuteModelInference implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Stays below the supervisor timeout (300s) so the job-level timeout fires first.
    public int $timeout = 240;

    // Laravel gives retryUntil() precedence over $tries, so this only applies if the deadline is removed.
    public int $tries = 5;

    // Caps genuine failures; attempts released for rate limiting are not exceptions.
    public int $maxExceptions = 5;

    // Delay in seconds before each successive retry.
    public array $backoff = [30, 60, 120, 180, 240];

    public function __construct(
        public readonly string $taskId,
        public readonly string $targetModel,
        public readonly array $payload,
    ) {}

    public function middleware(): array
    {
        return [new RateLimited('llm-inference-gateway')];
    }

    public function handle(InferenceClient $client): void
    {
        $task = AnalysisTask::findOrFail($this->taskId);

        try {
            $result = $client->generate(
                model: $this->targetModel,
                payload: $this->payload,
                // Leave headroom so the HTTP call fails before the worker timeout fires.
                timeout: $this->timeout - 10,
            );

            $task->update([
                'status' => 'completed',
                'output_text' => $result->text,
                'input_tokens' => $result->usage->promptTokens,
                'output_tokens' => $result->usage->completionTokens,
                'completed_at' => now(),
            ]);
        } catch (\Throwable $exception) {
            if ($this->isRateLimitSignal($exception)) {
                // attempts() is 1-based; fall back to the longest delay past the schedule.
                $delay = $this->backoff[$this->attempts() - 1] ?? 240;
                $this->release($delay);

                return;
            }

            Log::error('Inference execution failed', [
                'task_id' => $this->taskId,
                'attempt' => $this->attempts(),
                'model' => $this->targetModel,
                'error' => $exception->getMessage(),
            ]);

            throw $exception;
        }
    }

    public function failed(?\Throwable $exception): void
    {
        AnalysisTask::where('id', $this->taskId)->update([
            'status' => 'failed',
            'failure_reason' => $exception?->getMessage(),
            'partial_output' => $this->extractPartialState(),
            'failed_at' => now(),
        ]);

        Log::critical('Inference job exhausted retry budget', [
            'task_id' => $this->taskId,
            'model' => $this->targetModel,
        ]);
    }

    public function retryUntil(): \DateTime
    {
        return now()->addHours(4);
    }

    private function isRateLimitSignal(\Throwable $e): bool
    {
        // Message matching is a heuristic; prefer the client's status code if it exposes one.
        $message = strtolower($e->getMessage());

        return str_contains($message, '429')
            || str_contains($message, 'rate_limit')
            || str_contains($message, 'too_many_requests');
    }

    private function extractPartialState(): ?string
    {
        // Retrieve cached chunks or streaming buffer if available
        return cache()->get("inference_partial_{$this->taskId}");
    }
}
```
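
**Architecture Rationale:**
- `$timeout = 240` sits below the supervisor's 300-second limit. The job-level timeout therefore fires first, so Laravel's own timeout handling can record the failure and call `failed()` once the retry budget is spent, instead of the worker being killed from outside with an uncatchable SIGKILL. The connection's `retry_after` value in `config/queue.php` should in turn exceed the supervisor timeout, otherwise a second worker can pick up the job while the first attempt is still running.
- `$this->release()` is used for rate limits instead of throwing. A released job is re-queued with a delay and is not recorded as an exception, but it still counts as an attempt, so `$tries` alone cannot separate a 429 from a real failure. Because `retryUntil()` is defined, Laravel gives it precedence over `$tries`: the deadline bounds the total retry window, and a `$maxExceptions` property caps genuine failures. A 429 is then treated as a scheduling event rather than a failure.
- `retryUntil()` enforces a business deadline. The backoff schedule alone covers roughly ten minutes, but repeated rate-limit releases can stretch a job's lifetime to hours. If the inference result is only valuable within a 4-hour window, this prevents wasteful retries on stale requests.
- `failed()` preserves partial state. Long-context jobs often cache intermediate chunks. Storing `partial_output` enables resume logic or manual inspection, reducing redundant API costs.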
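
The delay lookup inside the rate-limit branch can be isolated and extended to honor a provider hint. This is a minimal sketch under stated assumptions: `resolveReleaseDelay` is a hypothetical helper, and `$retryAfter` stands for the value of an HTTP `Retry-After` header, which your inference client may or may not expose.

```php
<?php

/**
 * Picks the delay (in seconds) before a rate-limited job is released back to the queue.
 * Hypothetical helper; $retryAfter is assumed to come from the provider's Retry-After header.
 */
function resolveReleaseDelay(array $backoff, int $attempt, ?int $retryAfter = null, int $fallback = 240): int
{
    // Attempt numbers are 1-based, so attempt 1 maps to index 0 of the schedule.
    $scheduled = $backoff[$attempt - 1] ?? $fallback;

    // Never retry sooner than the provider asked for.
    return max($scheduled, $retryAfter ?? 0);
}

$backoff = [30, 60, 120, 180, 240];

$first    = resolveReleaseDelay($backoff, 1);     // 30: first entry of the schedule
$hinted   = resolveReleaseDelay($backoff, 1, 45); // 45: the provider hint is longer
$overflow = resolveReleaseDelay($backoff, 9);     // 240: past the end of the schedule
```

Inside the job, the result would be passed straight to `$this->release()`.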

### Step 4: Register Granular Rate Limiters
The `RateLimited` middleware requires a named limiter. Global limits work for single-tenant setups, but multi-tenant applications require scoped throttling to prevent noisy neighbors from blocking inference pipelines.
```php
<?php
// app/Providers/AppServiceProvider.php

namespace App\Providers;

use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Support\Facades\RateLimiter;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        RateLimiter::for('llm-inference-gateway', function (object $job) {
            // Jobs without a tenantId share a single platform-wide bucket.
            $tenantScope = $job->tenantId ?? 'platform-wide';

            // Anthropic Tier 2: ~1,000 RPM | OpenAI Tier 3: ~5,000 RPM
            // Start conservative; adjust based on actual provider quota and cost targets.
            return Limit::perMinute(80)->by("tenant:{$tenantScope}");
        });
    }
}
```

**Architecture Rationale:**
- Scoping by tenant isolates rate limit exhaustion. If one tenant triggers a burst, other tenants' inference jobs continue processing. The limit should align with your provider tier, but always leave headroom for retry backoff and network variance.
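
The isolation effect of a keyed limiter can be demonstrated without any framework. The class below is an illustration only: `TenantWindowLimiter` is a hypothetical in-memory fixed-window counter, whereas Laravel's rate limiter is backed by the cache store and has its own window semantics.

```php
<?php

/**
 * Minimal in-memory fixed-window limiter keyed by tenant.
 * Illustration only; Laravel's RateLimiter uses the cache store, not a PHP array.
 */
final class TenantWindowLimiter
{
    /** @var array<string, int> hits per "tenant|minute" bucket */
    private array $hits = [];

    public function __construct(private int $maxPerMinute) {}

    public function attempt(string $tenantId, int $timestamp): bool
    {
        // All timestamps within the same minute share one bucket.
        $bucket = "tenant:{$tenantId}|" . intdiv($timestamp, 60);
        $used = $this->hits[$bucket] ?? 0;

        if ($used >= $this->maxPerMinute) {
            return false; // a queue worker would release the job at this point
        }

        $this->hits[$bucket] = $used + 1;

        return true;
    }
}

$limiter = new TenantWindowLimiter(2);
$now = 1_700_000_000;

$limiter->attempt('acme', $now);
$limiter->attempt('acme', $now + 1);
$acmeThrottled = $limiter->attempt('acme', $now + 2);   // false: acme has used its budget
$globexAllowed = $limiter->attempt('globex', $now + 2); // true: globex has its own bucket
```

With a single shared key, the fourth call would have been rejected as well, which is the noisy-neighbor effect that tenant scoping avoids.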

## Pitfall Guide

### 1. Timeout Parity Trap
**Explanation:** Setting the job `$timeout` equal to or greater than the Horizon supervisor `timeout` risks silent termination: the worker process can be killed before Laravel runs its failure handling.
**Fix:** Set the job `$timeout` to about 80% of the supervisor limit. For a 300-second supervisor, use 240 seconds on the job.

### 2. Treating 429 as a Hard Failure
**Explanation:** Throwing an exception on rate limit responses counts against the job's exception budget and logs a failure for what is a provider-side scheduling signal, not an application bug.
**Fix:** Use `$this->release($delay)` for 429 responses so the job is re-queued after the provider's recovery window. A released job still counts as an attempt, so bound retries with `retryUntil()` and cap genuine failures with `$maxExceptions` rather than relying on `$tries` alone.

### 3. Queue Length Scaling Fallacy
**Explanation:** Scaling workers based on job count ignores execution duration. Three waiting jobs are trivial for email dispatch but catastrophic for inference.
**Fix:** Use `autoScalingStrategy: time`. Horizon will provision workers based on the estimated time needed to clear the queue, aligning capacity with latency requirements.

### 4. Silent Deployment Truncation
**Explanation:** Leaving `stopwaitsecs` at the default 10 seconds in Supervisord causes rolling deployments to kill in-flight inference calls. Users receive empty responses without error logs.
**Fix:** Set `stopwaitsecs` to `supervisor_timeout + 60`. Verify with a staging deployment that long-running jobs complete before the process exits.
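
The timeout relationships from pitfalls 1 and 4 can be checked mechanically, for example in a deployment smoke test. The sketch below is hypothetical: `findTimeoutViolations` and its array keys are invented for illustration, and `retry_after` is an assumption referring to the queue connection setting in `config/queue.php`, which this guide does not show.

```php
<?php

/**
 * Lists violations of the chain:
 *   job timeout < supervisor timeout < retry_after, and
 *   stopwaitsecs >= supervisor timeout + grace margin.
 * Hypothetical helper; the array keys are assumptions chosen for this sketch.
 */
function findTimeoutViolations(array $cfg, int $graceMargin = 60): array
{
    $violations = [];

    if ($cfg['job_timeout'] >= $cfg['supervisor_timeout']) {
        $violations[] = 'job_timeout must be lower than supervisor_timeout';
    }

    if ($cfg['retry_after'] <= $cfg['supervisor_timeout']) {
        $violations[] = 'retry_after must be higher than supervisor_timeout';
    }

    if ($cfg['stopwaitsecs'] < $cfg['supervisor_timeout'] + $graceMargin) {
        $violations[] = 'stopwaitsecs must be at least supervisor_timeout + ' . $graceMargin;
    }

    return $violations;
}

$production = [
    'job_timeout'        => 240,
    'supervisor_timeout' => 300,
    'retry_after'        => 330, // assumed value; not taken from this guide
    'stopwaitsecs'       => 360,
];

$violations = findTimeoutViolations($production); // empty for the values above
```

Running the same check against a Supervisord default of `stopwaitsecs=10` reports the grace-period violation before it shows up as truncated jobs in production.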

### 5. State Wipe on Failure
**Explanation:** Standard `failed()` methods often reset status fields without preserving intermediate work. For expensive context assembly or chunking, this forces full recomputation.
**Fix:** Implement partial state caching during execution. Store intermediate results in Redis or a dedicated `partial_output` column, and persist them in `failed()` for audit or resume capabilities.
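
One way to picture the caching side is a small buffer that appends streamed chunks under the same `inference_partial_{taskId}` key that `extractPartialState()` reads. This is a sketch, not the author's implementation: `PartialOutputBuffer` is hypothetical, and its array-backed store is a stand-in for Redis or Laravel's cache.

```php
<?php

/**
 * Collects streamed chunks under a per-task key so a failure handler can recover them.
 * The array-backed store is an assumption standing in for Redis or Laravel's cache.
 */
final class PartialOutputBuffer
{
    /** @var array<string, string> */
    private array $store = [];

    public function append(string $taskId, string $chunk): void
    {
        $key = "inference_partial_{$taskId}";
        $this->store[$key] = ($this->store[$key] ?? '') . $chunk;
    }

    public function get(string $taskId): ?string
    {
        return $this->store["inference_partial_{$taskId}"] ?? null;
    }

    public function forget(string $taskId): void
    {
        unset($this->store["inference_partial_{$taskId}"]);
    }
}

$buffer = new PartialOutputBuffer();

// Simulate a stream that dies after two chunks.
foreach (['The contract ', 'states that '] as $chunk) {
    $buffer->append('task-42', $chunk);
}

$recovered = $buffer->get('task-42'); // "The contract states that "
```

On success the job would call `forget()` so stale fragments are not attached to a later failure of the same task.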

### 6. Global Rate Limiter Bottlenecks
**Explanation:** Using a single global rate limiter in multi-tenant applications causes one tenant's burst to throttle all other tenants' inference pipelines.
**Fix:** Scope the limiter using `by("tenant:{$id}")`. Adjust limits per tier if you offer different SLAs.

### 7. Missing Idempotency Keys
**Explanation:** AI providers may process duplicate requests if network timeouts cause Laravel to retry. Without idempotency, you pay twice and generate conflicting outputs.
**Fix:** Generate a deterministic `idempotency_key` from the task ID and payload. Pass it to the provider API where the endpoint supports idempotent retries; where it does not, use the same key to deduplicate on your side before sending the request.
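
A deterministic key needs a canonical form of the payload, otherwise two arrays with the same content but different key order hash differently. The following is a minimal sketch; `canonicalize` and `idempotencyKey` are hypothetical helpers, and how the key is transmitted depends on the provider.

```php
<?php

/**
 * Recursively sorts associative arrays by key so that logically equal payloads
 * serialize to the same JSON string regardless of insertion order.
 */
function canonicalize(array $data): array
{
    foreach ($data as $key => $value) {
        if (is_array($value)) {
            $data[$key] = canonicalize($value);
        }
    }

    // Leave lists alone: element order is meaningful (e.g. chat messages).
    if (!array_is_list($data)) {
        ksort($data);
    }

    return $data;
}

/**
 * Builds a SHA-256 key from the task, the model, and the canonical payload.
 */
function idempotencyKey(string $taskId, string $model, array $payload): string
{
    $canonical = json_encode(canonicalize($payload), JSON_THROW_ON_ERROR);

    return hash('sha256', $taskId . '|' . $model . '|' . $canonical);
}

// Key order differs, content is identical, so both calls produce the same key.
$keyA = idempotencyKey('task-42', 'model-x', ['temperature' => 0.2, 'prompt' => 'Summarize']);
$keyB = idempotencyKey('task-42', 'model-x', ['prompt' => 'Summarize', 'temperature' => 0.2]);
```

Because the key depends only on the job's constructor arguments, every retry of the same job produces the same value. The sketch requires PHP 8.1 or later for `array_is_list()`.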

## Production Bundle

### Action Checklist

### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / Low Volume | Single supervisor, global rate limit, 3 retries | Simplicity reduces operational overhead while validating product-market fit | Low infrastructure cost; acceptable retry waste |
| Multi-Tenant SaaS | Dedicated AI supervisor, tenant-scoped rate limits, 5 retries with backoff | Prevents noisy neighbor throttling and aligns scaling with actual latency | Moderate increase in Redis connections; reduced API waste from failed retries |
| Batch Processing / High Throughput | Time-based scaling, partial state caching, idempotency keys, 300s timeout | Handles burst uploads without blocking realtime queues; enables resume on failure | Higher worker count during peaks; significant savings from partial state reuse |

### Configuration Template
```php
// config/horizon.php
return [
    'environments' => [
        'production' => [
            'supervisor-llm-pipeline' => [
                'connection' => 'redis',
                'queue' => ['inference-batch', 'inference-realtime'],
                'balance' => 'auto',
                'autoScalingStrategy' => 'time',
                'minProcesses' => 4,
                'maxProcesses' => 16,
                'balanceMaxShift' => 3,
                'balanceCooldown' => 8,
                'timeout' => 300,
                'sleep' => 5,
                'tries' => 5,
                'nice' => 0,
            ],
        ],
    ],
];
```

```ini
; /etc/supervisor/conf.d/laravel-horizon.conf
[program:horizon-worker]
process_name=%(program_name)s
command=php /var/www/app/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/app/storage/logs/horizon-worker.log
stopwaitsecs=360
```

### Quick Start Guide
- **Install Horizon & Publish Config:** Run `composer require laravel/horizon && php artisan horizon:install`. Open `config/horizon.php` and replace the default supervisor block with the AI-optimized template.
- **Align Process Manager:** Update your Supervisord program file or systemd unit file. For Supervisord, set `stopwaitsecs=360` and reload the service manager (`supervisorctl reread && supervisorctl update`). For systemd, the equivalent setting is `TimeoutStopSec`.
- **Create the Job Class:** Generate a new job (`php artisan make:job ExecuteModelInference`). Implement `$timeout`, `$tries`, `$backoff`, and the `release()` pattern for rate limits. Register the `RateLimited` middleware.
- **Deploy & Validate:** Push to staging. Dispatch a test job with a large context window. Monitor Horizon's dashboard for queue wait time scaling. Verify that 429 responses trigger `release()` and re-queue the job with the expected delay instead of failing it. Confirm that the task's `partial_output` is populated when a job fails or times out.