feat: expand ai retry policy
16
README.md
@@ -85,6 +85,8 @@ scheduling.
returning secrets.
- `GET /api/v1/infra/status` returns AI-server sidecar telemetry
(GPU, containers and vLLM live metrics) when configured.
- `GET /health/detail` returns PostgreSQL, provider, queue, error, throughput
and infra components for Portal `admin/health`.
- `GET /healthz` returns process health.
- `GET /readyz` checks PostgreSQL readiness.
- Built-in workers expose open Kubernetes endpoints on `WORKER_HTTP_PORT`:
@@ -94,6 +96,20 @@ All `/api/v1/*` endpoints require `Authorization: Bearer <AI_SERVICE_TOKEN>`
when `AI_SERVICE_TOKEN` is configured. Health and readiness endpoints stay open
for Kubernetes probes.
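
The access rule above can be sketched as a small check. This is an illustration only, not the service's actual middleware: the function name `is_authorized` and its parameters are hypothetical, and treating every path outside `/api/v1/` as open is an assumption (the README only states this for health and readiness endpoints).

```python
import hmac
from typing import Optional


def is_authorized(path: str, auth_header: Optional[str],
                  service_token: Optional[str]) -> bool:
    """Decide whether a request may proceed (hypothetical helper)."""
    # Only /api/v1/* is protected, and only when a token is configured.
    if not path.startswith("/api/v1/") or not service_token:
        return True
    expected = f"Bearer {service_token}".encode()
    supplied = (auth_header or "").encode()
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(supplied, expected)
```

With a token configured, `is_authorized("/api/v1/infra/status", None, token)` is rejected, while `/healthz` and `/readyz` pass without a header so Kubernetes probes keep working.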

## Retry policy

Workers store a normalized `error_code` on failed jobs. AI Service requeues only
explicitly retryable categories while attempts remain.

| Category | Retry | Delay |
| --- | --- | --- |
| `provider_unavailable`, `model_unavailable`, `provider_error`, `dependency_error`, `timeout`, `storage_error`, `stale_worker` | yes | 30s |
| `bad_response`, `transcript_hallucination`, `transcript_incomplete`, `internal_error`, `unknown` | yes | 2m |
| `bad_audio`, `bad_input`, `context_length`, `unsupported_task`, `cancelled` | no | - |

Domain services may still expose manual retry for terminal errors after the
underlying data or prompt is corrected.
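
As a rough illustration of the policy table, the requeue decision could look like the sketch below. The error codes and delays are taken from the table; the function name `retry_delay` and the `attempts` / `max_attempts` parameters are hypothetical, and the real attempt accounting may differ.

```python
from datetime import timedelta
from typing import Optional

# Delays per error code, copied from the retry policy table.
RETRY_DELAYS = {
    **dict.fromkeys(
        ("provider_unavailable", "model_unavailable", "provider_error",
         "dependency_error", "timeout", "storage_error", "stale_worker"),
        timedelta(seconds=30),
    ),
    **dict.fromkeys(
        ("bad_response", "transcript_hallucination", "transcript_incomplete",
         "internal_error", "unknown"),
        timedelta(minutes=2),
    ),
}


def retry_delay(error_code: str, attempts: int,
                max_attempts: int) -> Optional[timedelta]:
    """Return the requeue delay, or None if the job stays failed."""
    # Assumption: `attempts` counts the tries already made.
    if attempts >= max_attempts:
        return None
    # Codes missing from the map (e.g. `bad_audio`) are terminal, since only
    # explicitly retryable categories are requeued.
    return RETRY_DELAYS.get(error_code)
```

Using an allow-list means a new or unrecognized `error_code` is not retried automatically until it is added to the map.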

## Configuration

- `HTTP_HOST`, default `0.0.0.0`