# AI Service

Technical AI job service for Portal workloads.

AI Service owns the technical AI job lifecycle, provider execution and metrics.
Business data stays in domain services such as `telephony`, `monitoring-tg` and
`monitoring-pf`.

## Generic job contract

The service is intentionally domain-agnostic:

- `owner_service` names the caller, for example `telephony`, `monitoring-tg`,
  `monitoring-pf` or a future Portal module.
- `owner_ref` is the caller's stable object reference, for example
  `beeline/{call_id}` or `channel/{message_id}`.
- `task_type` describes the technical task class, for example
  `transcription`, `transcript_summary`, `call_analysis`,
  `telegram_classification`, `tg_analysis`, `pf_competitor_analysis`.
- `model_profile` selects a runtime profile, for example `whisper-large-v3`,
  `qwen2.5-14b`, `vision`, or a future provider profile.
- `input` and `result` are JSON payloads owned by the caller and worker.

This keeps AI Service as shared infrastructure rather than a telephony-specific
service.

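As an illustration of the contract fields listed above, a caller could assemble a job request along these lines. This is a minimal sketch: the field names come from the contract, but the exact request body layout, the `build_job` helper and the example call id are assumptions, not a confirmed schema.

```python
import json


def build_job(owner_service, owner_ref, task_type, model_profile, input_payload):
    """Return a JSON-serializable job request (hypothetical body layout)."""
    return {
        "owner_service": owner_service,
        "owner_ref": owner_ref,
        "task_type": task_type,
        "model_profile": model_profile,
        # `input` is opaque to AI Service: it is owned by the caller and worker.
        "input": input_payload,
    }


job = build_job(
    owner_service="telephony",
    owner_ref="beeline/12345",  # illustrative call id, not a real reference
    task_type="transcript_summary",
    model_profile="qwen2.5-14b",
    input_payload={"system": "Summarize the call.", "user": "<transcript text>"},
)
print(json.dumps(job, indent=2))
```

Because nothing in the payload is telephony-specific, the same helper would serve `monitoring-tg` or `monitoring-pf` by changing only the argument values.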
## Built-in workers

The LLM worker processes `llm_chat`, `chat_completion`, `call_analysis`,
`transcript_summary` and `telegram_classification` jobs whose `model_profile`
equals `LLM_MODEL`.

Input can be either explicit messages:

```json
{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}
```

or compact `system` / `user` fields. The completed job result contains
`content`, `model`, `usage` and `duration_ms`.

`call_analysis` and `transcript_summary` use the same input contract as
`llm_chat`; callers may include domain metadata fields in `input`, but the
worker only reads chat fields such as `system`, `user`, `messages`,
`max_tokens` and `response_format`.

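The two accepted input forms can be reduced to a single messages list before calling the provider. The sketch below shows one way to do that; `normalize_chat_input` is a hypothetical helper, and giving `messages` precedence over the compact fields is an assumption rather than documented worker behaviour.

```python
def normalize_chat_input(payload):
    """Return a chat messages list from either input form (hypothetical helper)."""
    messages = payload.get("messages")
    if messages:
        # Assumption: explicit messages win over compact fields.
        return list(messages)

    result = []
    if payload.get("system"):
        result.append({"role": "system", "content": payload["system"]})
    if payload.get("user"):
        result.append({"role": "user", "content": payload["user"]})
    if not result:
        raise ValueError("input needs `messages` or `system` / `user` fields")
    return result


compact = {
    "system": "Answer as JSON.",
    "user": "Classify this text",
    "call_id": "12345",  # domain metadata: carried in `input`, never read here
}
print(normalize_chat_input(compact))
```

Domain metadata such as `call_id` passes through untouched, which matches the rule that the worker reads only chat fields.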
`transcription` jobs are processed only by Whisper Large v3
(`openai/whisper-large-v3`) through an OpenAI-compatible
`/v1/audio/transcriptions` endpoint. The returned `segments` field stays
compatible with telephony. If the provider returns one long segment, AI Service
splits it into smaller transcript segments without inventing speaker labels.

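The actual splitting rule is not described here, so the following is only a sketch of one plausible approach: break the text at sentence boundaries and share the original time range in proportion to text length. The `start` / `end` / `text` segment shape, the `split_segment` helper and the `max_chars` threshold are all assumptions.

```python
import re


def split_segment(segment, max_chars=200):
    """Split one long segment into smaller ones; no speaker field is added."""
    text = segment["text"].strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

    # Greedily pack sentences into chunks of at most max_chars characters.
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    if len(chunks) <= 1:
        return [dict(segment)]

    # Share the original time range proportionally to chunk length.
    total = sum(len(chunk) for chunk in chunks)
    duration = segment["end"] - segment["start"]
    parts, cursor = [], segment["start"]
    for index, chunk in enumerate(chunks):
        is_last = index == len(chunks) - 1
        end = segment["end"] if is_last else cursor + duration * len(chunk) / total
        parts.append({"start": cursor, "end": end, "text": chunk})
        cursor = end
    return parts


long_segment = {"start": 0.0, "end": 10.0, "text": "One two. Three four. Five six."}
parts = split_segment(long_segment, max_chars=10)
print(len(parts))
```

The timestamps produced this way are estimates; they stay contiguous and cover the original range, but they are not aligned to the audio.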
The AI-server compose snippet for Whisper Large v3 lives in
`deploy/ai-server/docker-compose.audio.yml`:

- Whisper endpoint: `http://10.2.3.5:8004`
- Start Whisper:
  `docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile whisper-large-v3 up -d whisper-large-v3`

In Kubernetes, the dedicated transcription worker may claim more than one
`whisper-large-v3` job at a time. This keeps download/upload/wait overhead from
serializing the queue, while the Whisper provider still controls the actual GPU
scheduling.

## API

- `POST /api/v1/jobs` creates one job.
- `GET /api/v1/jobs` lists jobs with query filters.
- `POST /api/v1/jobs/batch` creates many jobs with shared defaults.
- `POST /api/v1/jobs/retry` retries failed/running jobs by filter.
- `POST /api/v1/jobs/cancel` cancels pending/running jobs by filter.
- `POST /api/v1/jobs/claim` atomically claims pending jobs for a worker.
- `GET /api/v1/jobs/{id}` returns technical job state and result.
- `POST /api/v1/jobs/{id}/complete` stores a successful job result.
- `POST /api/v1/jobs/{id}/fail` stores a failed job category and message.
- `POST /api/v1/jobs/{id}/retry` resets failed/running jobs to `pending`.
- `GET /api/v1/stats` returns queue and error counters.
- `GET /api/v1/providers/status` checks configured AI providers without
  returning secrets.
- `GET /api/v1/infra/status` returns AI-server sidecar telemetry
  (GPU, containers and vLLM live metrics) when configured.
- `GET /health/detail` returns PostgreSQL, provider, queue, error, throughput
  and infra components for Portal `admin/health`.
- `GET /healthz` returns process health.
- `GET /readyz` checks PostgreSQL readiness.
- Built-in workers expose open Kubernetes endpoints on `WORKER_HTTP_PORT`:
  `GET /healthz`, `GET /readyz` and `GET /worker/status`.

All `/api/v1/*` endpoints require `Authorization: Bearer <AI_SERVICE_TOKEN>`
when `AI_SERVICE_TOKEN` is configured. Health and readiness endpoints stay open
for Kubernetes probes.

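The authentication rule above can be sketched with the standard library as follows. The `build_request` helper, the `ai-service` host name and the token value are illustrative assumptions; only the paths and the bearer header format come from this document. The example builds requests without sending them.

```python
import json
import urllib.request


def build_request(base_url, path, token=None, body=None):
    """Build a request; add the bearer header only when a token is given."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if body is not None else "GET"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Hypothetical in-cluster address; 8080 is the documented HTTP_PORT default.
base = "http://ai-service:8080"
stats_request = build_request(base, "/api/v1/stats", token="example-token")
probe_request = build_request(base, "/healthz")  # probes need no token
print(stats_request.get_method(), stats_request.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the call against a running instance.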
## Retry policy

Workers store a normalized `error_code` on failed jobs. AI Service requeues only
explicitly retryable categories while attempts remain.

| Category | Retry | Delay |
| --- | --- | --- |
| `provider_unavailable`, `model_unavailable`, `provider_error`, `dependency_error`, `timeout`, `storage_error`, `stale_worker` | yes | 30s |
| `bad_response`, `transcript_hallucination`, `transcript_incomplete`, `internal_error`, `unknown` | yes | 2m |
| `bad_audio`, `bad_input`, `context_length`, `unsupported_task`, `cancelled` | no | - |

Domain services may still expose manual retry for terminal errors after the
underlying data or prompt is corrected.

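The table above translates directly into a lookup. This sketch assumes hypothetical `attempts` and `max_attempts` counters for the "attempts remain" check, and it treats any code missing from the table as non-retryable, which is also an assumption.

```python
FAST_RETRY = (
    "provider_unavailable", "model_unavailable", "provider_error",
    "dependency_error", "timeout", "storage_error", "stale_worker",
)
SLOW_RETRY = (
    "bad_response", "transcript_hallucination", "transcript_incomplete",
    "internal_error", "unknown",
)

# Delay in seconds per retryable category: 30s and 2m, as in the table.
RETRY_DELAY_SECONDS = {code: 30 for code in FAST_RETRY}
RETRY_DELAY_SECONDS.update({code: 120 for code in SLOW_RETRY})


def next_retry_delay(error_code, attempts, max_attempts):
    """Return the requeue delay in seconds, or None when the job stays failed."""
    if attempts >= max_attempts:
        return None  # no attempts remain
    # Terminal categories (bad_audio, bad_input, ...) are absent from the map.
    return RETRY_DELAY_SECONDS.get(error_code)


print(next_retry_delay("timeout", attempts=1, max_attempts=3))
print(next_retry_delay("bad_audio", attempts=1, max_attempts=3))
```

Keeping the policy as data makes it easy to compare against the table when a category is added or moved.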
## Configuration

- `HTTP_HOST`, default `0.0.0.0`
- `HTTP_PORT`, default `8080`
- `DATABASE_URL`, required
- `MIGRATE_ON_START`, default `true`
- `AI_SERVICE_TOKEN`, optional bearer token for service-to-service API calls
- `LLM_BASE_URL`, primary OpenAI-compatible LLM endpoint
- `LLM_API_KEY`, primary LLM API key
- `LLM_MODEL`, default `qwen2.5-14b`
- `LLM_TIMEOUT`, default `5m`
- `AUDIO_TRANSCRIPTION_BASE_URL`, OpenAI-compatible transcription endpoint
- `AUDIO_TRANSCRIPTION_MODEL`, default `openai/whisper-large-v3`
- `AUDIO_TRANSCRIPTION_API_KEY`, optional bearer token; falls back to
  `AUDIO_LLM_API_KEY`, then `LLM_API_KEY`
- `AUDIO_TRANSCRIPTION_PROMPT`, transcription instruction
- `WORKER_ID`, default hostname
- `WORKER_HTTP_HOST`, default `0.0.0.0`
- `WORKER_HTTP_PORT`, default `8081`
- `WORKER_POLL_INTERVAL`, default `2s`
- `WORKER_CLAIM_LIMIT`, default `4`
- `WORKER_LEASE_TIMEOUT`, default `15m`

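The defaults and the API key fallback chain listed above can be illustrated with a small loader. This is a sketch, not the service's own configuration code: `load_config` and its result keys are hypothetical, only a subset of variables is read, and duration values are kept as plain strings.

```python
import os
import socket


def load_config(env=None):
    """Read a subset of settings from a mapping, applying documented defaults."""
    env = os.environ if env is None else env
    if not env.get("DATABASE_URL"):
        raise ValueError("DATABASE_URL is required")
    # Documented fallback chain for the transcription API key.
    audio_key = (
        env.get("AUDIO_TRANSCRIPTION_API_KEY")
        or env.get("AUDIO_LLM_API_KEY")
        or env.get("LLM_API_KEY")
    )
    return {
        "http_host": env.get("HTTP_HOST", "0.0.0.0"),
        "http_port": int(env.get("HTTP_PORT", "8080")),
        "database_url": env["DATABASE_URL"],
        "llm_model": env.get("LLM_MODEL", "qwen2.5-14b"),
        "llm_timeout": env.get("LLM_TIMEOUT", "5m"),  # duration kept as text
        "audio_model": env.get("AUDIO_TRANSCRIPTION_MODEL", "openai/whisper-large-v3"),
        "audio_api_key": audio_key,
        "worker_id": env.get("WORKER_ID") or socket.gethostname(),
        "worker_http_port": int(env.get("WORKER_HTTP_PORT", "8081")),
        "worker_poll_interval": env.get("WORKER_POLL_INTERVAL", "2s"),
        "worker_claim_limit": int(env.get("WORKER_CLAIM_LIMIT", "4")),
        "worker_lease_timeout": env.get("WORKER_LEASE_TIMEOUT", "15m"),
    }


# Example values are placeholders, not real credentials.
config = load_config({"DATABASE_URL": "postgres://localhost/ai", "LLM_API_KEY": "k"})
print(config["http_port"], config["audio_api_key"])
```

Accepting a mapping instead of reading `os.environ` directly keeps the loader easy to exercise in tests.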
## Current telephony pipeline

`telephony` now uses AI Service as the only AI execution path:

1. `transcription` turns call audio into segments.
2. `transcript_summary` creates a detailed Russian call summary.
3. `call_analysis` runs tags and negotiation rules against the summary.
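The three steps above form a fixed chain, which a caller could track with a small lookup. This is a sketch under assumptions: `next_task` and `profile_for` are hypothetical helpers, the profile mapping uses the documented default names, and how `telephony` actually passes results between steps is not described here.

```python
# Order of task types as listed in the pipeline above.
PIPELINE = ["transcription", "transcript_summary", "call_analysis"]

# Assumed profiles: Whisper for audio, the default LLM_MODEL for text steps.
PROFILES = {
    "transcription": "whisper-large-v3",
    "transcript_summary": "qwen2.5-14b",
    "call_analysis": "qwen2.5-14b",
}


def next_task(completed_task_type):
    """Return the task type that follows, or None after the last step."""
    index = PIPELINE.index(completed_task_type)
    if index + 1 < len(PIPELINE):
        return PIPELINE[index + 1]
    return None


def profile_for(task_type):
    """Return the model profile assumed for a pipeline step."""
    return PROFILES[task_type]


step = "transcription"
while step is not None:
    print(step, profile_for(step))
    step = next_task(step)
```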