# AI Service

Technical AI job service for Portal workloads.

The first version owns only AI job lifecycle and metrics. Business data stays in domain services such as `telephony`, `monitoring-tg` and `monitoring-pf`.

## Generic job contract

The service is intentionally domain-agnostic:

- `owner_service` names the caller, for example `telephony`, `monitoring-tg`, `monitoring-pf` or a future Portal module.
- `owner_ref` is the caller's stable object reference, for example `beeline/{call_id}` or `channel/{message_id}`.
- `task_type` describes the technical task class, for example `transcribe`, `call_analysis`, `tg_analysis`, `pf_competitor_analysis`.
- `model_profile` selects a runtime profile, for example `whisperx`, `qwen2.5-14b`, `vision`, or a future provider profile.
- `input` and `result` are JSON payloads owned by the caller and worker.

This keeps AI service as shared infrastructure rather than a telephony-specific service.

## Built-in workers

The first built-in worker processes `llm_chat`, `chat_completion` and `call_analysis` jobs whose `model_profile` equals `LLM_MODEL`.

Input can be either explicit messages:

```json
{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}
```

or compact `system` / `user` fields.

The completed job result contains `content`, `model`, `usage` and `duration_ms`.

`call_analysis` uses the same input contract as `llm_chat`; callers may include domain metadata fields in `input`, but the worker only reads chat fields such as `system`, `user`, `messages`, `max_tokens` and `response_format`.

`transcription` jobs can run several transcription providers in order for temporary A/B comparison. The main `segments` field remains compatible with telephony and contains the first successful provider result. The full comparison is stored in `attempts` with `provider`, `model`, `status`, `text`, `segments`, `duration_ms` and `error`.

Recommended comparison order:

1.
   `whisperx`
2. `qwen2-audio` (`Qwen/Qwen2-Audio-7B-Instruct`)
3. `voxtral-small` (`mistralai/Voxtral-Small-24B-2507`)

Qwen2-Audio and Voxtral are called through an OpenAI-compatible `/v1/chat/completions` endpoint with `input_audio`; set their endpoint URLs only after the models are actually exposed on the AI server.

## API

- `POST /api/v1/jobs` creates one job.
- `GET /api/v1/jobs` lists jobs with query filters.
- `POST /api/v1/jobs/batch` creates many jobs with shared defaults.
- `POST /api/v1/jobs/retry` retries failed/running jobs by filter.
- `POST /api/v1/jobs/cancel` cancels pending/running jobs by filter.
- `POST /api/v1/jobs/claim` atomically claims pending jobs for a worker.
- `GET /api/v1/jobs/{id}` returns technical job state and result.
- `POST /api/v1/jobs/{id}/complete` stores a successful job result.
- `POST /api/v1/jobs/{id}/fail` stores a failed job category and message.
- `POST /api/v1/jobs/{id}/retry` resets a failed/running job to `pending`.
- `GET /api/v1/stats` returns queue and error counters.
- `GET /api/v1/providers/status` checks configured AI providers without returning secrets.
- `GET /api/v1/infra/status` returns AI-server sidecar telemetry (GPU, containers, vLLM and WhisperX live metrics) when configured.
- `GET /healthz` returns process health.
- `GET /readyz` checks PostgreSQL readiness.
- Built-in workers expose open Kubernetes endpoints on `WORKER_HTTP_PORT`: `GET /healthz`, `GET /readyz` and `GET /worker/status`.

All `/api/v1/*` endpoints require `Authorization: Bearer <token>` when `AI_SERVICE_TOKEN` is configured. Health and readiness endpoints stay open for Kubernetes probes.
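As an illustration of the job contract and bearer authentication described above, the following Python sketch builds a `POST /api/v1/jobs` request without sending it. The exact request schema is not spelled out in this README, so the assumption that the contract fields are sent as top-level JSON keys, the helper name `build_create_job_request`, the host `ai-service:8080` and the token value are all hypothetical.

```python
import json
import urllib.request


def build_create_job_request(base_url, token, owner_service, owner_ref,
                             task_type, model_profile, job_input):
    """Build (but do not send) a request that creates one job.

    Assumption: the generic job contract fields are top-level JSON keys.
    """
    body = {
        "owner_service": owner_service,
        "owner_ref": owner_ref,
        "task_type": task_type,
        "model_profile": model_profile,
        "input": job_input,
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when AI_SERVICE_TOKEN is configured on the service.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_create_job_request(
    base_url="http://ai-service:8080",  # hypothetical address
    token="example-token",              # hypothetical token
    owner_service="telephony",
    owner_ref="beeline/12345",          # follows the beeline/{call_id} pattern
    task_type="llm_chat",
    model_profile="qwen2.5-14b",
    job_input={"system": "Answer as JSON.", "user": "Classify this text"},
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it. The response shape is not documented here, so callers should check it against the running service.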
## Configuration

- `HTTP_HOST`, default `0.0.0.0`
- `HTTP_PORT`, default `8080`
- `DATABASE_URL`, required
- `MIGRATE_ON_START`, default `true`
- `AI_SERVICE_TOKEN`, optional bearer token for service-to-service API calls
- `LLM_BASE_URL`, primary OpenAI-compatible LLM endpoint
- `LLM_API_KEY`, primary LLM API key
- `LLM_MODEL`, default `qwen2.5-14b`
- `LLM_TIMEOUT`, default `5m`
- `TRANSCRIPTION_PROVIDERS`, default `whisperx`, comma-separated ordered list: `whisperx,qwen2-audio,voxtral-small`
- `WHISPERX_URL`, WhisperX endpoint for transcription jobs
- `QWEN_AUDIO_BASE_URL`, OpenAI-compatible endpoint for Qwen2-Audio
- `QWEN_AUDIO_MODEL`, default `Qwen/Qwen2-Audio-7B-Instruct`
- `QWEN_AUDIO_API_KEY`, optional bearer token for Qwen2-Audio
- `VOXTRAL_BASE_URL`, OpenAI-compatible endpoint for Voxtral
- `VOXTRAL_MODEL`, default `mistralai/Voxtral-Small-24B-2507`
- `VOXTRAL_API_KEY`, optional bearer token for Voxtral
- `AUDIO_LLM_PROMPT`, transcription instruction for audio LLM providers
- `AUDIO_LLM_MAX_TOKENS`, default `4096`
- `WORKER_ID`, default hostname
- `WORKER_HTTP_HOST`, default `0.0.0.0`
- `WORKER_HTTP_PORT`, default `8081`
- `WORKER_POLL_INTERVAL`, default `2s`
- `WORKER_CLAIM_LIMIT`, default `4`
- `WORKER_LEASE_TIMEOUT`, default `15m`

## Next integration step

`telephony` should first mirror low-risk analysis jobs into this service while continuing local processing. Remote execution can then be enabled by feature flag per task type.
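One way the mirroring step could look on the caller side is sketched below in Python. This is only an assumption about how `telephony` might gate the rollout: the function names, the comma-separated flag format and the injected `run_local` / `submit_job` callables are hypothetical and not part of this service.

```python
def parse_task_types(raw):
    """Parse a comma-separated flag value into a set of task types."""
    return {item.strip() for item in raw.split(",") if item.strip()}


def handle(task_type, payload, run_local, submit_job, mirror_types, remote_types):
    """Process one task under a per-task-type rollout.

    - remote_types: the AI service result is used instead of local processing.
    - mirror_types: local processing stays authoritative and the job is also
      submitted to the AI service on a best-effort basis.
    """
    if task_type in remote_types:
        return "remote", submit_job(task_type, payload)
    result = run_local(task_type, payload)
    if task_type in mirror_types:
        try:
            submit_job(task_type, payload)  # mirror only, result is ignored
        except Exception:
            pass  # a mirroring failure must not break local processing
    return "local", result


# Stand-ins for the real local pipeline and AI service client (hypothetical).
submitted = []


def fake_submit(task_type, payload):
    submitted.append(task_type)
    return {"job": task_type}


def fake_local(task_type, payload):
    return {"local": task_type}


mirror = parse_task_types("call_analysis, tg_analysis")
remote = parse_task_types("")
mode, result = handle("call_analysis", {}, fake_local, fake_submit, mirror, remote)
print(mode, result, submitted)
```

Moving a task type from the mirror set to the remote set switches it to remote execution without touching the other task types, which matches the staged rollout described above.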