The first version owns only the AI job lifecycle and metrics. Business data stays in domain services such as telephony, monitoring-tg and monitoring-pf.

Generic job contract

The service is intentionally domain-agnostic:

owner_service names the caller, for example telephony, monitoring-tg, monitoring-pf or a future Portal module.
owner_ref is the caller's stable object reference, for example beeline/{call_id} or channel/{message_id}.
task_type describes the technical task class, for example transcribe, call_analysis, tg_analysis, pf_competitor_analysis.
model_profile selects a runtime profile, for example whisperx, qwen2.5-14b, vision, or a future provider profile.
input and result are JSON payloads owned by the caller and worker.

This keeps the AI service as shared infrastructure rather than a telephony-specific service.
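
As an illustration of the contract above, here is a minimal sketch of how a caller might assemble a job envelope. The field names come from the list above; the JobRequest class, the to_json helper, the example call_id and the assumption that POST /api/v1/jobs accepts exactly this flat JSON body are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class JobRequest:
    """Generic job envelope; field names follow the contract above."""
    owner_service: str
    owner_ref: str
    task_type: str
    model_profile: str
    input: Dict[str, Any] = field(default_factory=dict)


def to_json(job: JobRequest) -> str:
    # Assumption: the API accepts these fields as one flat JSON object.
    return json.dumps(asdict(job), ensure_ascii=False)


job = JobRequest(
    owner_service="telephony",
    owner_ref="beeline/12345",  # hypothetical call_id
    task_type="call_analysis",
    model_profile="qwen2.5-14b",
    input={"system": "Answer as JSON.", "user": "Classify this text"},
)
body = to_json(job)
```

Nothing in the envelope is telephony-specific: another caller would change only owner_service, owner_ref and the payload.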

Built-in workers

The first built-in worker processes llm_chat, chat_completion and call_analysis jobs whose model_profile equals LLM_MODEL.

Input can be either explicit messages:

{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}

or compact system / user fields. The completed job result contains content, model, usage and duration_ms.

call_analysis uses the same input contract as llm_chat; callers may include domain metadata fields in input, but the worker only reads chat fields such as system, user, messages, max_tokens and response_format.
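
The input handling described above could look roughly like the sketch below. It is not the worker's actual code: the build_chat_request name is hypothetical, and letting explicit messages take precedence over the compact system / user fields is an assumption.

```python
from typing import Any, Dict, List

# Optional chat fields the worker is described as reading.
PASSTHROUGH_FIELDS = ("max_tokens", "response_format")


def build_chat_request(job_input: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a job input to chat fields, ignoring domain metadata."""
    messages: List[Dict[str, str]] = list(job_input.get("messages") or [])
    if not messages:
        # Compact form: build messages from the system / user fields.
        if job_input.get("system"):
            messages.append({"role": "system", "content": job_input["system"]})
        if job_input.get("user"):
            messages.append({"role": "user", "content": job_input["user"]})
    if not messages:
        raise ValueError("input needs messages or system/user fields")
    request: Dict[str, Any] = {"messages": messages}
    for key in PASSTHROUGH_FIELDS:
        if key in job_input:
            request[key] = job_input[key]
    return request
```

With this shape, a call_analysis caller can attach fields such as a call identifier to input without affecting the prompt, because unknown keys are dropped before the LLM request is built.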

transcription jobs can run several transcription providers in order for temporary A/B comparison. The main segments field remains compatible with telephony and contains the first successful provider result. The full comparison is stored in attempts with provider, model, status, text, segments, duration_ms and error.
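
A minimal sketch of that comparison flow, assuming every configured provider is attempted even after one has succeeded. The run_comparison function, the provider callable signature and the "ok" / "error" status values are hypothetical; only the attempt field names come from the description above.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# A provider takes raw audio and returns {"model", "text", "segments"}.
Provider = Callable[[bytes], Dict[str, Any]]


def run_comparison(audio: bytes, providers: List[Tuple[str, Provider]]) -> Dict[str, Any]:
    attempts: List[Dict[str, Any]] = []
    segments = None
    have_primary = False
    for name, call in providers:
        started = time.monotonic()
        attempt: Dict[str, Any] = {
            "provider": name, "model": None, "status": "ok",
            "text": None, "segments": None, "duration_ms": 0, "error": None,
        }
        try:
            out = call(audio)
            attempt.update(model=out.get("model"), text=out.get("text"),
                           segments=out.get("segments"))
            if not have_primary:
                # The first successful provider fills the main segments field.
                segments, have_primary = attempt["segments"], True
        except Exception as exc:  # a failed provider must not abort the others
            attempt.update(status="error", error=str(exc))
        attempt["duration_ms"] = int((time.monotonic() - started) * 1000)
        attempts.append(attempt)
    return {"segments": segments, "attempts": attempts}
```

Keeping segments separate from attempts means a consumer that only reads segments is unaffected by how many providers are compared.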

Recommended comparison order:

whisperx
qwen2-audio (Qwen/Qwen2-Audio-7B-Instruct)
voxtral-small (mistralai/Voxtral-Small-24B-2507)

Qwen2-Audio and Voxtral are called through an OpenAI-compatible /v1/chat/completions endpoint with input_audio; set their endpoint URLs only after the models are actually exposed on the AI server.

AI-server compose snippets for these temporary comparison endpoints live in deploy/ai-server/docker-compose.audio.yml. They are profile-gated because the single GPU cannot keep the production text vLLM, two WhisperX instances, Qwen2-Audio and Voxtral loaded at the same time. The endpoints and start commands are:

Qwen2-Audio endpoint: http://10.2.3.5:8003
Voxtral endpoint: http://10.2.3.5:8004
Start Qwen only: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile qwen-audio up -d qwen-audio
Start Voxtral only: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile voxtral-small up -d voxtral-small

API

POST /api/v1/jobs creates one job.
GET /api/v1/jobs lists jobs with query filters.
POST /api/v1/jobs/batch creates many jobs with shared defaults.
POST /api/v1/jobs/retry retries failed/running jobs by filter.
POST /api/v1/jobs/cancel cancels pending/running jobs by filter.
POST /api/v1/jobs/claim atomically claims pending jobs for a worker.
GET /api/v1/jobs/{id} returns technical job state and result.
POST /api/v1/jobs/{id}/complete stores a successful job result.
POST /api/v1/jobs/{id}/fail stores a failed job category and message.
POST /api/v1/jobs/{id}/retry resets failed/running jobs to pending.
GET /api/v1/stats returns queue and error counters.
GET /api/v1/providers/status checks configured AI providers without returning secrets.
GET /api/v1/infra/status returns AI-server sidecar telemetry (GPU, containers, vLLM and WhisperX live metrics) when configured.
GET /healthz returns process health.
GET /readyz checks PostgreSQL readiness.
Built-in workers expose unauthenticated endpoints for Kubernetes on WORKER_HTTP_PORT: GET /healthz, GET /readyz and GET /worker/status.
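
The job transitions implied by the claim, complete, fail, retry and cancel endpoints can be sketched as an in-memory state machine. This is only an illustration: the status names running, completed and cancelled are assumptions (the text above names only pending and failed explicitly), the Job class and helper functions are hypothetical, and the real service performs the claim atomically in its database, which a Python list does not model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    id: int
    status: str = "pending"  # assumed states: pending, running, completed, failed, cancelled
    worker_id: Optional[str] = None


def claim(jobs: List[Job], worker_id: str, limit: int) -> List[Job]:
    """Hand up to `limit` pending jobs to one worker (jobs/claim)."""
    claimed = [j for j in jobs if j.status == "pending"][:limit]
    for job in claimed:
        job.status, job.worker_id = "running", worker_id
    return claimed


def finish(job: Job, ok: bool) -> None:
    """Store a success or a failure for a running job (complete / fail)."""
    if job.status != "running":
        raise ValueError(f"job {job.id} is {job.status}, not running")
    job.status = "completed" if ok else "failed"


def retry(job: Job) -> bool:
    """Reset a failed or running job to pending (retry)."""
    if job.status not in ("failed", "running"):
        return False
    job.status, job.worker_id = "pending", None
    return True


def cancel(job: Job) -> bool:
    """Cancel a pending or running job (cancel)."""
    if job.status not in ("pending", "running"):
        return False
    job.status = "cancelled"
    return True
```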

All /api/v1/* endpoints require Authorization: Bearer <AI_SERVICE_TOKEN> when AI_SERVICE_TOKEN is configured. Health and readiness endpoints stay open for Kubernetes probes.
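
The rule above can be expressed as a small check. This is a sketch, not the service's middleware: the is_authorized function is hypothetical, and the constant-time comparison via hmac.compare_digest is a design choice added here rather than something the source states.

```python
import hmac
from typing import Optional


def is_authorized(path: str, authorization: Optional[str], token: Optional[str]) -> bool:
    """Return True if a request may proceed under the rule described above."""
    if not path.startswith("/api/v1/"):
        return True  # /healthz and /readyz stay open for probes
    if not token:
        return True  # AI_SERVICE_TOKEN not configured: no check is applied
    expected = f"Bearer {token}".encode()
    supplied = (authorization or "").encode()
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(supplied, expected)
```

A caller would then send the header as Authorization: Bearer followed by the configured token value on every /api/v1/* request.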

Configuration

HTTP_HOST, default 0.0.0.0
HTTP_PORT, default 8080
DATABASE_URL, required
MIGRATE_ON_START, default true
AI_SERVICE_TOKEN, optional bearer token for service-to-service API calls
LLM_BASE_URL, primary OpenAI-compatible LLM endpoint
LLM_API_KEY, primary LLM API key
LLM_MODEL, default qwen2.5-14b
LLM_TIMEOUT, default 5m
TRANSCRIPTION_PROVIDERS, comma-separated ordered list of transcription providers, default whisperx; for example whisperx,qwen2-audio,voxtral-small
WHISPERX_URL, WhisperX endpoint for transcription jobs
QWEN_AUDIO_BASE_URL, OpenAI-compatible endpoint for Qwen2-Audio
QWEN_AUDIO_MODEL, default Qwen/Qwen2-Audio-7B-Instruct
QWEN_AUDIO_API_KEY, optional bearer token for Qwen2-Audio; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
VOXTRAL_BASE_URL, OpenAI-compatible endpoint for Voxtral
VOXTRAL_MODEL, default mistralai/Voxtral-Small-24B-2507
VOXTRAL_API_KEY, optional bearer token for Voxtral; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
AUDIO_LLM_PROMPT, transcription instruction for audio LLM providers
AUDIO_LLM_MAX_TOKENS, default 4096
WORKER_ID, default hostname
WORKER_HTTP_HOST, default 0.0.0.0
WORKER_HTTP_PORT, default 8081
WORKER_POLL_INTERVAL, default 2s
WORKER_CLAIM_LIMIT, default 4
WORKER_LEASE_TIMEOUT, default 15m
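
Two of the settings above carry resolution rules: the audio provider API keys fall back through a chain, and TRANSCRIPTION_PROVIDERS is an ordered comma-separated list. A sketch of both, under the assumption that empty or whitespace-only values count as unset (the helper names are hypothetical):

```python
from typing import List, Mapping, Optional


def provider_api_key(env: Mapping[str, str], specific: str) -> Optional[str]:
    """Resolve a key: provider-specific, then AUDIO_LLM_API_KEY, then LLM_API_KEY."""
    for name in (specific, "AUDIO_LLM_API_KEY", "LLM_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def transcription_providers(env: Mapping[str, str]) -> List[str]:
    """Parse the ordered provider list, defaulting to whisperx."""
    raw = env.get("TRANSCRIPTION_PROVIDERS", "whisperx")
    names = [part.strip() for part in raw.split(",") if part.strip()]
    return names or ["whisperx"]
```

In a real process both helpers would be called with os.environ, for example provider_api_key(os.environ, "VOXTRAL_API_KEY").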

Next integration step

telephony should first mirror low-risk analysis jobs into this service while continuing local processing. Remote execution can then be enabled by feature flag per task type.
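
One way the caller side of this rollout might be modelled is shown below. It is entirely hypothetical: the Rollout class, the plan function and the flag sets are illustrative names, not part of telephony or of this service.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Rollout:
    mirrored: FrozenSet[str] = frozenset()  # task types copied to the AI service
    remote: FrozenSet[str] = frozenset()    # task types executed remotely


def plan(task_type: str, rollout: Rollout) -> Dict[str, object]:
    """Decide where a job runs and which result the caller trusts."""
    use_remote = task_type in rollout.remote
    return {
        "run_local": not use_remote,
        "submit_to_ai_service": use_remote or task_type in rollout.mirrored,
        "result_source": "ai-service" if use_remote else "local",
    }
```

During the mirroring phase a task type appears only in mirrored, so the local result stays authoritative while the AI service collects comparable job data. Moving it to remote switches execution without touching other task types.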