AI Service

Technical AI job service for Portal workloads.

The first version owns only the AI job lifecycle and metrics. Business data stays in domain services such as telephony, monitoring-tg and monitoring-pf.

Generic job contract

The service is intentionally domain-agnostic:

owner_service names the caller, for example telephony, monitoring-tg, monitoring-pf or a future Portal module.
owner_ref is the caller's stable object reference, for example beeline/{call_id} or channel/{message_id}.
task_type describes the technical task class, for example transcribe, call_analysis, tg_analysis, pf_competitor_analysis.
model_profile selects a runtime profile, for example whisperx, qwen2.5-14b, vision, or a future provider profile.
input and result are JSON payloads owned by the caller and worker.

This keeps the AI service shared infrastructure rather than a telephony-specific service.
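As an illustration of the contract, the sketch below assembles a job body from the five fields listed above. The field names come from this README; the helper name build_job, the flat JSON layout and the sample call id 12345 are assumptions made for the example, not the service's confirmed request schema.

```python
import json


def build_job(owner_service, owner_ref, task_type, model_profile, input_payload):
    """Assemble a job body from the generic contract fields (hypothetical helper)."""
    fields = {
        "owner_service": owner_service,
        "owner_ref": owner_ref,
        "task_type": task_type,
        "model_profile": model_profile,
    }
    # The four routing fields are plain, non-empty strings.
    for name, value in fields.items():
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} must be a non-empty string")
    # input is an opaque JSON object owned by the caller and the worker.
    if not isinstance(input_payload, dict):
        raise ValueError("input must be a JSON object")
    return {**fields, "input": input_payload}


job = build_job(
    owner_service="telephony",
    owner_ref="beeline/12345",  # made-up call id, following beeline/{call_id}
    task_type="call_analysis",
    model_profile="qwen2.5-14b",
    input_payload={"system": "Answer as JSON.", "user": "Classify this text"},
)
print(json.dumps(job, indent=2))
```

Because nothing in the body is telephony-specific, another caller such as monitoring-tg would only change the field values.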

Built-in workers

The first built-in worker processes llm_chat, chat_completion and call_analysis jobs whose model_profile equals LLM_MODEL.

Input can be either explicit messages:

{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}

or compact system / user fields. The completed job result contains content, model, usage and duration_ms.

call_analysis uses the same input contract as llm_chat; callers may include domain metadata fields in input, but the worker only reads chat fields such as system, user, messages, max_tokens and response_format.
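A minimal sketch of how a worker might normalize both input shapes into one messages list while ignoring domain metadata. The function name normalize_chat_input is hypothetical, and the rule that explicit messages take precedence over system / user is an assumption; the README does not say what happens when both are present.

```python
CHAT_FIELDS = ("system", "user", "messages", "max_tokens", "response_format")


def normalize_chat_input(payload):
    """Return a chat request built only from the chat fields of a job input."""
    # Drop everything the worker does not read, for example domain metadata.
    chat = {key: payload[key] for key in CHAT_FIELDS if key in payload}
    system = chat.pop("system", None)
    user = chat.pop("user", None)
    messages = chat.pop("messages", None)

    if messages is None:
        # Compact form: build the messages list from system / user.
        if not user:
            raise ValueError("input needs either messages or a user field")
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": user})

    # Remaining keys are max_tokens and response_format, if given.
    return {"messages": messages, **chat}


compact = {
    "system": "Answer as JSON.",
    "user": "Classify this text",
    "max_tokens": 256,
    "call_id": "12345",  # domain metadata, ignored by the worker
}
print(normalize_chat_input(compact))
```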

Transcription jobs can run several transcription providers in order for a temporary A/B comparison. The main segments field stays compatible with telephony and contains the result of the first provider that succeeds. The full comparison is stored in attempts, one entry per provider, with provider, model, status, text, segments, duration_ms and error.
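The result shape can be sketched as follows. This is an illustration under stated assumptions: every configured provider is run even after one succeeds, the status values are "ok" and "error", and each provider is a plain callable returning text and segments. None of these details are confirmed by the README, and the fake providers stand in for real endpoints.

```python
import time


def run_comparison(providers, audio):
    """Run providers in order; return main segments plus the full attempts list."""
    attempts = []
    for name, model, transcribe in providers:
        started = time.monotonic()
        attempt = {"provider": name, "model": model, "status": "ok",
                   "text": "", "segments": [], "error": None}
        try:
            attempt["text"], attempt["segments"] = transcribe(audio)
        except Exception as exc:  # a failed provider must not stop the comparison
            attempt["status"] = "error"
            attempt["error"] = str(exc)
        attempt["duration_ms"] = int((time.monotonic() - started) * 1000)
        attempts.append(attempt)

    # The main segments field takes the first successful provider, in order.
    first_ok = next((a for a in attempts if a["status"] == "ok"), None)
    return {"segments": first_ok["segments"] if first_ok else [], "attempts": attempts}


def broken(audio):
    raise RuntimeError("endpoint not reachable")


def fake_qwen(audio):
    return "hello", [{"start": 0.0, "end": 1.0, "text": "hello"}]


def fake_voxtral(audio):
    return "hallo", [{"start": 0.0, "end": 1.1, "text": "hallo"}]


result = run_comparison(
    [("whisperx", "whisperx", broken),
     ("qwen2-audio", "Qwen/Qwen2-Audio-7B-Instruct", fake_qwen),
     ("voxtral-small", "mistralai/Voxtral-Small-24B-2507", fake_voxtral)],
    audio=b"",
)
print([a["status"] for a in result["attempts"]], result["segments"])
```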

Recommended comparison order:

whisperx
qwen2-audio (Qwen/Qwen2-Audio-7B-Instruct)
voxtral-small (mistralai/Voxtral-Small-24B-2507)

Qwen2-Audio and Voxtral are called through an OpenAI-compatible /v1/chat/completions endpoint with vLLM-style audio_url data URLs; set their endpoint URLs only after the models are actually exposed on the AI server.
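For orientation, a request body of this kind might be built as shown below. The audio_url content part with a base64 data URL follows the vLLM-style convention mentioned above; the helper name, the prompt text and the audio/wav MIME type are assumptions for the sketch, and the service's actual request may differ.

```python
import base64


def build_audio_request(model, audio_bytes, prompt, max_tokens=4096, mime="audio/wav"):
    """Build a /v1/chat/completions body that carries audio as a data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "audio_url", "audio_url": {"url": data_url}},
                ],
            }
        ],
    }


body = build_audio_request(
    model="Qwen/Qwen2-Audio-7B-Instruct",
    audio_bytes=b"RIFF....WAVE",  # placeholder bytes, not a real recording
    prompt="Transcribe this call verbatim.",  # hypothetical prompt text
)
print(body["messages"][0]["content"][1]["audio_url"]["url"][:40])
```

The same body shape would be sent to the Voxtral endpoint with only the model name changed.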

AI-server compose snippets for these temporary comparison endpoints live in deploy/ai-server/docker-compose.audio.yml. They are profile-gated because the single GPU cannot keep the production text vLLM, two WhisperX instances, Qwen2-Audio and Voxtral loaded at the same time:

Qwen2-Audio endpoint: http://10.2.3.5:8003
Voxtral endpoint: http://10.2.3.5:8004
Start Qwen only: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile qwen-audio up -d qwen-audio
Start Voxtral only: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile voxtral-small up -d voxtral-small

API

POST /api/v1/jobs creates one job.
GET /api/v1/jobs lists jobs with query filters.
POST /api/v1/jobs/batch creates many jobs with shared defaults.
POST /api/v1/jobs/retry retries failed/running jobs by filter.
POST /api/v1/jobs/cancel cancels pending/running jobs by filter.
POST /api/v1/jobs/claim atomically claims pending jobs for a worker.
GET /api/v1/jobs/{id} returns technical job state and result.
POST /api/v1/jobs/{id}/complete stores a successful job result.
POST /api/v1/jobs/{id}/fail stores a failed job category and message.
POST /api/v1/jobs/{id}/retry resets failed/running jobs to pending.
GET /api/v1/stats returns queue and error counters.
GET /api/v1/providers/status checks configured AI providers without returning secrets.
GET /api/v1/infra/status returns AI-server sidecar telemetry (GPU, containers, vLLM and WhisperX live metrics) when configured.
GET /healthz returns process health.
GET /readyz checks PostgreSQL readiness.
Built-in workers expose unauthenticated endpoints for Kubernetes on WORKER_HTTP_PORT: GET /healthz, GET /readyz and GET /worker/status.
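The job endpoints above imply a small state machine, sketched here. The states pending, running and failed appear in the endpoint descriptions; the names completed and cancelled, and the rule that only running jobs can be completed or failed, are assumptions made for the illustration.

```python
# action -> (states the action accepts, resulting state)
TRANSITIONS = {
    "claim": ({"pending"}, "running"),
    "complete": ({"running"}, "completed"),   # "completed" is an assumed name
    "fail": ({"running"}, "failed"),
    "retry": ({"failed", "running"}, "pending"),
    "cancel": ({"pending", "running"}, "cancelled"),  # "cancelled" is assumed
}


def apply_action(state, action):
    """Return the next job state, or raise if the action is not allowed."""
    allowed, target = TRANSITIONS[action]
    if state not in allowed:
        raise ValueError(f"cannot {action} a job in state {state}")
    return target


state = "pending"
for action in ("claim", "fail", "retry", "claim", "complete"):
    state = apply_action(state, action)
print(state)
```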

All /api/v1/* endpoints require Authorization: Bearer <AI_SERVICE_TOKEN> when AI_SERVICE_TOKEN is configured. Health and readiness endpoints stay open for Kubernetes probes.
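The rule can be expressed as a small check. This is a sketch of the behavior described here, not the service's actual middleware; the function name is hypothetical and the constant-time comparison is a design choice added for the example.

```python
import hmac


def is_authorized(path, authorization_header, token):
    """Apply the bearer rule: only /api/v1/* is protected, and only if a token is set."""
    if not path.startswith("/api/v1/") or not token:
        # Health and readiness endpoints stay open; so does everything
        # when AI_SERVICE_TOKEN is not configured.
        return True
    if not authorization_header:
        return False
    expected = f"Bearer {token}"
    # Compare as bytes in constant time to avoid leaking the token via timing.
    return hmac.compare_digest(authorization_header.encode(), expected.encode())


print(is_authorized("/api/v1/jobs", "Bearer s3cret", "s3cret"))
print(is_authorized("/api/v1/jobs", None, "s3cret"))
print(is_authorized("/healthz", None, "s3cret"))
```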

Configuration

HTTP_HOST, default 0.0.0.0
HTTP_PORT, default 8080
DATABASE_URL, required
MIGRATE_ON_START, default true
AI_SERVICE_TOKEN, optional bearer token for service-to-service API calls
LLM_BASE_URL, primary OpenAI-compatible LLM endpoint
LLM_API_KEY, primary LLM API key
LLM_MODEL, default qwen2.5-14b
LLM_TIMEOUT, default 5m
TRANSCRIPTION_PROVIDERS, comma-separated ordered list of transcription providers, default whisperx; for example whisperx,qwen2-audio,voxtral-small
WHISPERX_URL, WhisperX endpoint for transcription jobs
QWEN_AUDIO_BASE_URL, OpenAI-compatible endpoint for Qwen2-Audio
QWEN_AUDIO_MODEL, default Qwen/Qwen2-Audio-7B-Instruct
QWEN_AUDIO_API_KEY, optional bearer token for Qwen2-Audio; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
VOXTRAL_BASE_URL, OpenAI-compatible endpoint for Voxtral
VOXTRAL_MODEL, default mistralai/Voxtral-Small-24B-2507
VOXTRAL_API_KEY, optional bearer token for Voxtral; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
AUDIO_LLM_PROMPT, transcription instruction for audio LLM providers
AUDIO_LLM_MAX_TOKENS, default 4096
WORKER_ID, default hostname
WORKER_HTTP_HOST, default 0.0.0.0
WORKER_HTTP_PORT, default 8081
WORKER_POLL_INTERVAL, default 2s
WORKER_CLAIM_LIMIT, default 4
WORKER_LEASE_TIMEOUT, default 15m
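A few of these settings interact, so a short sketch may help: the ordered provider list, the API key fallback chain and the duration defaults. The loader below is illustrative only. It reads from a plain dict rather than the process environment, the function names are hypothetical, and the duration parser handles single-unit values such as 2s or 15m, which is an assumption about the accepted syntax.

```python
import re

UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert a single-unit duration such as '2s', '5m' or '15m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return float(int(match.group(1)) * UNIT_SECONDS[match.group(2)])


def load_config(env):
    """Read a subset of the settings above from a mapping of variables."""
    if not env.get("DATABASE_URL"):
        raise ValueError("DATABASE_URL is required")

    def audio_key(name):
        # Provider key, then AUDIO_LLM_API_KEY, then LLM_API_KEY.
        return env.get(name) or env.get("AUDIO_LLM_API_KEY") or env.get("LLM_API_KEY")

    raw = env.get("TRANSCRIPTION_PROVIDERS", "whisperx")
    return {
        "database_url": env["DATABASE_URL"],
        "llm_model": env.get("LLM_MODEL", "qwen2.5-14b"),
        "llm_timeout_s": parse_duration(env.get("LLM_TIMEOUT", "5m")),
        # Order matters: the first successful provider fills segments.
        "providers": [p.strip() for p in raw.split(",") if p.strip()],
        "qwen_audio_api_key": audio_key("QWEN_AUDIO_API_KEY"),
        "voxtral_api_key": audio_key("VOXTRAL_API_KEY"),
        "poll_interval_s": parse_duration(env.get("WORKER_POLL_INTERVAL", "2s")),
        "claim_limit": int(env.get("WORKER_CLAIM_LIMIT", "4")),
        "lease_timeout_s": parse_duration(env.get("WORKER_LEASE_TIMEOUT", "15m")),
    }


config = load_config({
    "DATABASE_URL": "postgres://example/ai",  # placeholder value
    "TRANSCRIPTION_PROVIDERS": "whisperx, qwen2-audio",
    "LLM_API_KEY": "llm-key",
    "VOXTRAL_API_KEY": "voxtral-key",
})
print(config["providers"], config["lease_timeout_s"])
```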

Next integration step

telephony should first mirror low-risk analysis jobs into this service while continuing local processing. Remote execution can then be enabled by feature flag per task type.
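One way to picture the rollout is a per-task-type decision between local, mirrored and remote processing. Everything in this sketch is hypothetical: the function, the mode names and the two sets are invented for illustration, and the README does not define how telephony will store its flags.

```python
def execution_plan(task_type, mirrored, remote_enabled):
    """Decide where a telephony task runs during the staged rollout."""
    if task_type in remote_enabled:
        # Feature flag on: the AI service result is authoritative.
        return {"run_local": False, "send_to_ai_service": True, "use_result": "remote"}
    if task_type in mirrored:
        # Mirror phase: the job is copied to the AI service, local result is kept.
        return {"run_local": True, "send_to_ai_service": True, "use_result": "local"}
    # Not part of the rollout yet.
    return {"run_local": True, "send_to_ai_service": False, "use_result": "local"}


mirrored = {"call_analysis"}   # assumed low-risk task types
remote_enabled = set()         # nothing switched over yet
print(execution_plan("call_analysis", mirrored, remote_enabled))
print(execution_plan("transcribe", mirrored, remote_enabled))
```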
