AI Service
Technical AI job service for Portal workloads.
The first version owns only AI job lifecycle and metrics. Business data stays in
domain services such as telephony, monitoring-tg and monitoring-pf.
Generic job contract
The service is intentionally domain-agnostic:
- owner_service names the caller, for example telephony, monitoring-tg, monitoring-pf or a future Portal module.
- owner_ref is the caller's stable object reference, for example beeline/{call_id} or channel/{message_id}.
- task_type describes the technical task class, for example transcribe, call_analysis, tg_analysis, pf_competitor_analysis.
- model_profile selects a runtime profile, for example whisperx, qwen2.5-14b, vision, or a future provider profile.
- input and result are JSON payloads owned by the caller and worker.
This keeps AI service as shared infrastructure rather than a telephony-specific service.
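As an illustration of the contract above, here is a minimal sketch of how a caller might build a job-creation request. The flat JSON body layout, the base URL `http://ai-service:8080`, the call id `12345` and the helper name `build_create_job_request` are assumptions for illustration; only the field names and the `POST /api/v1/jobs` path come from this document.

```python
import json
import urllib.request


def build_create_job_request(base_url, token=None):
    """Build (but do not send) a POST /api/v1/jobs request.

    The body layout is an assumption: one flat JSON object that carries
    the generic contract fields described in this README.
    """
    payload = {
        "owner_service": "telephony",
        "owner_ref": "beeline/12345",  # hypothetical call_id
        "task_type": "call_analysis",
        "model_profile": "qwen2.5-14b",
        "input": {"system": "Answer as JSON.", "user": "Classify this text"},
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when AI_SERVICE_TOKEN is configured on the service.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"{base_url}/api/v1/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical host and token, for illustration only.
req = build_create_job_request("http://ai-service:8080", token="example-token")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops before that so it can be read without a running service.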
Built-in workers
The first built-in worker processes llm_chat, chat_completion and
call_analysis jobs whose model_profile equals LLM_MODEL.
Input can be either explicit messages:
{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}
or compact system / user fields. The completed job result contains
content, model, usage and duration_ms.
call_analysis uses the same input contract as llm_chat; callers may include
domain metadata fields in input, but the worker only reads chat fields such as
system, user, messages, max_tokens and response_format.
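The input handling described above could be sketched roughly as follows. This is not the worker's actual code: the function name `normalize_chat_input`, the precedence of `messages` over the compact fields, and the error on empty input are all assumptions.

```python
def normalize_chat_input(job_input):
    """Reduce a job input to chat fields, expanding compact system/user.

    Assumption: explicit "messages" win over compact fields when both exist.
    Domain metadata (any other key) is ignored, as the README describes.
    """
    messages = job_input.get("messages")
    if not messages:
        messages = []
        if job_input.get("system"):
            messages.append({"role": "system", "content": job_input["system"]})
        if job_input.get("user"):
            messages.append({"role": "user", "content": job_input["user"]})
    if not messages:
        raise ValueError("input needs either messages or system/user fields")

    request = {"messages": list(messages)}
    # Optional chat fields are passed through unchanged.
    for key in ("max_tokens", "response_format"):
        if key in job_input:
            request[key] = job_input[key]
    return request


# "call_id" stands in for caller-owned domain metadata and is dropped.
compact = {"system": "Answer as JSON.", "user": "Classify this text",
           "max_tokens": 256, "call_id": 42}
normalized = normalize_chat_input(compact)
print(normalized)
```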
transcription jobs can run several transcription providers in order for
temporary A/B comparison. The main segments field remains compatible with
telephony and contains the first successful provider result. The full comparison
is stored in attempts with provider, model, status, text, segments,
duration_ms and error.
Recommended comparison order:
1. whisperx
2. qwen2-audio (Qwen/Qwen2-Audio-7B-Instruct)
3. voxtral-small (mistralai/Voxtral-Small-24B-2507)
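The comparison behaviour can be sketched as below. Treat it as an illustration under assumptions: the status values `ok` and `failed`, the provider callables, the model label `example-model` and the function name `run_transcription` are hypothetical. Only the field names of `attempts` and the "first successful provider fills segments" rule come from this document.

```python
import time


def run_transcription(audio, providers):
    """Run every provider in order and record one attempt per provider.

    providers: ordered list of (name, model, callable) tuples, where the
    callable returns {"text": str, "segments": list} or raises on failure.
    """
    attempts = []
    segments = None
    for name, model, transcribe in providers:
        started = time.monotonic()
        attempt = {"provider": name, "model": model, "status": "ok",
                   "text": "", "segments": [], "error": None}
        try:
            result = transcribe(audio)
            attempt["text"] = result["text"]
            attempt["segments"] = result["segments"]
            if segments is None:
                # The main field keeps the first successful result only.
                segments = attempt["segments"]
        except Exception as exc:
            attempt["status"] = "failed"
            attempt["error"] = str(exc)
        attempt["duration_ms"] = int((time.monotonic() - started) * 1000)
        attempts.append(attempt)
    return {"segments": segments or [], "attempts": attempts}


def failing_provider(audio):
    raise RuntimeError("endpoint not configured")


def stub_provider(audio):
    return {"text": "hello",
            "segments": [{"start": 0.0, "end": 1.0, "text": "hello"}]}


providers = [
    ("whisperx", "example-model", failing_provider),  # hypothetical model label
    ("qwen2-audio", "Qwen/Qwen2-Audio-7B-Instruct", stub_provider),
]
result = run_transcription(b"audio", providers)
print(result["segments"], [a["status"] for a in result["attempts"]])
```

Later providers still run after the first success, so `attempts` holds the full comparison while `segments` keeps the shape telephony already expects.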
Qwen2-Audio and Voxtral are called through an OpenAI-compatible
/v1/chat/completions endpoint with vLLM-style audio_url data URLs; set
their endpoint URLs only after the models are actually exposed on the AI server.
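For orientation, a request body for such an endpoint might look like the sketch below. It assumes the vLLM multimodal chat format, where an `audio_url` content part carries a base64 data URL. The helper name `build_audio_chat_payload`, the WAV MIME type and the prompt text are assumptions, not values taken from the service.

```python
import base64


def build_audio_chat_payload(audio_bytes, model, prompt,
                             max_tokens=4096, mime="audio/wav"):
    """Build a /v1/chat/completions body with the audio inlined as a data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio_url",
                 "audio_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
    }


# b"RIFF" is a placeholder, not a valid audio file.
payload = build_audio_chat_payload(
    b"RIFF",
    model="Qwen/Qwen2-Audio-7B-Instruct",
    prompt="Transcribe this recording.",  # hypothetical AUDIO_LLM_PROMPT value
)
print(payload["messages"][0]["content"][1]["audio_url"]["url"])
```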
AI-server compose snippets for these temporary comparison endpoints live in
deploy/ai-server/docker-compose.audio.yml. They are profile-gated because the
single GPU cannot keep the production text vLLM, two WhisperX instances,
Qwen2-Audio and Voxtral loaded at the same time:
- Qwen2-Audio endpoint: http://10.2.3.5:8003
- Voxtral endpoint: http://10.2.3.5:8004
- Start Qwen only: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile qwen-audio up -d qwen-audio
- Start Voxtral only: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile voxtral-small up -d voxtral-small
API
- POST /api/v1/jobs creates one job.
- GET /api/v1/jobs lists jobs with query filters.
- POST /api/v1/jobs/batch creates many jobs with shared defaults.
- POST /api/v1/jobs/retry retries failed/running jobs by filter.
- POST /api/v1/jobs/cancel cancels pending/running jobs by filter.
- POST /api/v1/jobs/claim atomically claims pending jobs for a worker.
- GET /api/v1/jobs/{id} returns technical job state and result.
- POST /api/v1/jobs/{id}/complete stores a successful job result.
- POST /api/v1/jobs/{id}/fail stores a failed job category and message.
- POST /api/v1/jobs/{id}/retry resets failed/running jobs to pending.
- GET /api/v1/stats returns queue and error counters.
- GET /api/v1/providers/status checks configured AI providers without returning secrets.
- GET /api/v1/infra/status returns AI-server sidecar telemetry (GPU, containers, vLLM and WhisperX live metrics) when configured.
- GET /healthz returns process health.
- GET /readyz checks PostgreSQL readiness.
- Built-in workers expose open Kubernetes endpoints on WORKER_HTTP_PORT: GET /healthz, GET /readyz and GET /worker/status.
All /api/v1/* endpoints require Authorization: Bearer <AI_SERVICE_TOKEN>
when AI_SERVICE_TOKEN is configured. Health and readiness endpoints stay open
for Kubernetes probes.
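The authorization rule can be summarized in a small sketch. It is not the service's middleware, only a plain restatement of the rule; the function name `is_authorized` and the constant-time comparison are assumptions.

```python
import hmac


def is_authorized(path, authorization_header, service_token):
    """Return True when a request may proceed under the rule above."""
    if not path.startswith("/api/v1/"):
        return True  # /healthz and /readyz stay open for probes
    if not service_token:
        return True  # AI_SERVICE_TOKEN not configured: API is open
    expected = f"Bearer {service_token}".encode("utf-8")
    supplied = (authorization_header or "").encode("utf-8")
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(supplied, expected)


print(is_authorized("/healthz", None, "secret"))
print(is_authorized("/api/v1/jobs", "Bearer secret", "secret"))
print(is_authorized("/api/v1/jobs", None, "secret"))
```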
Configuration
- HTTP_HOST, default 0.0.0.0
- HTTP_PORT, default 8080
- DATABASE_URL, required
- MIGRATE_ON_START, default true
- AI_SERVICE_TOKEN, optional bearer token for service-to-service API calls
- LLM_BASE_URL, primary OpenAI-compatible LLM endpoint
- LLM_API_KEY, primary LLM API key
- LLM_MODEL, default qwen2.5-14b
- LLM_TIMEOUT, default 5m
- TRANSCRIPTION_PROVIDERS, default whisperx, comma-separated ordered list: whisperx,qwen2-audio,voxtral-small
- WHISPERX_URL, WhisperX endpoint for transcription jobs
- QWEN_AUDIO_BASE_URL, OpenAI-compatible endpoint for Qwen2-Audio
- QWEN_AUDIO_MODEL, default Qwen/Qwen2-Audio-7B-Instruct
- QWEN_AUDIO_API_KEY, optional bearer token for Qwen2-Audio; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
- VOXTRAL_BASE_URL, OpenAI-compatible endpoint for Voxtral
- VOXTRAL_MODEL, default mistralai/Voxtral-Small-24B-2507
- VOXTRAL_API_KEY, optional bearer token for Voxtral; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
- AUDIO_LLM_PROMPT, transcription instruction for audio LLM providers
- AUDIO_LLM_MAX_TOKENS, default 4096
- WORKER_ID, default hostname
- WORKER_HTTP_HOST, default 0.0.0.0
- WORKER_HTTP_PORT, default 8081
- WORKER_POLL_INTERVAL, default 2s
- WORKER_CLAIM_LIMIT, default 4
- WORKER_LEASE_TIMEOUT, default 15m
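Two of the settings above have non-trivial resolution rules: the ordered provider list and the API key fallback chain. A minimal sketch of both follows. The helper names `parse_providers` and `resolve_audio_api_key` are hypothetical, and treating an empty value as unset is an assumption.

```python
def parse_providers(env):
    """Parse TRANSCRIPTION_PROVIDERS into an ordered list; default is whisperx."""
    raw = env.get("TRANSCRIPTION_PROVIDERS") or "whisperx"
    return [name.strip() for name in raw.split(",") if name.strip()]


def resolve_audio_api_key(env, provider_key):
    """Apply the fallback chain: provider key, AUDIO_LLM_API_KEY, LLM_API_KEY."""
    for name in (provider_key, "AUDIO_LLM_API_KEY", "LLM_API_KEY"):
        value = env.get(name)
        if value:  # assumption: an empty string counts as unset
            return value
    return None


# A plain dict stands in for os.environ; key values are placeholders.
env = {
    "TRANSCRIPTION_PROVIDERS": "whisperx, qwen2-audio,voxtral-small",
    "AUDIO_LLM_API_KEY": "audio-key",
    "LLM_API_KEY": "llm-key",
    "VOXTRAL_API_KEY": "voxtral-key",
}
print(parse_providers(env))
print(resolve_audio_api_key(env, "QWEN_AUDIO_API_KEY"))
print(resolve_audio_api_key(env, "VOXTRAL_API_KEY"))
```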
Next integration step
telephony should first mirror low-risk analysis jobs into this service while
continuing local processing. Remote execution can then be enabled by feature
flag per task type.
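One way to picture the rollout is a per-task-type decision like the sketch below. The flag sets, the function name `plan_execution` and the returned shape are hypothetical. The sketch only illustrates mirroring first and remote execution later, and is not a planned implementation.

```python
def plan_execution(task_type, mirrored, remote_enabled):
    """Decide where a telephony job runs during the staged rollout.

    mirrored: task types copied to AI service while still processed locally.
    remote_enabled: task types whose feature flag hands execution to AI service.
    """
    if task_type in remote_enabled:
        return {"local": False, "remote": True}
    if task_type in mirrored:
        return {"local": True, "remote": True}
    return {"local": True, "remote": False}


# Hypothetical flag state: call_analysis is mirrored, nothing is remote yet.
mirrored = {"call_analysis"}
remote_enabled = set()
print(plan_execution("call_analysis", mirrored, remote_enabled))
print(plan_execution("transcribe", mirrored, remote_enabled))
```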