ai-service/README.md
Grendgi e074f6b226
Run Voxtral transcription worker with two jobs
2026-06-09 17:16:24 +03:00

AI Service

Technical AI job service for Portal workloads.

The first version owns only the AI job lifecycle and metrics. Business data stays in domain services such as telephony, monitoring-tg and monitoring-pf.

Generic job contract

The service is intentionally domain-agnostic:

  • owner_service names the caller, for example telephony, monitoring-tg, monitoring-pf or a future Portal module.
  • owner_ref is the caller's stable object reference, for example beeline/{call_id} or channel/{message_id}.
  • task_type describes the technical task class, for example transcribe, call_analysis, tg_analysis, pf_competitor_analysis.
  • model_profile selects a runtime profile, for example voxtral-small, qwen2.5-14b, vision, or a future provider profile.
  • input and result are JSON payloads owned by the caller and worker.

This keeps AI Service shared infrastructure rather than a telephony-specific service.
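
As an illustration of the contract, the sketch below assembles a job body from the five fields listed above. It assumes the fields are top-level JSON keys of the request body; the exact schema accepted by POST /api/v1/jobs is not shown in this README, and build_job and the call id 12345 are hypothetical.

```python
import json


def build_job(owner_service, owner_ref, task_type, model_profile, input_payload):
    """Assemble a job body from the contract fields (assumed to be top-level keys)."""
    fields = {
        "owner_service": owner_service,
        "owner_ref": owner_ref,
        "task_type": task_type,
        "model_profile": model_profile,
    }
    for name, value in fields.items():
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} must be a non-empty string")
    if not isinstance(input_payload, dict):
        raise ValueError("input must be a JSON object")
    # The caller owns the shape of "input"; it is passed through untouched.
    fields["input"] = input_payload
    return fields


job = build_job(
    "telephony",
    "beeline/12345",  # hypothetical call id
    "call_analysis",
    "qwen2.5-14b",
    {"system": "Answer as JSON.", "user": "Classify this text"},
)
print(json.dumps(job, indent=2))
```

Nothing in the body is telephony-specific: another caller would change only the values, not the shape.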

Built-in workers

The first built-in worker processes llm_chat, chat_completion and call_analysis jobs whose model_profile equals LLM_MODEL.

Input can be either explicit messages:

{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}

or compact system / user fields. The completed job result contains content, model, usage and duration_ms.
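
One way to treat the two input forms uniformly is to normalize compact fields into a message list before calling the model. The sketch below is an assumption about how that could be done, not the worker's actual code; to_messages is a hypothetical helper, and giving explicit messages precedence over system / user is a guess.

```python
def to_messages(job_input):
    """Return an OpenAI-style message list from either input form."""
    messages = job_input.get("messages")
    if messages:
        # Explicit messages win (assumed precedence).
        return list(messages)
    result = []
    if job_input.get("system"):
        result.append({"role": "system", "content": job_input["system"]})
    if job_input.get("user"):
        result.append({"role": "user", "content": job_input["user"]})
    if not result:
        raise ValueError("input needs messages, or a system / user field")
    return result


compact = {"system": "Answer as JSON.", "user": "Classify this text", "max_tokens": 256}
print(to_messages(compact))
```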

call_analysis uses the same input contract as llm_chat; callers may include domain metadata fields in input, but the worker only reads chat fields such as system, user, messages, max_tokens and response_format.

transcription jobs are processed only by Voxtral Small (mistralai/Voxtral-Small-24B-2507) through an OpenAI-compatible /v1/audio/transcriptions endpoint. The returned segments field stays compatible with telephony. If the provider returns one long segment, AI Service splits it into smaller transcript segments and adds heuristic speaker labels when diarization is requested.
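
The splitting step could look roughly like the sketch below. It is not the service's implementation: it assumes segments are objects with start and end in seconds plus a text field, packs sentences into chunks of at most max_chars characters, and spreads the original time range across the chunks in proportion to their length. Speaker labelling is left out because the heuristic is not described here.

```python
import re


def split_segment(segment, max_chars=200):
    """Split one long segment into smaller ones with interpolated timestamps."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", segment["text"].strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    total = sum(len(c) for c in chunks)
    if total == 0:
        return [dict(segment)]
    duration = segment["end"] - segment["start"]
    parts, cursor, consumed = [], segment["start"], 0
    for chunk in chunks:
        consumed += len(chunk)
        # Allocate time in proportion to the characters consumed so far.
        end = segment["start"] + duration * consumed / total
        parts.append({"start": cursor, "end": end, "text": chunk})
        cursor = end
    return parts


long_segment = {"start": 0.0, "end": 10.0, "text": "Hello there. How are you? Fine."}
for part in split_segment(long_segment, max_chars=15):
    print(part)
```

Character-proportional timing is only an approximation; word-level timestamps from the provider, where available, would be more accurate.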

The AI-server compose snippet for Voxtral lives in deploy/ai-server/docker-compose.audio.yml:

  • Voxtral endpoint: http://10.2.3.5:8004
  • Start Voxtral: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile voxtral-small up -d voxtral-small

In Kubernetes the dedicated transcription worker may claim more than one voxtral-small job at a time. This keeps download/upload/wait overhead from serializing the queue while Voxtral/vLLM still controls the actual GPU scheduling.
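
The idea of overlapping I/O for several claimed jobs can be sketched with a small thread pool. This is an illustration under stated assumptions, not the worker's code: process_claimed and its handle, complete and fail callables are hypothetical stand-ins for the transcription call and the /complete and /fail endpoints, and each job is assumed to carry an id key.

```python
from concurrent.futures import ThreadPoolExecutor


def process_claimed(jobs, handle, complete, fail, max_workers=2):
    """Run claimed jobs concurrently and report each outcome exactly once."""

    def run(job):
        try:
            result = handle(job)  # download audio, call Voxtral, wait for the reply
        except Exception as exc:
            fail(job["id"], str(exc))
            return False
        complete(job["id"], result)
        return True

    # Threads overlap network waits; the GPU side is still scheduled by vLLM.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, jobs))


outcomes = process_claimed(
    [{"id": 1}, {"id": 2}],
    handle=lambda job: {"segments": []},
    complete=lambda job_id, result: print("completed", job_id),
    fail=lambda job_id, message: print("failed", job_id, message),
)
print(outcomes)
```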

API

  • POST /api/v1/jobs creates one job.
  • GET /api/v1/jobs lists jobs with query filters.
  • POST /api/v1/jobs/batch creates many jobs with shared defaults.
  • POST /api/v1/jobs/retry retries failed/running jobs by filter.
  • POST /api/v1/jobs/cancel cancels pending/running jobs by filter.
  • POST /api/v1/jobs/claim atomically claims pending jobs for a worker.
  • GET /api/v1/jobs/{id} returns technical job state and result.
  • POST /api/v1/jobs/{id}/complete stores a successful job result.
  • POST /api/v1/jobs/{id}/fail stores a failed job category and message.
  • POST /api/v1/jobs/{id}/retry resets failed/running jobs to pending.
  • GET /api/v1/stats returns queue and error counters.
  • GET /api/v1/providers/status checks configured AI providers without returning secrets.
  • GET /api/v1/infra/status returns AI-server sidecar telemetry (GPU, containers and vLLM live metrics) when configured.
  • GET /healthz returns process health.
  • GET /readyz checks PostgreSQL readiness.
  • Built-in workers expose unauthenticated endpoints for Kubernetes on WORKER_HTTP_PORT: GET /healthz, GET /readyz and GET /worker/status.

All /api/v1/* endpoints require Authorization: Bearer <AI_SERVICE_TOKEN> when AI_SERVICE_TOKEN is configured. Health and readiness endpoints stay open for Kubernetes probes.
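
A client can apply this rule when building requests. The following sketch uses only the Python standard library and does not send anything; build_request is a hypothetical helper, and the base URL http://localhost:8080 is an assumption derived from the default HTTP_PORT.

```python
import json
import urllib.request


def build_request(base_url, path, token=None, body=None):
    """Build a request; attach the bearer token only for /api/v1/* paths."""
    headers = {"Accept": "application/json"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if token and path.startswith("/api/v1/"):
        headers["Authorization"] = f"Bearer {token}"
    method = "POST" if body is not None else "GET"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


stats = build_request("http://localhost:8080", "/api/v1/stats", token="secret")
probe = build_request("http://localhost:8080", "/healthz", token="secret")
print(stats.get_method(), stats.full_url, stats.get_header("Authorization"))
print(probe.get_method(), probe.full_url, probe.get_header("Authorization"))
```

Passing the resulting object to urllib.request.urlopen would perform the call; the token is never added to probe requests, which matches the open health endpoints.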

Configuration

  • HTTP_HOST, default 0.0.0.0
  • HTTP_PORT, default 8080
  • DATABASE_URL, required
  • MIGRATE_ON_START, default true
  • AI_SERVICE_TOKEN, optional bearer token for service-to-service API calls
  • LLM_BASE_URL, primary OpenAI-compatible LLM endpoint
  • LLM_API_KEY, primary LLM API key
  • LLM_MODEL, default qwen2.5-14b
  • LLM_TIMEOUT, default 5m
  • VOXTRAL_BASE_URL, OpenAI-compatible endpoint for Voxtral
  • VOXTRAL_MODEL, default mistralai/Voxtral-Small-24B-2507
  • VOXTRAL_API_KEY, optional bearer token for Voxtral; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
  • AUDIO_LLM_PROMPT, transcription instruction for Voxtral
  • WORKER_ID, default hostname
  • WORKER_HTTP_HOST, default 0.0.0.0
  • WORKER_HTTP_PORT, default 8081
  • WORKER_POLL_INTERVAL, default 2s
  • WORKER_CLAIM_LIMIT, default 4
  • WORKER_LEASE_TIMEOUT, default 15m
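
The defaults and the Voxtral key fallback above can be summarized in a small loader. This is a sketch, not the service's configuration code: load_config and parse_duration are hypothetical, only a subset of the variables is covered, and durations are assumed to use suffixes such as 2s, 5m and 15m.

```python
import re
import socket

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse strings such as '2s', '5m' or '1h30m' into seconds (assumed format)."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def load_config(env):
    """Read settings from a mapping such as os.environ, applying documented defaults."""
    if not env.get("DATABASE_URL"):
        raise ValueError("DATABASE_URL is required")
    return {
        "http_host": env.get("HTTP_HOST", "0.0.0.0"),
        "http_port": int(env.get("HTTP_PORT", "8080")),
        "database_url": env["DATABASE_URL"],
        "llm_model": env.get("LLM_MODEL", "qwen2.5-14b"),
        "llm_timeout": parse_duration(env.get("LLM_TIMEOUT", "5m")),
        # Fallback chain: VOXTRAL_API_KEY, then AUDIO_LLM_API_KEY, then LLM_API_KEY.
        "voxtral_api_key": env.get("VOXTRAL_API_KEY")
        or env.get("AUDIO_LLM_API_KEY")
        or env.get("LLM_API_KEY"),
        "worker_id": env.get("WORKER_ID") or socket.gethostname(),
        "worker_http_port": int(env.get("WORKER_HTTP_PORT", "8081")),
        "worker_poll_interval": parse_duration(env.get("WORKER_POLL_INTERVAL", "2s")),
        "worker_claim_limit": int(env.get("WORKER_CLAIM_LIMIT", "4")),
        "worker_lease_timeout": parse_duration(env.get("WORKER_LEASE_TIMEOUT", "15m")),
    }


config = load_config({"DATABASE_URL": "postgres://example", "LLM_API_KEY": "k"})
print(config["llm_timeout"], config["worker_lease_timeout"], config["voxtral_api_key"])
```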

Next integration step

telephony should first mirror low-risk analysis jobs into this service while continuing local processing. Remote execution can then be enabled by feature flag per task type.
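
The rollout can be pictured as a per-task-type routing decision on the telephony side. The sketch below is hypothetical: plan_execution and the mirrored and remote sets stand in for whatever feature-flag mechanism telephony ends up using.

```python
def plan_execution(task_type, mirrored, remote):
    """Decide where a task runs during the staged rollout."""
    if task_type in remote:
        # Flag enabled: AI Service executes the job, local processing is skipped.
        return ("remote",)
    if task_type in mirrored:
        # Mirror phase: local processing stays authoritative, AI Service gets a copy.
        return ("local", "mirror")
    return ("local",)


mirrored = {"call_analysis"}
remote = set()
print(plan_execution("call_analysis", mirrored, remote))
print(plan_execution("transcribe", mirrored, remote))
```

Moving a task type from mirrored to remote is then a configuration change rather than a code change.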