AI Service

Technical AI job service for Portal workloads.

AI Service owns the technical AI job lifecycle, provider execution and metrics. Business data stays in domain services such as telephony, monitoring-tg and monitoring-pf.

Generic job contract

The service is intentionally domain-agnostic:

owner_service names the caller, for example telephony, monitoring-tg, monitoring-pf or a future Portal module.
owner_ref is the caller's stable object reference, for example beeline/{call_id} or channel/{message_id}.
task_type describes the technical task class, for example transcription, transcript_summary, call_analysis, telegram_classification, tg_analysis, pf_competitor_analysis.
model_profile selects a runtime profile, for example whisper-large-v3, qwen2.5-14b, vision, or a future provider profile.
input and result are JSON payloads owned by the caller and worker.

This keeps AI Service shared infrastructure rather than a telephony-specific service.
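As a rough illustration, a caller could assemble a job from the contract fields like this. The helper name build_job, the exact JSON layout of the request body and the call id 12345 are assumptions made for the example; only the field names come from the contract above.

```python
import json


def build_job(owner_service: str, owner_ref: str, task_type: str,
              model_profile: str, input_payload: dict) -> dict:
    """Assemble a generic job description (hypothetical caller-side helper)."""
    fields = {
        "owner_service": owner_service,
        "owner_ref": owner_ref,
        "task_type": task_type,
        "model_profile": model_profile,
    }
    # Every identifying field must be present; the service is domain-agnostic
    # and cannot guess any of them from the payload.
    for name, value in fields.items():
        if not value:
            raise ValueError(f"{name} must be a non-empty string")
    return {**fields, "input": input_payload}


# Example: a telephony summary job (the call id is made up).
job = build_job(
    owner_service="telephony",
    owner_ref="beeline/12345",
    task_type="transcript_summary",
    model_profile="qwen2.5-14b",
    input_payload={"system": "Summarize the call.", "user": "transcript text"},
)
print(json.dumps(job, indent=2))
```

The same helper would serve monitoring-tg or monitoring-pf unchanged, since nothing in it depends on the caller's domain.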

Built-in workers

The LLM worker processes llm_chat, chat_completion, call_analysis, transcript_summary and telegram_classification jobs whose model_profile equals LLM_MODEL.

Input can be either explicit messages:

{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}

or compact system / user fields. The completed job result contains schema_version=ai.chat_result.v1, content, model, usage and duration_ms.

call_analysis and transcript_summary use the same input contract as llm_chat; callers may include domain metadata fields in input, but the worker only reads chat fields such as system, user, messages, max_tokens and response_format.
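The two input shapes can be reduced to one chat request. The sketch below is an assumption about how such normalization could work, not the worker's actual code: the function name is hypothetical, and giving messages precedence over system / user is a guess.

```python
# Chat fields that are passed through unchanged; everything else
# (for example domain metadata) is ignored.
PASSTHROUGH_FIELDS = ("max_tokens", "response_format")


def normalize_chat_input(payload: dict) -> dict:
    """Return a chat request built only from the chat fields of a job input."""
    if payload.get("messages"):
        # Assumption: explicit messages win over compact system/user fields.
        messages = list(payload["messages"])
    else:
        messages = []
        if payload.get("system"):
            messages.append({"role": "system", "content": payload["system"]})
        if payload.get("user"):
            messages.append({"role": "user", "content": payload["user"]})
    if not messages:
        raise ValueError("input needs either messages or system/user fields")

    request = {"messages": messages}
    for key in PASSTHROUGH_FIELDS:
        if key in payload:
            request[key] = payload[key]
    return request


compact = {
    "system": "Answer as JSON.",
    "user": "Classify this text",
    "max_tokens": 256,
    "call_id": 42,  # domain metadata, dropped by the normalization
}
print(normalize_chat_input(compact))
```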

transcription jobs are processed only by Whisper Large v3 (openai/whisper-large-v3) through an OpenAI-compatible /v1/audio/transcriptions endpoint. The returned segments field stays compatible with telephony. If the provider returns one long segment, AI Service splits it into smaller transcript segments without inventing speaker labels. The completed job result contains schema_version=ai.transcription_result.v1, provider, model, language, segments, optional provider attempts and duration_ms.
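One way to split a single long segment is by sentence, spreading the original time span in proportion to text length. This is only a sketch under assumptions: the segment shape {start, end, text}, the max_chars threshold and the proportional timing are illustrative choices, and the service's real splitting rules may differ. A single sentence longer than max_chars is left whole here.

```python
import re


def split_segment(segment: dict, max_chars: int = 200) -> list:
    """Split one long segment into sentence-sized segments without speaker labels."""
    text = segment["text"].strip()
    if len(text) <= max_chars:
        return [segment]

    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    total_chars = sum(len(s) for s in sentences)
    duration = segment["end"] - segment["start"]

    pieces = []
    cursor = segment["start"]
    for sentence in sentences:
        # Approximate timing: each sentence gets a share of the duration
        # proportional to its length. No speaker field is added.
        span = duration * len(sentence) / total_chars
        pieces.append({
            "start": round(cursor, 3),
            "end": round(cursor + span, 3),
            "text": sentence,
        })
        cursor += span
    return pieces


long_segment = {"start": 0.0, "end": 10.0, "text": "Hello there. How are you today?"}
for piece in split_segment(long_segment, max_chars=10):
    print(piece)
```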

The AI-server compose snippet for Whisper Large v3 lives in deploy/ai-server/docker-compose.audio.yml:

Whisper endpoint: http://10.2.3.5:8004
Start Whisper: docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile whisper-large-v3 up -d whisper-large-v3

In Kubernetes the dedicated transcription worker may claim more than one whisper-large-v3 job at a time. This keeps download/upload/wait overhead from serializing the queue while the Whisper provider still controls the actual GPU scheduling.

API

POST /api/v1/jobs creates one job.
GET /api/v1/jobs lists jobs with query filters.
POST /api/v1/jobs/batch creates many jobs with shared defaults.
POST /api/v1/jobs/retry retries failed/running jobs by filter.
POST /api/v1/jobs/cancel cancels pending/running jobs by filter.
POST /api/v1/jobs/claim atomically claims pending jobs for a worker.
GET /api/v1/jobs/{id} returns technical job state and result.
POST /api/v1/jobs/{id}/complete stores a successful job result.
POST /api/v1/jobs/{id}/fail stores a failed job category and message.
POST /api/v1/jobs/{id}/retry resets failed/running jobs to pending.
GET /api/v1/stats returns queue and error counters.
GET /api/v1/providers/status checks configured AI providers without returning secrets.
GET /api/v1/infra/status returns AI-server sidecar telemetry (GPU, containers and vLLM live metrics) when configured.
GET /health/detail returns PostgreSQL, provider, queue, error, throughput and infra components for Portal admin/health.
GET /healthz returns process health.
GET /readyz checks PostgreSQL readiness.
Built-in workers expose unauthenticated endpoints for Kubernetes probes on WORKER_HTTP_PORT: GET /healthz, GET /readyz and GET /worker/status.

All /api/v1/* endpoints require Authorization: Bearer <AI_SERVICE_TOKEN> when AI_SERVICE_TOKEN is configured. Health and readiness endpoints stay open for Kubernetes probes.
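A client-side sketch of the bearer-token rule, using only the Python standard library. The host name ai-service, the helper name and the request body are assumptions for illustration; the request is built but not sent.

```python
import json
import urllib.request
from typing import Optional


def build_create_job_request(base_url: str, job: dict,
                             token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/jobs request."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when AI_SERVICE_TOKEN is configured on the service.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/jobs",
        data=json.dumps(job).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_create_job_request(
    "http://ai-service:8080/",  # hypothetical in-cluster address
    {"task_type": "llm_chat", "input": {"user": "ping"}},
    token="example-token",
)
print(request.get_method(), request.full_url)
```

Sending it is then a matter of passing the request to urllib.request.urlopen with a timeout.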

Retry policy

Workers store a normalized error_code on failed jobs. AI Service requeues only explicitly retryable categories while attempts remain.

Category	Retry	Delay
`provider_unavailable`, `model_unavailable`, `provider_error`, `dependency_error`, `timeout`, `storage_error`, `stale_worker`	yes	30s
`bad_response`, `transcript_hallucination`, `transcript_incomplete`, `internal_error`, `unknown`	yes	2m
`bad_audio`, `bad_input`, `context_length`, `unsupported_task`, `cancelled`	no	-

Domain services may still expose manual retry for terminal errors after the underlying data or prompt is corrected.
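The table can be read as a small decision function. In this sketch the category sets and delays are copied from the table, while the function name and the max_attempts parameter are assumptions, since the attempt limit is not documented here.

```python
from datetime import timedelta
from typing import Optional

FAST_RETRY = {
    "provider_unavailable", "model_unavailable", "provider_error",
    "dependency_error", "timeout", "storage_error", "stale_worker",
}
SLOW_RETRY = {
    "bad_response", "transcript_hallucination", "transcript_incomplete",
    "internal_error", "unknown",
}


def retry_delay(error_code: str, attempts: int, max_attempts: int) -> Optional[timedelta]:
    """Return the requeue delay, or None when the job should stay failed."""
    if attempts >= max_attempts:
        return None  # no attempts remain
    if error_code in FAST_RETRY:
        return timedelta(seconds=30)
    if error_code in SLOW_RETRY:
        return timedelta(minutes=2)
    # Terminal categories (bad_audio, bad_input, ...) and any code that is
    # not explicitly listed as retryable are never requeued automatically.
    return None


print(retry_delay("timeout", attempts=1, max_attempts=3))
print(retry_delay("bad_input", attempts=1, max_attempts=3))
```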

Result schemas

AI Service result payloads are versioned with schema_version. Consumers should ignore unknown fields and reject a payload only when its schema name is one they do not support.

Current schemas:

ai.chat_result.v1: {schema_version, content, model, usage?, duration_ms}.
ai.transcription_result.v1: {schema_version, provider?, model?, attempts?, language, segments, duration_ms}.

New optional fields may be added to a v1 schema without a breaking change. Breaking shape changes require a new schema name.
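A consumer following these rules might parse results as below. The required and optional fields mirror the schema list above; the function name and the choice to raise ValueError are assumptions for the example.

```python
SCHEMAS = {
    "ai.chat_result.v1": {
        "required": ("content", "model", "duration_ms"),
        "optional": ("usage",),
    },
    "ai.transcription_result.v1": {
        "required": ("language", "segments", "duration_ms"),
        "optional": ("provider", "model", "attempts"),
    },
}


def parse_result(result: dict) -> dict:
    """Keep known fields of a supported schema; reject unsupported schema names."""
    schema = result.get("schema_version")
    spec = SCHEMAS.get(schema)
    if spec is None:
        raise ValueError(f"unsupported result schema: {schema!r}")

    missing = [name for name in spec["required"] if name not in result]
    if missing:
        raise ValueError(f"{schema} result is missing fields: {missing}")

    # Unknown fields are dropped silently, so new optional fields added
    # to a v1 schema do not break this consumer.
    known = spec["required"] + spec["optional"]
    parsed = {name: result[name] for name in known if name in result}
    parsed["schema_version"] = schema
    return parsed


chat = parse_result({
    "schema_version": "ai.chat_result.v1",
    "content": "{\"label\": \"spam\"}",
    "model": "qwen2.5-14b",
    "duration_ms": 840,
    "some_future_field": True,  # ignored
})
print(chat)
```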

Configuration

HTTP_HOST, default 0.0.0.0
HTTP_PORT, default 8080
DATABASE_URL, required
MIGRATE_ON_START, default true
AI_SERVICE_TOKEN, optional bearer token for service-to-service API calls
LLM_BASE_URL, primary OpenAI-compatible LLM endpoint
LLM_API_KEY, primary LLM API key
LLM_MODEL, default qwen2.5-14b
LLM_TIMEOUT, default 5m
AUDIO_TRANSCRIPTION_BASE_URL, OpenAI-compatible transcription endpoint
AUDIO_TRANSCRIPTION_MODEL, default openai/whisper-large-v3
AUDIO_TRANSCRIPTION_API_KEY, optional bearer token; falls back to AUDIO_LLM_API_KEY, then LLM_API_KEY
AUDIO_TRANSCRIPTION_PROMPT, transcription instruction
WORKER_ID, default hostname
WORKER_HTTP_HOST, default 0.0.0.0
WORKER_HTTP_PORT, default 8081
WORKER_POLL_INTERVAL, default 2s
WORKER_CLAIM_LIMIT, default 4
WORKER_LEASE_TIMEOUT, default 15m
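For illustration, the worker defaults and the transcription key fallback could be read from the environment as follows. This is a sketch, not the service's loader: the function names are hypothetical, and the duration parser assumes values of the simple form number plus unit (2s, 5m, 15m), which is a guess based on the defaults listed above.

```python
import os
import re
import socket
from datetime import timedelta
from typing import Mapping, Optional

_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(value: str) -> timedelta:
    """Parse simple durations such as '2s', '5m' or '15m' (assumed format)."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", value.strip())
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def load_worker_config(env: Mapping[str, str] = os.environ) -> dict:
    """Apply the documented defaults to the worker settings."""
    return {
        "worker_id": env.get("WORKER_ID") or socket.gethostname(),
        "http_host": env.get("WORKER_HTTP_HOST", "0.0.0.0"),
        "http_port": int(env.get("WORKER_HTTP_PORT", "8081")),
        "poll_interval": parse_duration(env.get("WORKER_POLL_INTERVAL", "2s")),
        "claim_limit": int(env.get("WORKER_CLAIM_LIMIT", "4")),
        "lease_timeout": parse_duration(env.get("WORKER_LEASE_TIMEOUT", "15m")),
    }


def transcription_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """AUDIO_TRANSCRIPTION_API_KEY, then AUDIO_LLM_API_KEY, then LLM_API_KEY."""
    for name in ("AUDIO_TRANSCRIPTION_API_KEY", "AUDIO_LLM_API_KEY", "LLM_API_KEY"):
        if env.get(name):
            return env[name]
    return None


config = load_worker_config({"WORKER_ID": "worker-a", "WORKER_CLAIM_LIMIT": "8"})
print(config)
```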

Current telephony pipeline

telephony now uses AI Service as the only AI execution path:

transcription turns call audio into segments.
transcript_summary creates a detailed Russian call summary.
call_analysis runs tags and negotiation rules against the summary.
