# AI Service

Technical AI job service for Portal workloads.

The first version owns only the AI job lifecycle and metrics. Business data
stays in domain services such as `telephony`, `monitoring-tg` and
`monitoring-pf`.

## Generic job contract

The service is intentionally domain-agnostic:

- `owner_service` names the caller, for example `telephony`, `monitoring-tg`,
  `monitoring-pf` or a future Portal module.
- `owner_ref` is the caller's stable object reference, for example
  `beeline/{call_id}` or `channel/{message_id}`.
- `task_type` describes the technical task class, for example
  `transcribe`, `call_analysis`, `tg_analysis`, `pf_competitor_analysis`.
- `model_profile` selects a runtime profile, for example `whisperx`,
  `qwen2.5-14b`, `vision`, or a future provider profile.
- `input` and `result` are JSON payloads owned by the caller and worker.

This keeps the AI service a piece of shared infrastructure rather than a
telephony-specific service.
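
As a rough illustration of the contract above, the sketch below assembles a job
body from the five documented fields. The flat JSON layout, the `build_job`
helper and its validation rules are assumptions for illustration; the actual
request schema of `POST /api/v1/jobs` is not spelled out in this README.

```python
import json


def build_job(owner_service, owner_ref, task_type, model_profile, input_payload):
    """Assemble a job body using the field names from the generic contract.

    Hypothetical helper: the real API may require additional fields.
    """
    routing = {
        "owner_service": owner_service,
        "owner_ref": owner_ref,
        "task_type": task_type,
        "model_profile": model_profile,
    }
    for name, value in routing.items():
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} must be a non-empty string")
    # `input` is an opaque JSON payload owned by the caller and the worker.
    return {**routing, "input": input_payload}


job = build_job(
    "telephony",
    "beeline/12345",  # illustrative call id
    "call_analysis",
    "qwen2.5-14b",
    {"user": "Summarize the call"},
)
print(json.dumps(job, indent=2))
```

Because the service never interprets `input`, the same helper would serve any
caller that fills in its own `owner_service` and `owner_ref`.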

## Built-in workers

The first built-in worker processes `llm_chat`, `chat_completion` and
`call_analysis` jobs whose `model_profile` equals `LLM_MODEL`.

Input can be either explicit messages:

```json
{
  "messages": [
    {"role": "system", "content": "Answer as JSON."},
    {"role": "user", "content": "Classify this text"}
  ],
  "max_tokens": 256
}
```

or compact `system` / `user` fields. The completed job result contains
`content`, `model`, `usage` and `duration_ms`.

`call_analysis` uses the same input contract as `llm_chat`; callers may include
domain metadata fields in `input`, but the worker only reads chat fields such as
`system`, `user`, `messages`, `max_tokens` and `response_format`.
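
One way to picture how the worker treats `input` is the sketch below: it keeps
only the chat fields and expands the compact form into `messages`. The function
name `normalize_chat_input` is hypothetical, and the rule that explicit
`messages` win when both forms are present is an assumption, not documented
behavior.

```python
PASSTHROUGH_FIELDS = ("max_tokens", "response_format")


def normalize_chat_input(payload):
    """Return a chat request built only from the chat fields of `payload`."""
    if "messages" in payload:
        # Assumption: explicit messages take precedence over system/user.
        messages = list(payload["messages"])
    else:
        messages = []
        if payload.get("system"):
            messages.append({"role": "system", "content": payload["system"]})
        if payload.get("user"):
            messages.append({"role": "user", "content": payload["user"]})
    if not messages:
        raise ValueError("input needs `messages` or `system`/`user` fields")

    request = {"messages": messages}
    for key in PASSTHROUGH_FIELDS:
        if key in payload:
            request[key] = payload[key]
    # Any other keys (domain metadata) are ignored on purpose.
    return request


compact = {
    "system": "Answer as JSON.",
    "user": "Classify this text",
    "max_tokens": 256,
    "call_id": "12345",  # domain metadata, not read by the worker
}
chat_request = normalize_chat_input(compact)
```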

`transcription` jobs can run several transcription providers in order for
temporary A/B comparison. The main `segments` field remains compatible with
telephony and contains the first successful provider result. The full comparison
is stored in `attempts` with `provider`, `model`, `status`, `text`, `segments`,
`duration_ms` and `error`.
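
The result shape described above could be produced roughly as follows. This is
a sketch, not the service's code: `run_transcription`, the provider callables
and the `ok` / `error` status values are assumptions; only the field names come
from this README.

```python
import time


def run_transcription(providers, audio):
    """Run every provider in order and record one attempt per provider.

    `providers` is an ordered list of (name, model, callable) tuples; each
    callable returns {"text": ..., "segments": [...]} or raises.
    """
    attempts = []
    segments = None
    for name, model, transcribe in providers:
        started = time.monotonic()
        attempt = {
            "provider": name, "model": model, "status": "ok",
            "text": "", "segments": [], "duration_ms": 0, "error": None,
        }
        try:
            result = transcribe(audio)
            attempt["text"] = result["text"]
            attempt["segments"] = result["segments"]
            if segments is None:
                # First successful provider feeds the main `segments` field.
                segments = result["segments"]
        except Exception as exc:  # keep going so later providers still run
            attempt["status"] = "error"
            attempt["error"] = str(exc)
        attempt["duration_ms"] = int((time.monotonic() - started) * 1000)
        attempts.append(attempt)
    return {"segments": segments or [], "attempts": attempts}
```

All providers run even after one succeeds, because the goal is comparison
rather than fallback; a production setup without A/B needs could stop at the
first success.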

Recommended comparison order:

1. `whisperx`
2. `qwen2-audio` (`Qwen/Qwen2-Audio-7B-Instruct`)
3. `voxtral-small` (`mistralai/Voxtral-Small-24B-2507`)

Qwen2-Audio and Voxtral are called through an OpenAI-compatible
`/v1/chat/completions` endpoint with `input_audio`; set their endpoint URLs only
after the models are actually exposed on the AI server.
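
For orientation, a request body for such an endpoint might look like the sketch
below. The content-part layout follows the OpenAI chat completions audio input
format; whether the service sends exactly this body, and the
`build_audio_chat_request` helper itself, are assumptions.

```python
import base64


def build_audio_chat_request(model, prompt, audio_bytes,
                             audio_format="wav", max_tokens=4096):
    """Build a /v1/chat/completions body carrying audio as `input_audio`."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,  # mirrors the AUDIO_LLM_MAX_TOKENS default
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "input_audio",
                        "input_audio": {"data": encoded, "format": audio_format},
                    },
                ],
            }
        ],
    }


body = build_audio_chat_request(
    "Qwen/Qwen2-Audio-7B-Instruct",
    "Transcribe the audio verbatim.",  # illustrative prompt text
    b"RIFF....WAVE",                   # placeholder bytes, not real audio
)
```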

AI-server compose snippets for these temporary comparison endpoints live in
`deploy/ai-server/docker-compose.audio.yml`. They are profile-gated because the
single GPU cannot keep the production text vLLM, two WhisperX instances,
Qwen2-Audio and Voxtral loaded at the same time:

- Qwen2-Audio endpoint: `http://10.2.3.5:8003`
- Voxtral endpoint: `http://10.2.3.5:8004`
- Start Qwen only:
  `docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile qwen-audio up -d qwen-audio`
- Start Voxtral only:
  `docker compose -f docker-compose.yml -f docker-compose.audio.yml --profile voxtral-small up -d voxtral-small`

## API

- `POST /api/v1/jobs` creates one job.
- `GET /api/v1/jobs` lists jobs with query filters.
- `POST /api/v1/jobs/batch` creates many jobs with shared defaults.
- `POST /api/v1/jobs/retry` retries failed/running jobs by filter.
- `POST /api/v1/jobs/cancel` cancels pending/running jobs by filter.
- `POST /api/v1/jobs/claim` atomically claims pending jobs for a worker.
- `GET /api/v1/jobs/{id}` returns technical job state and result.
- `POST /api/v1/jobs/{id}/complete` stores a successful job result.
- `POST /api/v1/jobs/{id}/fail` stores a failed job category and message.
- `POST /api/v1/jobs/{id}/retry` resets failed/running jobs to `pending`.
- `GET /api/v1/stats` returns queue and error counters.
- `GET /api/v1/providers/status` checks configured AI providers without
  returning secrets.
- `GET /api/v1/infra/status` returns AI-server sidecar telemetry
  (GPU, containers, vLLM and WhisperX live metrics) when configured.
- `GET /healthz` returns process health.
- `GET /readyz` checks PostgreSQL readiness.
- Built-in workers expose unauthenticated endpoints for Kubernetes on
  `WORKER_HTTP_PORT`: `GET /healthz`, `GET /readyz` and `GET /worker/status`.
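
The job endpoints above imply a small state machine, sketched here as a
transition table. The states `pending`, `running` and `failed` appear in this
README; the names `completed` and `cancelled`, and the rule that only running
jobs can be completed or failed, are assumptions.

```python
# action -> (states the action is allowed from, resulting state)
TRANSITIONS = {
    "claim": ({"pending"}, "running"),
    "complete": ({"running"}, "completed"),       # state name assumed
    "fail": ({"running"}, "failed"),
    "retry": ({"failed", "running"}, "pending"),
    "cancel": ({"pending", "running"}, "cancelled"),  # state name assumed
}


def apply_action(state, action):
    """Return the next job state, or raise if the action is not allowed."""
    allowed, target = TRANSITIONS[action]
    if state not in allowed:
        raise ValueError(f"cannot {action} a job in state {state!r}")
    return target


state = "pending"
for action in ("claim", "fail", "retry", "claim", "complete"):
    state = apply_action(state, action)
```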

All `/api/v1/*` endpoints require `Authorization: Bearer <AI_SERVICE_TOKEN>`
when `AI_SERVICE_TOKEN` is configured. Health and readiness endpoints stay open
for Kubernetes probes.
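
A client-side sketch of this rule, using only the standard library: the bearer
header is attached to `/api/v1/*` paths and left off the probe endpoints. The
`api_request` helper and the `http://ai-service:8080` base URL are illustrative
assumptions.

```python
import json
import urllib.request


def api_request(base_url, path, token=None, body=None):
    """Build (but do not send) a request to the AI service."""
    headers = {"Accept": "application/json"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    # Only /api/v1/* is protected; /healthz and /readyz stay open.
    if token and path.startswith("/api/v1/"):
        headers["Authorization"] = f"Bearer {token}"
    method = "POST" if body is not None else "GET"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


stats = api_request("http://ai-service:8080", "/api/v1/stats", token="s3cret")
probe = api_request("http://ai-service:8080", "/healthz", token="s3cret")
# Sending would be: urllib.request.urlopen(stats, timeout=10)
```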

## Configuration

- `HTTP_HOST`, default `0.0.0.0`
- `HTTP_PORT`, default `8080`
- `DATABASE_URL`, required
- `MIGRATE_ON_START`, default `true`
- `AI_SERVICE_TOKEN`, optional bearer token for service-to-service API calls
- `LLM_BASE_URL`, primary OpenAI-compatible LLM endpoint
- `LLM_API_KEY`, primary LLM API key
- `LLM_MODEL`, default `qwen2.5-14b`
- `LLM_TIMEOUT`, default `5m`
- `TRANSCRIPTION_PROVIDERS`, default `whisperx`; comma-separated list of
  providers in execution order, for example
  `whisperx,qwen2-audio,voxtral-small`
- `WHISPERX_URL`, WhisperX endpoint for transcription jobs
- `QWEN_AUDIO_BASE_URL`, OpenAI-compatible endpoint for Qwen2-Audio
- `QWEN_AUDIO_MODEL`, default `Qwen/Qwen2-Audio-7B-Instruct`
- `QWEN_AUDIO_API_KEY`, optional bearer token for Qwen2-Audio; falls back to
  `AUDIO_LLM_API_KEY`, then `LLM_API_KEY`
- `VOXTRAL_BASE_URL`, OpenAI-compatible endpoint for Voxtral
- `VOXTRAL_MODEL`, default `mistralai/Voxtral-Small-24B-2507`
- `VOXTRAL_API_KEY`, optional bearer token for Voxtral; falls back to
  `AUDIO_LLM_API_KEY`, then `LLM_API_KEY`
- `AUDIO_LLM_PROMPT`, transcription instruction for audio LLM providers
- `AUDIO_LLM_MAX_TOKENS`, default `4096`
- `WORKER_ID`, default hostname
- `WORKER_HTTP_HOST`, default `0.0.0.0`
- `WORKER_HTTP_PORT`, default `8081`
- `WORKER_POLL_INTERVAL`, default `2s`
- `WORKER_CLAIM_LIMIT`, default `4`
- `WORKER_LEASE_TIMEOUT`, default `15m`

## Next integration step

`telephony` should first mirror low-risk analysis jobs into this service while
continuing to process them locally. Remote execution can then be enabled per
task type behind a feature flag.
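
The rollout could be expressed on the caller side roughly as below. This is
only a sketch of the idea: `plan_execution` and the two flag sets are
hypothetical and do not describe existing `telephony` code.

```python
def plan_execution(task_type, remote_enabled, mirrored):
    """Decide how the caller handles one task under the phased rollout.

    remote_enabled: task types whose feature flag switches to remote execution.
    mirrored: task types that are copied to the AI service but still run locally.
    """
    if task_type in remote_enabled:
        return {"run_local": False, "submit_job": True}
    if task_type in mirrored:
        return {"run_local": True, "submit_job": True}
    return {"run_local": True, "submit_job": False}


mirrored = {"call_analysis", "transcribe"}
remote_enabled = {"call_analysis"}  # flag flipped for one task type only

analysis_plan = plan_execution("call_analysis", remote_enabled, mirrored)
transcribe_plan = plan_execution("transcribe", remote_enabled, mirrored)
```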