Architecture

DirectAI is a stateless compliance gateway that sits between clients and inference backends. It handles auth, compliance, and observability — inference itself runs on managed serverless endpoints or on customer-registered providers (BYOB, Bring Your Own Backend).
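Because the gateway speaks the OpenAI wire format, any OpenAI-compatible client works by pointing it at the gateway's base URL with a DirectAI API key. A minimal stdlib sketch of the request shape (the gateway URL, key, and model name below are placeholders, not real endpoints):

```python
import json
from urllib.request import Request

# Hypothetical values — substitute your gateway URL and API key.
GATEWAY = "https://gateway.example.com/v1"
API_KEY = "da-example-key"

def chat_request(model: str, prompt: str) -> Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return Request(
        f"{GATEWAY}/chat/completions",
        data=body.encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",  # validated by auth middleware
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

In practice you would simply configure the official OpenAI SDK with `base_url` set to the gateway rather than building requests by hand.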

High-Level Overview

┌──────────────┐     ┌──────────────────────┐     ┌──────────────────┐
│    Client    │────▶│      API Server      │────▶│  Managed         │
│ (OpenAI SDK) │     │   (FastAPI proxy)    │     │  Inference (MaaS)│
└──────────────┘     │                      │     └──────────────────┘
                     │  • Auth (Bearer)     │
                     │  • Rate limiting     │────▶┌──────────────────┐
                     │  • Guardrails / PII  │     │  BYOB Providers  │
                     │  • Audit logging     │     │  (OpenAI/Claude) │
                     │  • Semantic cache    │     └──────────────────┘
                     │  • Model routing     │
                     │  • Usage metering    │────▶┌──────────────────┐
                     │  • Billing (Stripe)  │     │  Ollama          │
                     └──────────────────────┘     │  (local dev)     │
                                                  └──────────────────┘

Request Flow

  1. Client sends request to API server with Bearer token
  2. Auth middleware validates API key (SHA-256 lookup with TTL cache)
  3. Rate limiter checks token-bucket quota for the key
  4. Correlation ID assigned (or propagated from X-Request-ID header)
  5. Guardrails evaluate request (content safety, PII, injection)
  6. Model registry resolves model name/alias to backend service
  7. Request proxied to inference backend via httpx HTTP/2
  8. Response streamed back through guardrails check
  9. Usage metered (tokens counted, events queued for Stripe)
  10. Audit event written
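Step 3's per-key quota check can be sketched as a classic token bucket (a minimal illustration with hypothetical rates and names, not DirectAI's actual implementation):

```python
import time

class TokenBucket:
    """Per-API-key bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per hashed API key (illustrative limits).
buckets: dict[str, TokenBucket] = {}

def check_quota(key_hash: str) -> bool:
    bucket = buckets.setdefault(key_hash, TokenBucket(rate=10, capacity=20))
    return bucket.allow()
```

A production version would hold the buckets in shared storage (e.g. Redis) so the limit survives across API server pods, but the refill arithmetic is the same.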

Inference Engines

| Engine | Modality | Status | Notes |
|---|---|---|---|
| Managed Catalog (Serverless) | Chat, Embeddings, Transcription | Active | Serverless model-as-a-service — zero idle cost, auto-scaling, tier-gated catalog. |
| BYOB (Bring Your Own Backend) | All (provider-dependent) | Active | Customer-registered providers (OpenAI, Anthropic, Azure AI Foundry, etc.). Full compliance layer applied. |
| Ollama | LLMs (local dev) | Dev only | Local development backend via backendUrl override. |
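Step 6 of the request flow resolves a client-supplied model name or alias to one of these engines. A minimal sketch of such a registry (the entries, URLs, and field names here are illustrative, not the actual catalog):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    engine: str    # "managed", "byob", or "ollama"
    base_url: str
    model_id: str  # name the backend itself expects

# Illustrative registry: public names on the left, resolved backends on the right.
REGISTRY: dict[str, Backend] = {
    "gpt-4o": Backend("byob", "https://api.openai.com/v1", "gpt-4o"),
    "llama3": Backend("managed", "https://maas.example.com/v1", "llama-3-8b-instruct"),
}
ALIASES: dict[str, str] = {"default": "llama3"}

def resolve(name: str) -> Backend:
    """Follow one level of aliasing, then look up the concrete backend."""
    name = ALIASES.get(name, name)
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown model: {name}") from None
```

For local development, a `backendUrl` override would replace the resolved `base_url` with the Ollama address while keeping the same lookup path.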

Infrastructure

| Component | Service | Purpose |
|---|---|---|
| Inference | Serverless MaaS | Managed model catalog — zero idle cost, auto-scaling |
| Orchestration | AKS | API server pods, web frontend, horizontal autoscaling |
| Audit Storage | Blob Storage | Tamper-proof audit logs, compliance report exports |
| Images | ACR | API server and web frontend container images |
| Autoscaling | HPA | CPU/memory-based horizontal pod autoscaling |
| Observability | OpenTelemetry + Prometheus | Metrics, logs, distributed tracing |
| Secrets | Key Vault | API keys, connection strings, provider secrets |
| Database | PostgreSQL Flexible Server | Users, sessions, API keys, usage records |

Customer Isolation

Each customer is deployed into its own cloud subscription (a "stamp"), which completely isolates billing, networking, identity, and blast radius. Resources are deployed via Infrastructure-as-Code templates.

  • Separate VNet per customer stamp
  • Dedicated Key Vault with RBAC authorization
  • Per-stamp Log Analytics workspace
  • Two managed identities per stamp (control plane + kubelet) with least-privilege RBAC

Multi-Cloud Design

Multi-cloud portability is designed in from day one: every cloud-specific component sits behind an interface, and no provider-specific assumptions are hardcoded in the application layer.
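As an illustration of this pattern (the names here are hypothetical, not taken from the codebase), a secret store might be defined as an abstract interface with provider-specific implementations hidden behind it:

```python
from abc import ABC, abstractmethod

class SecretStore(ABC):
    """Provider-neutral secret access; application code depends only on this."""

    @abstractmethod
    def get(self, name: str) -> str:
        ...

class InMemorySecretStore(SecretStore):
    """Local/dev implementation; a KeyVaultSecretStore (hypothetical) would
    wrap the Azure Key Vault SDK behind the same interface."""

    def __init__(self, secrets: dict[str, str]):
        self._secrets = dict(secrets)

    def get(self, name: str) -> str:
        return self._secrets[name]

def provider_api_key(store: SecretStore, provider: str) -> str:
    # The application layer never imports a cloud SDK directly.
    return store.get(f"{provider}-api-key")
```

Swapping clouds then means adding one new `SecretStore` subclass and wiring it in at startup; callers such as `provider_api_key` are unchanged.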