# Semantic Cache

Deduplicate similar requests using embedding-based similarity matching. Cache hits skip inference entirely, returning the stored response with near-zero latency and no inference cost.
## How It Works
When a request arrives, DirectAI computes a normalized embedding of the input. If a cached entry exists within the configured similarity threshold, the cached response is returned immediately. Otherwise the request proceeds to inference and the result is cached.
```
Request → Normalize → Embed → Similarity Search
            ├── Hit  (≥ threshold) → Return cached response (0ms inference)
            └── Miss (< threshold) → Run inference → Cache result → Return
```
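The hit/miss decision above can be sketched in a few lines. This is illustrative client-side logic, not the DirectAI implementation; the `cosine` and `lookup` helpers and the in-memory cache shape are assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(cache, query_embedding, threshold=0.95):
    """Return the best cached response at or above the threshold, else None (miss)."""
    best_score, best_response = 0.0, None
    for entry_embedding, response in cache:
        score = cosine(entry_embedding, query_embedding)
        if score >= threshold and score > best_score:
            best_score, best_response = score, response
    return best_response

cache = [([1.0, 0.0], "cached answer")]
print(lookup(cache, [0.99, 0.01]))  # near-duplicate embedding → hit
print(lookup(cache, [0.0, 1.0]))    # orthogonal embedding → None (miss)
```

On a miss (`None`), the caller would run inference and append the new `(embedding, response)` pair to the cache.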
## Configuration
Configure the cache via the management API or the dashboard.
```bash
curl https://api.agilecloud.ai/api/v1/cache/config \
  -X PATCH \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600,
    "max_entries": 10000
  }'
```

| Parameter | Default | Description |
|---|---|---|
| `enabled` | `false` | Enable or disable the semantic cache |
| `similarity_threshold` | `0.95` | Minimum cosine similarity for a hit (0.0–1.0) |
| `ttl_seconds` | `3600` | Time-to-live for cached entries, in seconds |
| `max_entries` | `10000` | Maximum number of cached entries before eviction |
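To make the `ttl_seconds` and `max_entries` semantics concrete, here is a toy model of a bounded TTL cache. The actual eviction policy is not documented; this sketch assumes oldest-entry-first eviction once `max_entries` is exceeded, which may differ from DirectAI's behavior.

```python
import time
from collections import OrderedDict

class TtlCache:
    def __init__(self, ttl_seconds=3600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self._entries[key] = (now + self.ttl, value)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # assumed: evict oldest insert

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self._entries.get(key)
        if item is None or item[0] <= now:
            return None  # missing or expired past its TTL
        return item[1]

cache = TtlCache(ttl_seconds=10, max_entries=2)
cache.put("a", 1, now=0)
cache.put("b", 2, now=0)
cache.put("c", 3, now=0)       # third entry exceeds max_entries → "a" evicted
print(cache.get("a", now=1))   # None (evicted)
print(cache.get("b", now=1))   # 2
print(cache.get("b", now=11))  # None (expired after ttl_seconds)
```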
## Cache Statistics
```bash
curl https://api.agilecloud.ai/api/v1/cache/stats \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Response:

```json
{
  "total_requests": 15420,
  "cache_hits": 4812,
  "cache_misses": 10608,
  "hit_rate": 0.312,
  "estimated_savings_usd": 24.50,
  "avg_hit_latency_ms": 2.1,
  "avg_miss_latency_ms": 340.5
}
```

## Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/cache/stats | Hit/miss statistics |
| GET | /api/v1/cache/config | Get configuration |
| PATCH | /api/v1/cache/config | Update configuration |
| GET | /api/v1/cache/entries | List cached entries |
| POST | /api/v1/cache/invalidate | Invalidate by model or hash |
| DELETE | /api/v1/cache/flush | Flush all entries |
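The derived fields in the example stats response above can be sanity-checked client-side. The counts and hit rate come straight from the payload; the blended-latency formula is my own, not something the API returns.

```python
# Values copied from the example /api/v1/cache/stats response.
stats = {
    "total_requests": 15420,
    "cache_hits": 4812,
    "cache_misses": 10608,
    "hit_rate": 0.312,
    "avg_hit_latency_ms": 2.1,
    "avg_miss_latency_ms": 340.5,
}

# Hits and misses partition all requests, and hit_rate is their ratio.
assert stats["cache_hits"] + stats["cache_misses"] == stats["total_requests"]
assert round(stats["cache_hits"] / stats["total_requests"], 3) == stats["hit_rate"]

# Average latency across all requests, weighted by hit rate (assumed formula):
avg_ms = (stats["hit_rate"] * stats["avg_hit_latency_ms"]
          + (1 - stats["hit_rate"]) * stats["avg_miss_latency_ms"])
print(round(avg_ms, 1))  # ≈ 234.9 ms blended latency at a 31.2% hit rate
```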
## Request Normalization
Before the cache key is computed, requests are normalized: whitespace is collapsed and system prompts are canonicalized. Sampling parameters such as `temperature` and `max_tokens` are also part of the cache key, so two requests with identical content but different `temperature` values are treated as distinct cache entries.
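A minimal sketch of this keying rule, assuming whitespace collapse and a SHA-256 hash over the canonical fields. The exact canonicalization and hash DirectAI uses are not specified; the `cache_key` helper and its parameters are illustrative.

```python
import hashlib
import json

def cache_key(prompt, system="", temperature=1.0, max_tokens=256):
    # Collapse runs of whitespace in the prompt and system prompt.
    normalized_prompt = " ".join(prompt.split())
    canonical_system = " ".join(system.split())
    # Sampling parameters join the key material, so changing them misses.
    material = json.dumps({
        "system": canonical_system,
        "prompt": normalized_prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }, sort_keys=True)
    return hashlib.sha256(material.encode()).hexdigest()

# Same content, different whitespace → same key:
assert cache_key("hello   world") == cache_key("hello world")
# Same content, different temperature → different keys:
assert cache_key("hello world", temperature=0.2) != cache_key("hello world", temperature=0.9)
```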
## Tier Availability
The semantic cache is available on the Pro tier and above. On the Free tier, cache endpoints return `403` with an upgrade prompt.