Semantic Cache

Deduplicate similar requests using embedding-based similarity matching. Cache hits skip inference entirely, returning in milliseconds at zero inference cost.

How It Works

When a request arrives, DirectAI computes a normalized embedding of the input. If a cached entry exists within the configured similarity threshold, the cached response is returned immediately. Otherwise the request proceeds to inference and the result is cached.

Request → Normalize → Embed → Similarity Search
  ├── Hit (≥ threshold)  → Return cached response (0ms inference)
  └── Miss (< threshold) → Run inference → Cache result → Return

Configuration

Configure the cache via the management API or the dashboard.

curl https://api.agilecloud.ai/api/v1/cache/config \
  -X PATCH \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600,
    "max_entries": 10000
  }'
Parameter             Default   Description
enabled               false     Enable/disable semantic cache
similarity_threshold  0.95      Cosine similarity threshold (0.0–1.0)
ttl_seconds           3600      Time-to-live for cached entries, in seconds
max_entries           10000     Maximum cached entries before eviction

Cache Statistics

curl https://api.agilecloud.ai/api/v1/cache/stats \
  -H "Authorization: Bearer YOUR_API_KEY"

# Response
{
  "total_requests": 15420,
  "cache_hits": 4812,
  "cache_misses": 10608,
  "hit_rate": 0.312,
  "estimated_savings_usd": 24.50,
  "avg_hit_latency_ms": 2.1,
  "avg_miss_latency_ms": 340.5
}
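As a sanity check on the figures above, `hit_rate` is simply `cache_hits / total_requests`, and the expected per-request latency blends hit and miss latencies by the hit rate. A short sketch using the example response:

```python
# Stats taken from the example response above.
stats = {
    "total_requests": 15420,
    "cache_hits": 4812,
    "cache_misses": 10608,
    "avg_hit_latency_ms": 2.1,
    "avg_miss_latency_ms": 340.5,
}

# Hits and misses partition all requests.
assert stats["cache_hits"] + stats["cache_misses"] == stats["total_requests"]

# hit_rate = cache_hits / total_requests
hit_rate = stats["cache_hits"] / stats["total_requests"]
print(round(hit_rate, 3))  # 0.312, matching the reported hit_rate

# Expected latency per request, weighted by hit rate.
expected_ms = (hit_rate * stats["avg_hit_latency_ms"]
               + (1 - hit_rate) * stats["avg_miss_latency_ms"])
print(round(expected_ms, 1))  # ~234.9 ms vs. 340.5 ms with no cache
```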

Endpoints

Method   Path                       Description
GET      /api/v1/cache/stats        Hit/miss statistics
GET      /api/v1/cache/config       Get configuration
PATCH    /api/v1/cache/config       Update configuration
GET      /api/v1/cache/entries      List cached entries
POST     /api/v1/cache/invalidate   Invalidate by model or hash
DELETE   /api/v1/cache/flush        Flush all entries

Request Normalization

Before computing the cache key, requests are normalized: whitespace is collapsed, system prompts are canonicalized, and parameters like temperature and max_tokens are included in the cache key. Two requests with identical content but different temperature values are treated as different cache keys.
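A minimal sketch of that key derivation. The exact normalization rules and hash are DirectAI internals; this version only collapses whitespace and folds the model and sampling parameters into a SHA-256 digest, and the model name `gpt-x` is a placeholder.

```python
import hashlib
import json
import re


def cache_key(prompt: str, *, model: str, temperature: float, max_tokens: int) -> str:
    # Collapse whitespace runs so trivially reformatted prompts match.
    normalized = re.sub(r"\s+", " ", prompt).strip()
    # Parameters that change the output are part of the key material.
    material = json.dumps(
        {"prompt": normalized, "model": model,
         "temperature": temperature, "max_tokens": max_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()


a = cache_key("Hello,   world", model="gpt-x", temperature=0.7, max_tokens=256)
b = cache_key("Hello, world",   model="gpt-x", temperature=0.7, max_tokens=256)
c = cache_key("Hello, world",   model="gpt-x", temperature=0.9, max_tokens=256)

print(a == b)  # True: whitespace differences normalize away
print(a == c)  # False: different temperature, different cache key
```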

Tier Availability

Semantic cache is available on the Pro tier and above. For Free-tier users, cache endpoints return 403 Forbidden with an upgrade prompt.