Deduplicate similar requests using embedding-based similarity matching. Cache hits skip inference entirely, returning in milliseconds at no inference cost.
When a request arrives, ACAI computes a normalized embedding of the input. If a cached entry exists within the configured similarity threshold, the cached response is returned immediately. Otherwise the request proceeds to inference and the result is cached.
```
Request → Normalize → Embed → Similarity Search
  ├── Hit (≥ threshold)  → Return cached response (0ms inference)
  └── Miss (< threshold) → Run inference → Cache result → Return
```
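The lookup step above can be sketched as follows. This is an illustrative model, not ACAI's implementation; `normalize` and `lookup` are hypothetical names, and the cache is shown as a flat list of (unit embedding, response) pairs:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def lookup(query_emb, cache, threshold=0.95):
    """cache: list of (unit embedding, cached response) pairs.
    Returns the cached response on a hit, or None on a miss."""
    q = normalize(query_emb)
    for key_emb, response in cache:
        # Both vectors are unit-normalized, so dot product = cosine similarity.
        if float(np.dot(q, key_emb)) >= threshold:
            return response  # hit: skip inference entirely
    return None              # miss: caller runs inference, then caches the result
```

A production index would use an approximate nearest-neighbor structure rather than a linear scan, but the hit/miss decision against the threshold is the same.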
Configure the cache via the management API or the dashboard.
```bash
curl https://api.agilecloud.ai/api/v1/cache/config \
  -X PATCH \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600,
    "max_entries": 10000
  }'
```

| Parameter | Default | Description |
|---|---|---|
| enabled | false | Enable/disable semantic cache |
| similarity_threshold | 0.95 | Cosine similarity threshold (0.0–1.0) |
| ttl_seconds | 3600 | Time-to-live for cached entries |
| max_entries | 10000 | Maximum cached entries before eviction |
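The same PATCH can be issued programmatically. A stdlib-only sketch (the endpoint and parameters come from the table above; `build_config_patch` is a hypothetical helper, not part of an official SDK):

```python
import json
import urllib.request

API_URL = "https://api.agilecloud.ai/api/v1/cache/config"

def build_config_patch(api_key, **settings):
    """Build a PATCH request carrying the given cache config fields as JSON."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(settings).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

# Sending is one line once the request is built:
# with urllib.request.urlopen(build_config_patch(key, similarity_threshold=0.9)) as r:
#     config = json.load(r)
```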
```bash
curl https://api.agilecloud.ai/api/v1/cache/stats \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Example response:

```json
{
  "total_requests": 15420,
  "cache_hits": 4812,
  "cache_misses": 10608,
  "hit_rate": 0.312,
  "estimated_savings_usd": 24.50,
  "avg_hit_latency_ms": 2.1,
  "avg_miss_latency_ms": 340.5
}
```

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/cache/stats | Hit/miss statistics |
| GET | /api/v1/cache/config | Get configuration |
| PATCH | /api/v1/cache/config | Update configuration |
| GET | /api/v1/cache/entries | List cached entries |
| POST | /api/v1/cache/invalidate | Invalidate by model or hash |
| DELETE | /api/v1/cache/flush | Flush all entries |
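The stats fields compose in the obvious way: hit rate is cache_hits / total_requests, and the blended per-request latency is the hit-rate-weighted average of the hit and miss latencies. A quick sanity check against the example response (`summarize` is an illustrative helper, not part of the API):

```python
def summarize(stats):
    """Derive hit rate and expected per-request latency from /cache/stats fields."""
    hit_rate = stats["cache_hits"] / stats["total_requests"]
    blended_ms = (hit_rate * stats["avg_hit_latency_ms"]
                  + (1 - hit_rate) * stats["avg_miss_latency_ms"])
    return round(hit_rate, 3), round(blended_ms, 1)

stats = {
    "total_requests": 15420,
    "cache_hits": 4812,
    "avg_hit_latency_ms": 2.1,
    "avg_miss_latency_ms": 340.5,
}
# With the example numbers, hit rate is 0.312 and blended latency ≈ 235 ms,
# versus 340.5 ms with the cache disabled.
```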
Before computing the cache key, requests are normalized: whitespace is collapsed, system prompts are canonicalized, and sampling parameters such as temperature and max_tokens are folded into the key. Two requests with identical content but different temperature values therefore map to different cache keys.
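A minimal sketch of that normalization step. The specific canonicalization rules here (regex whitespace collapse, lowercasing the system prompt) are illustrative assumptions, not ACAI's exact rules:

```python
import hashlib
import json
import re

def cache_key(system_prompt, user_input, temperature, max_tokens):
    """Collapse whitespace, canonicalize the system prompt, and fold
    sampling parameters into the key so different settings never collide."""
    collapse = lambda s: re.sub(r"\s+", " ", s).strip()
    material = json.dumps({
        "system": collapse(system_prompt).lower(),  # illustrative canonicalization
        "input": collapse(user_input),
        "temperature": temperature,
        "max_tokens": max_tokens,
    }, sort_keys=True)
    return hashlib.sha256(material.encode()).hexdigest()
```

Because temperature and max_tokens are part of the hashed material, two otherwise-identical requests with different temperatures produce different keys, matching the behavior described above.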
Semantic cache is available on Pro plans and above. On the Free tier, cache endpoints return 403 with an upgrade prompt.