# Semantic Cache

Deduplicate similar requests using embedding-based similarity matching. Cache hits skip inference entirely, returning the stored response with near-zero latency and no inference cost.
## How It Works
When a request arrives, DirectAI computes a normalized embedding of the input. If a cached entry exists within the configured similarity threshold, the cached response is returned immediately. Otherwise the request proceeds to inference and the result is cached.
```
Request → Normalize → Embed → Similarity Search
            ├── Hit  (≥ threshold) → Return cached response (0ms inference)
            └── Miss (< threshold) → Run inference → Cache result → Return
```
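The hit/miss decision above can be sketched in a few lines. This is illustrative client-side logic, not the DirectAI implementation; the `cosine` and `lookup` helpers and the in-memory cache shape are assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(cache, query_embedding, threshold=0.95):
    """Return the best cached response at or above the threshold, else None (miss)."""
    best_score, best_response = 0.0, None
    for entry_embedding, response in cache:
        score = cosine(entry_embedding, query_embedding)
        if score >= threshold and score > best_score:
            best_score, best_response = score, response
    return best_response

cache = [([1.0, 0.0], "cached answer")]
print(lookup(cache, [0.99, 0.01]))  # near-duplicate embedding → hit
print(lookup(cache, [0.0, 1.0]))    # orthogonal embedding → None (miss)
```

On a miss (`None`), the caller would run inference and append the new `(embedding, response)` pair to the cache.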
## Configuration
Configure the cache via the management API or the dashboard.
```bash
curl https://api.agilecloud.ai/api/v1/cache/config \
  -X PATCH \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600,
    "max_entries": 10000
  }'
```

| Parameter | Default | Description |
|---|---|---|
| `enabled` | `false` | Enable or disable the semantic cache |
| `similarity_threshold` | `0.95` | Minimum cosine similarity for a hit (0.0–1.0) |
| `ttl_seconds` | `3600` | Time-to-live for cached entries, in seconds |
| `max_entries` | `10000` | Maximum number of cached entries before eviction |
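To make the `ttl_seconds` and `max_entries` semantics concrete, here is a toy model of a bounded TTL cache. The actual eviction policy is not documented; this sketch assumes oldest-entry-first eviction once `max_entries` is exceeded, which may differ from DirectAI's behavior.

```python
import time
from collections import OrderedDict

class TtlCache:
    def __init__(self, ttl_seconds=3600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self._entries[key] = (now + self.ttl, value)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # assumed: evict oldest insert

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self._entries.get(key)
        if item is None or item[0] <= now:
            return None  # missing or expired past its TTL
        return item[1]

cache = TtlCache(ttl_seconds=10, max_entries=2)
cache.put("a", 1, now=0)
cache.put("b", 2, now=0)
cache.put("c", 3, now=0)       # third entry exceeds max_entries → "a" evicted
print(cache.get("a", now=1))   # None (evicted)
print(cache.get("b", now=1))   # 2
print(cache.get("b", now=11))  # None (expired after ttl_seconds)
```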
## Cache Statistics
```bash
curl https://api.agilecloud.ai/api/v1/cache/stats \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Response:

```json
{
  "total_requests": 15420,
  "cache_hits": 4812,
  "cache_misses": 10608,
  "hit_rate": 0.312,
  "estimated_savings_usd": 24.50,
  "avg_hit_latency_ms": 2.1,
  "avg_miss_latency_ms": 340.5
}
```

## Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/cache/stats | Hit/miss statistics |
| GET | /api/v1/cache/config | Get configuration |
| PATCH | /api/v1/cache/config | Update configuration |
| GET | /api/v1/cache/entries | List cached entries |
| POST | /api/v1/cache/invalidate | Invalidate by model or hash |
| DELETE | /api/v1/cache/flush | Flush all entries |
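The derived fields in the example stats response above can be sanity-checked client-side. The counts and hit rate come straight from the payload; the blended-latency formula is my own, not something the API returns.

```python
# Values copied from the example /api/v1/cache/stats response.
stats = {
    "total_requests": 15420,
    "cache_hits": 4812,
    "cache_misses": 10608,
    "hit_rate": 0.312,
    "avg_hit_latency_ms": 2.1,
    "avg_miss_latency_ms": 340.5,
}

# Hits and misses partition all requests, and hit_rate is their ratio.
assert stats["cache_hits"] + stats["cache_misses"] == stats["total_requests"]
assert round(stats["cache_hits"] / stats["total_requests"], 3) == stats["hit_rate"]

# Average latency across all requests, weighted by hit rate (assumed formula):
avg_ms = (stats["hit_rate"] * stats["avg_hit_latency_ms"]
          + (1 - stats["hit_rate"]) * stats["avg_miss_latency_ms"])
print(round(avg_ms, 1))  # ≈ 234.9 ms blended latency at a 31.2% hit rate
```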
## Request Normalization
Before the cache key is computed, requests are normalized: whitespace is collapsed and system prompts are canonicalized. Sampling parameters such as `temperature` and `max_tokens` are also part of the cache key, so two requests with identical content but different `temperature` values are treated as distinct cache entries.
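A minimal sketch of this keying rule, assuming whitespace collapse and a SHA-256 hash over the canonical fields. The exact canonicalization and hash DirectAI uses are not specified; the `cache_key` helper and its parameters are illustrative.

```python
import hashlib
import json

def cache_key(prompt, system="", temperature=1.0, max_tokens=256):
    # Collapse runs of whitespace in the prompt and system prompt.
    normalized_prompt = " ".join(prompt.split())
    canonical_system = " ".join(system.split())
    # Sampling parameters join the key material, so changing them misses.
    material = json.dumps({
        "system": canonical_system,
        "prompt": normalized_prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }, sort_keys=True)
    return hashlib.sha256(material.encode()).hexdigest()

# Same content, different whitespace → same key:
assert cache_key("hello   world") == cache_key("hello world")
# Same content, different temperature → different keys:
assert cache_key("hello world", temperature=0.2) != cache_key("hello world", temperature=0.9)
```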
## Tier Availability
The semantic cache is available on the Pro tier and above. On the Free tier, cache endpoints return `403` with an upgrade prompt.