RAG (Retrieval-Augmented Generation)

Managed vector search and grounded LLM generation. Upload documents, search semantically, and generate responses grounded in your data — all through a single API.

How It Works

1. Create a collection (namespace for documents)
2. Upload documents → auto-chunked and embedded
3. Search: vector / hybrid / keyword retrieval
4. Query: search + LLM generation in one call

Collections

Collections group related documents. Each collection uses a configurable embedding model and chunking strategy.

curl https://api.agilecloud.ai/api/v1/rag/collections \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "knowledge-base",
    "description": "Internal documentation",
    "embedding_model": "text-embedding-3-small",
    "chunk_size": 512,
    "chunk_overlap": 50
  }'

Uploading Documents

Documents are auto-chunked, embedded, and indexed. Supported formats: plain text, Markdown, PDF, HTML.

curl https://api.agilecloud.ai/api/v1/rag/collections/coll_abc123/documents \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [
      {
        "content": "ACAI provides managed inference with compliance built-in...",
        "metadata": {"source": "docs", "section": "overview"}
      }
    ]
  }'

Search

Search across documents using vector similarity, keyword matching, or hybrid (combined) retrieval.

curl https://api.agilecloud.ai/v1/rag/search \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "coll_abc123",
    "query": "How does autoscaling work?",
    "search_type": "hybrid",
    "top_k": 5
  }'

Search Type	Description
vector	Cosine similarity on embeddings
keyword	BM25 keyword matching
hybrid	Combined vector + keyword (reciprocal rank fusion)

Query (Search + Generate)

One-step retrieval-augmented generation. Searches your collection, constructs a grounded prompt, and streams a response from the LLM.

curl https://api.agilecloud.ai/v1/rag/query \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "coll_abc123",
    "query": "How does model autoscaling work?",
    "model": "gpt-4o-mini",
    "top_k": 5,
    "stream": true
  }'

The response includes the generated text plus the source chunks used for grounding, so you can display citations.

Endpoints

Method	Path	Description
POST	/api/v1/rag/collections	Create collection
GET	/api/v1/rag/collections	List collections
POST	/api/v1/rag/collections/{id}/documents	Upload documents
GET	/api/v1/rag/collections/{id}/documents	List documents
POST	/v1/rag/search	Vector / hybrid / keyword search
POST	/v1/rag/query	Search + LLM generation
GET	/api/v1/rag/usage	Storage usage and tier cap

Tier Availability

RAG is available on Business and above. Storage limits scale by tier. Embedding and LLM tokens used during ingestion and query are billed at standard rates.